nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/
The Unlicense

Build a data pipeline for running unlabelled papers through the labeller #45

Open bruffridge opened 3 years ago

bruffridge commented 3 years ago
  1. Get paper data for the ~13 million biology papers from the MAG API. See https://github.com/nasa-petal/PeTaL-labeller/issues/65 for more details about interfacing with the API.

This is the API request to get all biology papers from MAG API. https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Ty='0',Or(Composite(J.JN=='biomimetics'), Composite(F.FN=='biology')))&model=latest&count=10&offset=0&attributes=Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId

These will be the papers we run through our labeller. If 13 million papers is technically infeasible or difficult to process (@elkong can help weigh in on this), let me know and we can figure out how to filter this down further.
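Step 1 could be a thin wrapper around that evaluate call. A minimal stdlib-only sketch, assuming the key is passed in the `Ocp-Apim-Subscription-Key` header (the helper names and error handling are illustrative, not from this issue):

```python
import json
import urllib.parse
import urllib.request

MAG_EVALUATE_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"
BIOLOGY_EXPR = "And(Ty='0',Or(Composite(J.JN=='biomimetics'),Composite(F.FN=='biology')))"
ATTRIBUTES = "Id,DOI,Ti,VFN,F.FN,AA.AuId,AW,RId"

def build_params(count=10, offset=0):
    """Query parameters for one page of the evaluate request above."""
    return {
        "expr": BIOLOGY_EXPR,
        "model": "latest",
        "count": count,
        "offset": offset,
        "attributes": ATTRIBUTES,
    }

def fetch_page(api_key, count=10, offset=0):
    """Fetch one page of biology papers; returns the list of entities."""
    url = MAG_EVALUATE_URL + "?" + urllib.parse.urlencode(build_params(count, offset))
    req = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp).get("entities", [])
```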

  2. Transform the paper data into the format expected by MATCH.

Put each paper in this JSON format.

Example:

{
  "paper": "2133743025",
  "doi": "10.1016/J.CUB.2007.07.011"
  "mag": [
    "microtubule_polymerization", "microtubule", "tubulin", "guanosine_triphosphate", "growth_rate", "gtp'", "optical_tweezers", "biophysics", "dimer", "biology"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "2002430130", "2089645884", "1848121837"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": []
}

paper = MAG paper ID
mag = MAG normalized field-of-study names, F.FN (all lowercase, spaces replaced by underscores)
venue = MAG venue full name, VFN
author = array of MAG author IDs
reference = array of MAG paper IDs
text = title + abstract (tokenize the text, remove all punctuation, and convert all characters to lowercase)
label = empty array
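The mapping above can be sketched as a small transform function. The attribute names come from the API request earlier in the thread; treating `AW` as a plain list of abstract words is an assumption:

```python
import re

def normalize_field(name):
    """Lowercase a field-of-study name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")

def clean_text(raw):
    """Lowercase and strip punctuation, keeping whitespace-separated tokens."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", raw.lower()).split())

def to_match_record(entity):
    """Map one MAG entity to the MATCH input schema described above."""
    title = entity.get("Ti", "")
    abstract_words = entity.get("AW", [])  # assumption: AW is a list of abstract words
    return {
        "paper": str(entity["Id"]),
        "doi": entity.get("DOI", ""),
        "mag": [normalize_field(f["FN"]) for f in entity.get("F", [])],
        "venue": entity.get("VFN", ""),
        "author": [str(a["AuId"]) for a in entity.get("AA", [])],
        "reference": [str(r) for r in entity.get("RId", [])],
        "text": clean_text(title + " " + " ".join(abstract_words)),
        "label": [],
    }
```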

How to clean text in python: https://machinelearningmastery.com/clean-text-machine-learning-python/
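Following the linked article's approach, the cleaning step (tokenize, strip punctuation, lowercase) can be sketched with the standard library:

```python
import string

def clean_text(raw):
    """Tokenize, strip punctuation, lowercase -- the steps the issue asks for."""
    tokens = raw.split()                               # whitespace tokenization
    table = str.maketrans("", "", string.punctuation)  # punctuation removal
    return " ".join(t.translate(table).lower() for t in tokens)
```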

Then write the JSON for each paper on a single line in a .json file:

{ "paper": "2133743025","venue": "Current biology",...}
{ "paper": "2002430130","venue": "Journal of Experimental Biology",...}
...
bruffridge commented 3 years ago

@dsmith111 just created a script that may help convert data into the MATCH format. https://github.com/nasa-petal/PeTaL-labeller/tree/main/scripts/lens-cleaner

Simarkohli24 commented 3 years ago

@bruffridge do we need any follow-ups or further scripting for this issue? I know migrating to Lambda was mentioned.

bruffridge commented 3 years ago

Eventually the plan is for this to run in Lambda. For now, we just need a Python script that downloads the JSON for every paper from MAG and transforms it into the format expected by MATCH.

bruffridge commented 3 years ago

Some metrics for consideration:

~5,000 papers per API request (more than that and it may time out, based on my testing)
~7 seconds per API request
~10.81 MB per request

Total: ~5 hours, ~28 GB, and 2,600 requests to pull down 13,000,000 biology papers using the MAG API.
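Those totals follow directly from the per-request figures; a quick sanity check:

```python
papers = 13_000_000
per_request = 5_000

requests_needed = papers // per_request        # 2600 requests
hours = requests_needed * 7 / 3600             # ~5.1 hours at ~7 s per request
gigabytes = requests_needed * 10.81 / 1024     # ~27.4 GB at ~10.81 MB per request
```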

bruffridge commented 3 years ago

The papers returned are consistently ordered between requests even without a sortby parameter, which means we can probably use the offset and limit parameters to chunk through the full dataset.

1st API request:

limit: 5000
offset: 0

2nd API request:

limit: 5000
offset: 5000

3rd API request:

limit: 5000
offset: 10000

~5 hours later

2600th API request:

limit: 5000
offset: 12,995,000
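The chunking described above can be sketched as a generator, assuming a `fetch_page(count, offset)` helper (e.g. a wrapper around the MAG evaluate endpoint; the name and the pause between requests are illustrative):

```python
import time

def fetch_all(fetch_page, total=13_000_000, page_size=5000, pause=1.0):
    """Page through the full result set using count/offset chunking.

    `fetch_page(count, offset)` is assumed to return a list of entities.
    Relies on the consistent ordering between requests noted above.
    """
    for offset in range(0, total, page_size):
        entities = fetch_page(count=page_size, offset=offset)
        if not entities:
            break  # fewer papers than expected; stop early
        yield from entities
        time.sleep(pause)  # brief pause between requests to avoid timeouts
```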