nasa-petal / PeTaL-labeller

The PeTaL labeler labels journal articles with biomimicry functions.
https://petal-labeller.readthedocs.io/en/latest/
The Unlicense
Prepare train/test data for MATCH #43

Closed bruffridge closed 3 years ago

bruffridge commented 3 years ago

Take our 1375 labelled papers and put them in the following JSON format.

Example:

{
  "paper": "020-134-448-948-932",
  "mag": [
    "microtubule_polymerization", "microtubule", "tubulin", "guanosine_triphosphate", "growth_rate", "gtp'", "optical_tweezers", "biophysics", "dimer", "biology"
  ],
  "mesh": [
    "D048429", "D000431"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "020-720-960-216-820", "052-873-952-181-099", "000-849-951-902-070"
  ],
  "scholarly_citations": [
    "000-393-690-357-939", "000-539-388-379-773", "002-134-932-426-244"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": [
    "change_size_or_color", "move", "physically_assemble/disassemble", "maintain_ecological_community"
  ]
}

JSON field mappings to lens.org fields: https://support.lens.org/scholar-field-definitions/

paper = lens_id or DOI (whatever ID is chosen must be used for references as well)
mag = Lens.org fields_of_study (all lowercase, spaces replaced by underscores)
mesh = Lens.org mesh_terms.mesh_ids
venue = Lens.org Source Title
author = array of MAG IDs
reference = array of lens_ids or DOIs
scholarly_citations = array of scholarly_citations lens_ids
text = title + abstract (tokenize the text, remove all punctuation, and convert all characters to lowercase)
label = array of biomimicry functions (all lowercase, spaces replaced by underscores)
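
The mapping above could be sketched in Python roughly as follows. This is only an illustration: the input keys (`lens_id`, `fields_of_study`, `mesh_ids`, `source_title`, `author_mag_ids`, `references`, `functions`) and the helper names `normalize_term` and `build_record` are hypothetical, not the actual Lens.org export schema, and `text` is assumed to have been cleaned already.

```python
def normalize_term(term: str) -> str:
    """Lowercase a term and join its words with underscores."""
    return "_".join(term.lower().split())


def build_record(raw: dict) -> dict:
    """Map one paper (hypothetical input keys) to the MATCH JSON fields."""
    return {
        # The same ID type (lens_id or DOI) must be used for references too.
        "paper": raw["lens_id"],
        "mag": [normalize_term(t) for t in raw.get("fields_of_study", [])],
        "mesh": list(raw.get("mesh_ids", [])),
        "venue": raw.get("source_title", ""),
        "author": [str(a) for a in raw.get("author_mag_ids", [])],
        "reference": list(raw.get("references", [])),
        "scholarly_citations": list(raw.get("scholarly_citations", [])),
        # Assumed to be title + abstract, already tokenized and lowercased.
        "text": raw.get("text", ""),
        "label": [normalize_term(f) for f in raw.get("functions", [])],
    }


example = build_record({
    "lens_id": "020-134-448-948-932",
    "fields_of_study": ["Optical tweezers", "Biology"],
    "mesh_ids": ["D048429", "D000431"],
    "source_title": "Current biology",
    "author_mag_ids": [2305659199],
    "references": ["020-720-960-216-820"],
    "scholarly_citations": ["000-393-690-357-939"],
    "text": "microtubule assembly dynamics at the nanoscale",
    "functions": ["Physically assemble/disassemble", "Move"],
})
```

Note that `normalize_term` only touches case and whitespace, so a slash such as the one in `physically_assemble/disassemble` is kept, matching the example record.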

How to clean text in python: https://machinelearningmastery.com/clean-text-machine-learning-python/
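
As a minimal sketch of the text cleaning step, using only the standard library: the function name `clean_text` is made up for illustration, whitespace splitting stands in for a real tokenizer, and only ASCII punctuation (`string.punctuation`) is stripped. The linked article covers more thorough options.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(title: str, abstract: str) -> str:
    """Join title and abstract, strip punctuation, and lowercase every token."""
    combined = f"{title} {abstract}"
    tokens = combined.split()  # simple whitespace tokenization (assumption)
    cleaned = (tok.translate(_PUNCT_TABLE).lower() for tok in tokens)
    # Drop tokens that were punctuation only and are now empty.
    return " ".join(tok for tok in cleaned if tok)


text = clean_text(
    "Microtubule Assembly Dynamics at the Nanoscale",
    "Background: The labile nature of microtubules is critical.",
)
```

With this approach a hyphenated word loses its hyphen and becomes a single token; replacing punctuation with spaces instead would split it into two tokens. Which behaviour is wanted is left open here.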

Then write the JSON for each paper on a single line in a .json file, one paper per line:

{ "paper": "020-134-448-948-932","venue": "Current biology",...}
{ "paper": "234-235-384-291-673","venue": "Journal of Experimental Biology",...}
...
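
One way to produce this one-object-per-line layout is with the standard `json` module, as sketched below. The helper names `write_json_lines` and `read_json_lines` are illustrative, and the output path is whatever file name is chosen for the train or test split.

```python
import json


def write_json_lines(records, path):
    """Write each record as one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps emits no newlines by default, so each paper
            # stays on a single line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Read the file back, parsing one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_json_lines` is a quick check that every line is valid JSON on its own.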