lappsgrid-incubator / galaxy-paper-rank

Project home for the paper identification project for Galaxy.
Apache License 2.0

Training Data #10

Open ksuderman opened 3 years ago

ksuderman commented 3 years ago

Most of the low-hanging fruit has been harvested from the Zotero spreadsheet and can be found in /data/corpura/curation/gold on the Lappsgrid server. These files are (or at least should be) all true positives. For training purposes we also need negative examples.

Possible sources of negative examples

  1. Zotero spreadsheet
    • query crossref.org to find the DOI record from the title and authors
    • check whether the DOI record has a download link for TDM (text and data mining)
  2. Query crossref.org
    • find the DOIs for all articles that contain the string "galaxy"
    • remove articles that appear in the Zotero spreadsheet
    • How do we determine whether the remaining articles are negatives, or positives we don't know about yet?
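
The Crossref steps in both options could be sketched roughly as below. This is only a starting point, not a tested pipeline. It assumes the public Crossref REST API at api.crossref.org/works, its `query.bibliographic` and `query.author` parameters, and that TDM download links appear in a record's `link` array with `intended-application` set to `text-mining`. The function names are hypothetical. Note that Crossref's `query` parameter searches metadata rather than full text, so it only approximates "contains the string galaxy", and the sketch does not settle whether the leftover articles are real negatives.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Crossref REST API.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_lookup_url(title, authors, rows=1):
    """Build a query URL that looks up a DOI record by title and authors."""
    params = {
        "query.bibliographic": title,
        "query.author": " ".join(authors),
        "rows": rows,
    }
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def build_keyword_url(keyword="galaxy", rows=100):
    """Build a query URL for records whose metadata matches a keyword."""
    params = {"query": keyword, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def fetch_items(url):
    """Fetch a query URL and return the list of matching records.

    This performs a network request; results are in message.items.
    """
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["message"]["items"]


def tdm_links(record):
    """Return the URLs in a record that are flagged for text mining."""
    return [
        link["URL"]
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


def drop_known(records, known_dois):
    """Remove records whose DOI already appears in the Zotero spreadsheet."""
    known = {doi.lower() for doi in known_dois}
    return [r for r in records if r.get("DOI", "").lower() not in known]
```

For example, `drop_known(fetch_items(build_keyword_url()), zotero_dois)` would give the candidate pool for option 2, where `zotero_dois` stands in for the DOIs exported from the spreadsheet. DOIs are compared case-insensitively because the same DOI can appear with different casing.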

@nancyide please provide feedback on the size of the training set we should aim for and the ratio of positives to negatives.

nancyide commented 3 years ago

The corpus should be as large as we can possibly make it. Positive and negative examples should be in a 1:1 ratio.
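
One simple way to reach a 1:1 ratio is to randomly downsample whichever class is larger. The sketch below assumes the positives and negatives are already available as two lists (of file paths or DOIs, say). The function name and the fixed seed are illustrative choices, not part of the project.

```python
import random


def balance_one_to_one(positives, negatives, seed=0):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n = min(len(positives), len(negatives))
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```

Downsampling discards data from the larger class, which works against making the corpus as large as possible. Collecting more examples of the smaller class is preferable where that is feasible.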