Most of the low-hanging fruit has been harvested from the Zotero spreadsheet and can be found in /data/corpura/curation/gold on the Lappsgrid server. These files are (should be) all true positives. For training purposes we also need negative examples.
Possible sources of negative examples:
- Zotero spreadsheet
  - Query crossref.org to find the DOI record from the title and authors
  - Check whether the DOI record has a download link for TDM (text and data mining)
- Query crossref.org
  - Find the DOIs of all articles that contain the string "galaxy"
  - Remove articles that appear in the Zotero spreadsheet
  - How do we determine whether the remaining articles are negatives or just positives we don't know about yet?
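The Crossref steps above could be sketched roughly as follows. This is a minimal sketch, not the project's actual tooling: the function names are hypothetical, and it assumes the public Crossref REST API at `https://api.crossref.org/works`, its `query.bibliographic` and `query.author` parameters, and the `link` entries with `intended-application` set to `text-mining` that Crossref records may carry. It also assumes the DOIs from the Zotero spreadsheet have already been exported to a plain list.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Crossref REST API works endpoint.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(bibliographic, author=None, rows=5):
    """Build a Crossref /works search URL from a title or keyword (and optional author)."""
    params = {"query.bibliographic": bibliographic, "rows": rows}
    if author:
        params["query.author"] = author
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def fetch_items(url):
    """Fetch one page of results as a list of records; requires network access."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["message"]["items"]


def tdm_links(record):
    """Return the URLs that a record marks as intended for text mining."""
    return [
        link["URL"]
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def normalize_doi(doi):
    """DOIs are case-insensitive, so compare them lowercased."""
    return doi.strip().lower()


def unknown_candidates(records, zotero_dois):
    """Drop records whose DOI already appears in the Zotero spreadsheet."""
    known = {normalize_doi(d) for d in zotero_dois}
    return [r for r in records if normalize_doi(r["DOI"]) not in known]
```

For example, `unknown_candidates(fetch_items(build_query_url("galaxy", rows=100)), zotero_dois)` would return the first page of "galaxy" hits that are not in the spreadsheet. Those are only unlabeled candidates; the sketch does nothing to decide whether they are true negatives.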
@nancyide please provide feedback on the size of the training set we should aim for and the ratio of positives to negatives.