Open kermitt2 opened 3 years ago
Easier process, no need to use OAI-PMH for the first point:
`zipgrep -h 'categories":"hep-ex' compressed_arxiv-metadata-oai-snapshot.json.zip > hep-ex.json`
We could probably consider only `hep-ex` for software, which gives 19,527 entries (11,768 with a DOI).
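For anyone who prefers to do the same selection without `zipgrep`, here is a minimal Python sketch. It assumes the snapshot has been unzipped to a JSON Lines file (one record per line) and that each record carries a space-separated `categories` string and a `doi` field; those field layouts and the function names `primary_category_records` and `count_with_doi` are assumptions for illustration, not something the thread confirms. Like the `zipgrep` pattern, it keeps only records whose first listed category is `hep-ex`.

```python
import json
from typing import Iterable, Iterator, Tuple


def primary_category_records(lines: Iterable[str], category: str) -> Iterator[dict]:
    """Yield metadata records whose first listed category equals `category`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: "categories" is a space-separated string, primary category first.
        categories = (record.get("categories") or "").split()
        if categories and categories[0] == category:
            yield record


def count_with_doi(records: Iterable[dict]) -> Tuple[int, int]:
    """Return (total records, records with a non-empty "doi" field)."""
    total = 0
    with_doi = 0
    for record in records:
        total += 1
        if record.get("doi"):
            with_doi += 1
    return total, with_doi
```

Passing an open file handle on the unzipped snapshot to `primary_category_records(fh, "hep-ex")` and the result to `count_with_doi` should give the two counts quoted above, provided the assumed field layout holds.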
The prepared JSON files, segmented at sentence level and with candidate entities generated by the fine-tuned SciBERT+CRF model, are under https://github.com/howisonlab/softcite-dataset/tree/master/data/hep_json_service_scibert
I think the quality and recall are better than with the whitelist (https://github.com/howisonlab/softcite-dataset/tree/master/data/hep_json_whitelist) and the CRF (https://github.com/howisonlab/softcite-dataset/tree/master/data/hep_json_service-crf).
Note that I have not removed standalone "version" attribute entities (those with no software name nearby to attach to), because they are generally a good indication that a software name to be annotated is in proximity.
And I have prepared a "blacklist" file for HEP (with mainly names of collaborations/experiments/instruments) here: https://github.com/howisonlab/softcite-dataset/blob/master/data/software_lists/stop/manual_stop_hep_patrice.csv
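As an illustration of how such a stop list could be applied to the candidate entities, here is a small sketch. The record layout (dicts with `type` and `text` keys), the assumption that the stop term sits in the first CSV column, and the function names are all hypothetical; the actual JSON and CSV layouts in the repository may differ. Only software-name candidates are dropped, so standalone version entities are kept as described above.

```python
import csv
from typing import Iterable, List, Set


def load_stop_terms(rows: Iterable[str]) -> Set[str]:
    """Collect lower-cased stop terms from the first column of CSV text lines."""
    # Assumption: one term per row, in the first column.
    return {
        row[0].strip().lower()
        for row in csv.reader(rows)
        if row and row[0].strip()
    }


def filter_candidates(candidates: List[dict], stop_terms: Set[str]) -> List[dict]:
    """Drop software-name candidates found in the stop list; keep everything else."""
    kept = []
    for cand in candidates:
        # Hypothetical keys: "type" (e.g. "software", "version") and "text".
        is_software = cand.get("type") == "software"
        name = cand.get("text", "").strip().lower()
        if is_software and name in stop_terms:
            continue  # e.g. a collaboration or instrument name
        kept.append(cand)
    return kept
```

Matching is case-insensitive here, which is a design choice of the sketch; an exact-case match may be preferable if some experiment names collide with real software names.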
The articles I used are those from the `hep-ex` set (3433 articles).
set_full_name | set_name | total | with_doi | with_doi_post_2015 | with_oa_url_non_arXiv_post_2015 |
---|---|---|---|---|---|
Experiment | hep-ex | 19527 | 11768 | 5064 | 3433 |
Lattice | hep-lat | 15756 | 10504 | 3409 | 1672 |
Phenomenology | hep-ph | 113263 | 83055 | 31996 | 14917 |
Theory | hep-th | 90700 | 71012 | 27117 | 12959 |
see https://github.com/howisonlab/softcite-dataset/blob/master/data/hep-sets.md
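To compare the sets at a glance, the counts in the table can be turned into proportions. The sketch below only restates the table's numbers; the dictionary name `HEP_SETS` and the function `shares` are illustrative.

```python
# Counts copied from the table: (total, with_doi, with_doi_post_2015,
# with_oa_url_non_arXiv_post_2015).
HEP_SETS = {
    "hep-ex": (19527, 11768, 5064, 3433),
    "hep-lat": (15756, 10504, 3409, 1672),
    "hep-ph": (113263, 83055, 31996, 14917),
    "hep-th": (90700, 71012, 27117, 12959),
}


def shares(total: int, with_doi: int, doi_post_2015: int, oa_post_2015: int) -> dict:
    """Return the fraction of entries surviving each successive filter."""
    return {
        "doi_of_total": with_doi / total,
        "oa_of_doi_post_2015": oa_post_2015 / doi_post_2015,
        "oa_of_total": oa_post_2015 / total,
    }


if __name__ == "__main__":
    for name, counts in HEP_SETS.items():
        row = shares(*counts)
        print(name, {key: f"{value:.1%}" for key, value in row.items()})
```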
We would like to create annotator tasks from a new collection of Open Access papers in the High Energy Physics (HEP) field. HEP is interesting because it is almost entirely available in OA (publications are funded by the SCOAP³ initiative), it is complementary to the fields we already cover, and it relies significantly on scientific software.
The foreseen process is as follows:
[...] `hep-ex`, `hep-th`, etc.)
[...] `code/corpus/`
[...] (and Grobid) to get JSON representation with software mentions

All the tools of this pipeline should already be available.