howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Create annotator tasks for a HEP collection #673

Open kermitt2 opened 3 years ago

kermitt2 commented 3 years ago

We would like to create annotator tasks from a new collection of Open Access papers in the High Energy Physics (HEP) field. HEP is interesting because it is almost entirely available in OA (publications are funded by the SCOAP³ initiative), it is complementary to the fields we already cover, and it relies significantly on scientific software.

The foreseen process is as follows:

All the tools of this pipeline should be already available.

kermitt2 commented 3 years ago

Easier process, no need to use OAI-PMH for the first point:

We could probably only consider hep-ex for software, which gives 19,527 entries (11,768 with a DOI).

kermitt2 commented 3 years ago

The prepared JSON files, segmented at sentence level and with candidate entities generated by the SciBERT+CRF fine-tuned model, are under https://github.com/howisonlab/softcite-dataset/tree/master/data/hep_json_service_scibert

I think quality and recall are better than with the whitelist (https://github.com/howisonlab/softcite-dataset/tree/master/data/hep_json_whitelist) and the CRF model (https://github.com/howisonlab/softcite-dataset/tree/master/data/hep_json_service-crf).

Note that I have not removed "version" attribute entities that appear alone (without a software name nearby to attach them to), because they are generally a good indication that a software name in proximity remains to be annotated.
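To make the "version alone" case concrete, here is a minimal sketch that flags version entities with no software-name entity in the same sentence. The entity layout (dicts with `sentence`, `type` and `text` keys) and the sample values are hypothetical simplifications for illustration, not the actual schema of the prepared JSON files.

```python
from collections import defaultdict


def standalone_versions(entities):
    """Return version entities whose sentence contains no software-name entity.

    `entities` is a list of dicts with hypothetical keys:
    "sentence" (int index), "type" ("software" or "version"), "text" (str).
    """
    by_sentence = defaultdict(list)
    for ent in entities:
        by_sentence[ent["sentence"]].append(ent)

    flagged = []
    for sent_id in sorted(by_sentence):
        ents = by_sentence[sent_id]
        has_software = any(e["type"] == "software" for e in ents)
        if not has_software:
            # Keep these: they hint at a software name the model missed.
            flagged.extend(e for e in ents if e["type"] == "version")
    return flagged


# Illustrative candidates: sentence 0 has a name and a version,
# sentence 1 has only a version.
example = [
    {"sentence": 0, "type": "software", "text": "GEANT4"},
    {"sentence": 0, "type": "version", "text": "10.4"},
    {"sentence": 1, "type": "version", "text": "v2.3"},
]

print([e["text"] for e in standalone_versions(example)])
```

In such a setup, the flagged entities would be kept in the annotator task rather than dropped, so that annotators look for the missing software name around them.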

I have also prepared a "blacklist" file for HEP (mainly names of collaborations, experiments and instruments) here: https://github.com/howisonlab/softcite-dataset/blob/master/data/software_lists/stop/manual_stop_hep_patrice.csv
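A stop list like this could be applied to candidate entities roughly as follows. This is only a sketch: it assumes one term per row in the first CSV column with no header and case-insensitive matching, which may not match the actual layout of the file. The function names and sample terms are hypothetical.

```python
import csv
import io


def load_stop_terms(csv_file):
    """Read stop terms from the first column of a CSV file object, lowercased.

    Assumption: one term per row, first column, no header row.
    """
    return {
        row[0].strip().lower()
        for row in csv.reader(csv_file)
        if row and row[0].strip()
    }


def filter_candidates(candidates, stop_terms):
    """Drop candidate strings that match a stop term, ignoring case."""
    return [c for c in candidates if c.strip().lower() not in stop_terms]


# In-memory stand-in for the CSV file, with illustrative experiment names.
stop_csv = io.StringIO("ATLAS\nCMS\nLHCb\n")
stop_terms = load_stop_terms(stop_csv)

candidates = ["ATLAS", "Pythia", "cms", "ROOT"]
print(filter_candidates(candidates, stop_terms))
```

With a real file, `open(path, newline="", encoding="utf-8")` would replace the `io.StringIO` stand-in.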

kermitt2 commented 3 years ago

The articles I used are those from the hep-ex set (3433 articles).

| set_full_name | set_name | total | with_doi | with_doi_post_2015 | with_oa_url_non_arXiv_post_2015 |
|---|---|---|---|---|---|
| Experiment | hep-ex | 19527 | 11768 | 5064 | 3433 |
| Lattice | hep-lat | 15756 | 10504 | 3409 | 1672 |
| Phenomenology | hep-ph | 113263 | 83055 | 31996 | 14917 |
| Theory | hep-th | 90700 | 71012 | 27117 | 12959 |

see https://github.com/howisonlab/softcite-dataset/blob/master/data/hep-sets.md
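To compare how much of each set survives the successive filters, the counts can be turned into ratios. The sketch below hard-codes the numbers from the table and assumes each column is a subset of the one before it (which the column names suggest but the thread does not state). The `funnel` helper is hypothetical.

```python
# Counts copied from the table:
# (total, with_doi, with_doi_post_2015, with_oa_url_non_arXiv_post_2015)
HEP_SETS = {
    "hep-ex": (19527, 11768, 5064, 3433),
    "hep-lat": (15756, 10504, 3409, 1672),
    "hep-ph": (113263, 83055, 31996, 14917),
    "hep-th": (90700, 71012, 27117, 12959),
}


def funnel(counts):
    """Return the share of entries kept at two filtering steps.

    Assumes each count is a subset of the previous one.
    """
    total, with_doi, post_2015, oa_non_arxiv = counts
    return {
        "doi_share": with_doi / total,
        "oa_share_of_post_2015": oa_non_arxiv / post_2015,
    }


for name, counts in HEP_SETS.items():
    ratios = funnel(counts)
    print(
        f"{name}: {ratios['doi_share']:.1%} of entries have a DOI; "
        f"{ratios['oa_share_of_post_2015']:.1%} of post-2015 DOI entries "
        f"have a non-arXiv OA URL"
    )
```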