howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Bootstrapping planning #470

Open jameshowison opened 6 years ago

jameshowison commented 6 years ago

https://public.etherpad-mozilla.org/p/bootstraping_planning

jameshowison commented 6 years ago

Bootstrapping software mention detection:

Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. https://doi.org/10.1016/j.joi.2015.07.012

They use a whitelist of software names to find mentions, then train a machine learning model on those. Because they don't have a gold-standard set, they have no way of knowing which mentions they might have missed.

RQ: can we reduce the amount we read and still find sufficient software mentions?

  1. Convert PDFs to text.
  2. Use a "whitelist" of software names to search the fulltext (created by randomly sampling from the names of software that we did find).
  3. Expand a window around the places those names were found (e.g. +/- 1 page, or +/- 1% of the paper?).
  4. What proportion of mentions that we found with full paper content analysis would we catch if we just looked at those windows?
  5. Re-do steps 2-4 some number of times (statistical bootstrapping) to get a confidence interval around the proportion found.
  6. Vary the size of the random sample in step 2 and the size of the window in step 3, recalculating the proportion found each time. Draw a graph with sample size on the vertical axis and window size on the horizontal axis (a rough sketch of steps 2-5 follows below).
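
A rough Python sketch of steps 2-5, just to make the resampling loop concrete. The data structures are hypothetical: `papers` is a list of `(fulltext, gold_mentions)` pairs where `gold_mentions` are character offsets of mentions found by full-paper content analysis, and `all_names` is the list of software names the whitelist could be sampled from.

```python
import random

def windows_from_whitelist(fulltext, whitelist, window_chars):
    """Steps 2-3: find whitelist hits in the full text and expand a
    +/- window_chars window around each hit."""
    spans = []
    for name in whitelist:
        start = fulltext.find(name)
        while start != -1:
            spans.append((max(0, start - window_chars),
                          min(len(fulltext), start + len(name) + window_chars)))
            start = fulltext.find(name, start + 1)
    return spans

def proportion_caught(gold_mentions, spans):
    """Step 4: fraction of gold-standard mention offsets that fall inside a window."""
    caught = sum(any(lo <= pos < hi for lo, hi in spans) for pos in gold_mentions)
    return caught / len(gold_mentions)

def bootstrap_ci(papers, all_names, sample_size, window_chars, n_iter=1000):
    """Step 5: resample the whitelist, recompute the mean proportion found,
    and report a 95% interval from the empirical distribution."""
    proportions = []
    for _ in range(n_iter):
        whitelist = random.sample(all_names, sample_size)  # step 2: random whitelist
        per_paper = [proportion_caught(gold,
                                       windows_from_whitelist(text, whitelist, window_chars))
                     for text, gold in papers if gold]
        proportions.append(sum(per_paper) / len(per_paper))
    proportions.sort()
    return proportions[int(0.025 * n_iter)], proportions[int(0.975 * n_iter)]
```

Step 6 would then just sweep `sample_size` and `window_chars` over a grid, calling `bootstrap_ci` for each combination and plotting the resulting intervals.
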
kermitt2 commented 6 years ago

In the GROBID entity recognition modules and entity-fishing, in particular for recognizing scientific quantities, astronomical entities and named entities, we use a different, machine-learning-driven bootstrapping approach:

  1. annotation of a very small amount of training data from XML generated from PDF (to be sure that the XML will match the PDF), for instance one document
  2. training of a first model
  3. generation of training data by converting the PDF to XML, with annotations produced by the first model
  4. correction/completion of the generated training data by annotators
  5. re-training of a new model and evaluation
  6. go back to 3 until we are satisfied with the evaluation :) (a rough sketch of this loop follows below)
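
A minimal sketch of that loop, with the GROBID/entity-fishing-specific steps replaced by caller-supplied placeholder functions (none of these names correspond to an actual GROBID API; they only stand in for the training, conversion, correction and evaluation steps above):

```python
def bootstrap_training(seed_xml_docs, unannotated_pdfs,
                       train_model, pdf_to_xml, manual_correction, evaluate,
                       target_f1=0.90):
    """Iterative training-data generation: each round, the current model
    pre-annotates a new document and annotators only correct its output.
    All callables are supplied by the caller (placeholders, not a real API)."""
    training_data = list(seed_xml_docs)        # step 1: tiny hand-annotated seed
    model = train_model(training_data)         # step 2: first model
    for pdf in unannotated_pdfs:
        # step 3: convert the PDF to XML, pre-annotated by the current model
        generated = model.annotate(pdf_to_xml(pdf))
        # step 4: annotators correct/complete the generated training data
        training_data.append(manual_correction(generated))
        # step 5: re-train a new model and evaluate it
        model = train_model(training_data)
        if evaluate(model) >= target_f1:       # step 6: stop when satisfied
            break
    return model
```
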

The motivation is that machine learning is super sensitive to the quality of training data. If we train with incomplete/erroneous training data, we get bad results. It's more efficient to train with a very small amount of high-quality training data than with a lot of incomplete/low-quality annotated data (as we would get with a simple whitelist of software names).

The second reason is to ensure that the training data, here the XML, is exactly aligned with the content of the PDF as seen by the tool that will later process the PDF. Cut-and-paste from a PDF raises problems, because the copied text depends on the PDF viewer (which may reorder PDF elements compared to the PDF element stream order). The latest macOS PDF viewer, for instance, significantly post-processes the content of the PDF (recomposing characters and even performing some OCR to resolve placeholder UTF-8 codes for embedded glyphs).