howisonlab / screenit-softcite


issues with harvester #1

Closed jameshowison closed 1 year ago

jameshowison commented 1 year ago

@kermitt2 I wonder if you might be able to help here. I'm trying to get a from-scratch pipeline up to process this very small group of DOIs (~1000). I'm not sure of the source of the issue I'm running into; I think it's getting the C-language dependencies in place for the Python libraries.

My hope is to get this working with repo2docker (rather than the pre-packaged containers), so packages that need to be installed via apt-get should go in apt.txt and the Python dependencies in requirements.txt.
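As a sketch, the split usually looks something like this (the pinned package version is taken from later in this thread, and the apt package names are guesses typical for XML-heavy Python stacks, not a verified list; check what the wheels actually need):

```text
# requirements.txt (Python packages, installed via pip)
article-dataset-builder==0.2.4

# apt.txt (Debian packages, installed via apt-get; one name per line)
libxml2-dev
libxslt1-dev
```

The comment headers above are just labels for this sketch; each real file is a plain list that repo2docker reads line by line.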

You can see the current issue I'm getting at https://github.com/howisonlab/screenit-softcite/blob/main/screen_it_pipeline.ipynb

I think this is really a question of aligning how you've been installing the dependencies with how repo2docker does it: https://repo2docker.readthedocs.io/en/latest/config_files.html. Certainly this is just how I'd prefer to package things, not that the containers etc. don't work :)

Any help welcomed!

kermitt2 commented 1 year ago

Hi @jameshowison !

There are two problems actually.

1) There's a default maximum of 126 readers for LMDB, so it makes sense that it fails at the 126th DOI! This pointed me to a likely bug with LMDB transactions not being properly closed in case of failure. I think I introduced this bug in my latest version of the harvester (not enough tests).

I have fixed it in a new release, 0.2.4 (it fixes a possible problem with FTP downloads too):

```shell
python3 -m pip install article-dataset-builder==0.2.4
```
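This is not the harvester's actual code, but a toy model of what goes wrong: LMDB has a fixed reader table (126 slots by default), and a read transaction that is never closed after a failure occupies its slot forever. Using the transaction as a context manager releases the slot even when the body raises:

```python
class ToyEnv:
    """Stand-in for an LMDB environment with a fixed reader table."""
    def __init__(self, max_readers=126):
        self.max_readers = max_readers
        self.active = 0  # currently occupied reader slots

    def begin(self):
        return ToyTxn(self)

class ToyTxn:
    """Stand-in for a read transaction: holds one reader slot while open."""
    def __init__(self, env):
        self.env = env

    def __enter__(self):
        if self.env.active >= self.env.max_readers:
            raise RuntimeError("MDB_READERS_FULL: no free reader slots")
        self.env.active += 1
        return self

    def __exit__(self, *exc):
        self.env.active -= 1  # the release that a leaked transaction skips
        return False

# Leaky pattern: one transaction per DOI, never closed on failure.
leaky = ToyEnv()
for _ in range(126):
    leaky.begin().__enter__()   # opened but never exited
try:
    with leaky.begin():
        pass
except RuntimeError as e:
    print(e)                    # the reader table is now exhausted

# Safe pattern: the context manager frees the slot on every iteration.
safe = ToyEnv()
for _ in range(1000):
    with safe.begin():
        pass
print(safe.active)              # 0
```

The same discipline applies to the real py-lmdb bindings: wrapping `env.begin()` in a `with` block guarantees the reader slot is returned regardless of exceptions.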

2) Another problem, the actual source of the failures: harvester.harvest_dois() takes as input a file with one DOI per line. data/comparison_full_set.csv has only PMC identifiers, and they are not one per line, so I think the whole line is being taken as a DOI.

So you would need to use harvester.harvest_pmcids() for this small dataset, with just the PMC ids, one per line. Attached is a data/comparison_full_set_pmc.txt file containing just the first column of data/comparison_full_set.csv. With this file, I have a working harvest of these 1500 PMC files.

comparison_full_set_pmc.txt
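That first-column extraction can also be scripted. A minimal sketch with Python's csv module, using a tiny inline sample since the real data/comparison_full_set.csv lives in the repo:

```python
import csv

# Build a tiny stand-in for data/comparison_full_set.csv; the real file
# has the PMC id in the first column plus further columns (and possibly
# a header row, which would then need skipping).
with open("comparison_sample.csv", "w") as f:
    f.write("PMC1234567,10.1000/xyz123,extra\n")
    f.write("PMC7654321,10.1000/abc456,extra\n")

# Keep only the first column, one PMC id per line, which is the input
# format harvester.harvest_pmcids() expects.
with open("comparison_sample.csv", newline="") as fin, \
        open("comparison_sample_pmc.txt", "w") as fout:
    for row in csv.reader(fin):
        if row and row[0].strip():
            fout.write(row[0].strip() + "\n")

print(open("comparison_sample_pmc.txt").read())
# PMC1234567
# PMC7654321
```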

Not related: the harvester downloads PDF and JATS XML files. After harvesting, you can transform all the JATS files in the data path into TEI files via:

```python
nlm2tei = Nlm2tei(config_path=config_path)
nlm2tei.process()
```

See https://github.com/kermitt2/article_dataset_builder#converting-the-pmc-xml-jats-files-into-xml-tei

Sending TEI files to the software mention recognizer is much faster than sending JATS files, because each JATS file then has to be transformed individually and loading the XSLT takes about 2 seconds. Transforming the JATS files in batch loads the XSLT only once for all of them. But for 1500 files, the runtime is not so crucial.
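A back-of-the-envelope estimate of that overhead, taking the ~2 s XSLT load time from the thread as the only assumption:

```python
files = 1500
xslt_load_s = 2.0   # approx. time to load the XSLT once (from the thread)

# JATS sent file by file: the stylesheet is reloaded for every document.
one_by_one_overhead_s = files * xslt_load_s

# Batch conversion: the stylesheet is loaded a single time.
batched_overhead_s = xslt_load_s

print(one_by_one_overhead_s / 60)   # 50.0 -> minutes of pure XSLT loading
```

So the per-file route spends roughly 50 minutes just reloading the stylesheet, while the batch route pays the 2 seconds once; actual transformation time comes on top of both.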

jameshowison commented 1 year ago

The harvester now works well, thanks! Continued in #2