kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Fail to segment the PMC oa list file: |2024-06-20 05:36:50| nb tokens: 1 #96

Open lfoppiano opened 2 weeks ago

lfoppiano commented 2 weeks ago

I ran the import in the following order: Crossref (I loaded around 96M records before I had to stop because I ran out of space), HAL, and then PMID. I got the following exception when running ./gradlew pmid.

ERROR [2024-06-20 10:07:05,722] com.scienceminer.glutton.storage.lookup.PMIdsLookup: Fail to segment the PMC oa list file: |2024-06-20 05:36:50| nb tokens: 1
! java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 1
! at com.scienceminer.glutton.storage.lookup.PMIdsLookup.loadFromFileExtra(PMIdsLookup.java:131)
! at com.scienceminer.glutton.command.LoadPMIDCommand.run(LoadPMIDCommand.java:97)
! at com.scienceminer.glutton.command.LoadPMIDCommand.run(LoadPMIDCommand.java:29)
! at io.dropwizard.core.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
! at io.dropwizard.core.cli.Cli.run(Cli.java:78)
! at io.dropwizard.core.Application.run(Application.java:94)
! at com.scienceminer.glutton.web.LookupServiceApplication.main(LookupServiceApplication.java:200)

Full log

(base) lfoppiano@grobid-eval:~/biblio-glutton$ ./gradlew pmid

> Task :pmid
WARN  [2024-06-20 10:06:15,169] org.hibernate.validator.internal.properties.javabean.JavaBeanExecutable: HV000254: Missing parameter metadata for ResponseMeteredLevel(String, int), which declares implicit or synthetic parameters. Automatic resolution of generic type information for method parameters may yield incorrect results if multiple parameters have the same erasure. To solve this, compile your code with the '-parameters' flag.
Downloading https://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz ...
Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt ...
INFO  [2024-06-20 10:06:41,793] com.scienceminer.glutton.command.LoadPMIDCommand: Preparing the system. Loading data for PMID from data/pmc/PMID_PMCID_DOI.csv.gz
6/20/24, 10:06:56 AM ===========================================================

-- Meters ----------------------------------------------------------------------
pmidLookup
             count = 1507241
         mean rate = 101762.22 events/second
     1-minute rate = 84847.91 events/second
     5-minute rate = 82588.72 events/second
    15-minute rate = 82197.33 events/second

INFO  [2024-06-20 10:07:05,717] com.scienceminer.glutton.storage.lookup.PMIdsLookup: Cross checking number of records processed:: 2094540
INFO  [2024-06-20 10:07:05,720] com.scienceminer.glutton.command.LoadPMIDCommand: PubMed lookup loaded {pmid_doi2ids=946688, pmid_pmc2ids=850087, pmid_pmid2ids=1287533} records. 
ERROR [2024-06-20 10:07:05,722] com.scienceminer.glutton.storage.lookup.PMIdsLookup: Fail to segment the PMC oa list file: |2024-06-20 05:36:50| nb tokens: 1
! java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 1
! at com.scienceminer.glutton.storage.lookup.PMIdsLookup.loadFromFileExtra(PMIdsLookup.java:131)
! at com.scienceminer.glutton.command.LoadPMIDCommand.run(LoadPMIDCommand.java:97)
! at com.scienceminer.glutton.command.LoadPMIDCommand.run(LoadPMIDCommand.java:29)
! at io.dropwizard.core.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
! at io.dropwizard.core.cli.Cli.run(Cli.java:78)
! at io.dropwizard.core.Application.run(Application.java:94)
! at com.scienceminer.glutton.web.LookupServiceApplication.main(LookupServiceApplication.java:200)
6/20/24, 10:07:11 AM ===========================================================

-- Meters ----------------------------------------------------------------------
pmidLookup
             count = 2094540
         mean rate = 70200.77 events/second
     1-minute rate = 81884.28 events/second
     5-minute rate = 82109.67 events/second
    15-minute rate = 82045.09 events/second
pmidLookupExtra
             count = 123960
         mean rate = 20075.23 events/second
     1-minute rate = 18892.00 events/second
     5-minute rate = 18892.00 events/second
    15-minute rate = 18892.00 events/second

6/20/24, 10:07:26 AM ===========================================================

-- Meters ----------------------------------------------------------------------
pmidLookup
             count = 2094540
         mean rate = 46712.91 events/second
     1-minute rate = 63771.54 events/second
     5-minute rate = 78105.14 events/second
    15-minute rate = 80689.00 events/second
pmidLookupExtra
             count = 318952
         mean rate = 15061.20 events/second
     1-minute rate = 17701.31 events/second
     5-minute rate = 18631.79 events/second
    15-minute rate = 18803.95 events/second

warning: total PMC references from OA file not found in DOI/PMC mapping file:5800662
INFO  [2024-06-20 10:07:27,616] com.scienceminer.glutton.storage.lookup.PMIdsLookup: Cross checking number of records processed:: 321050
INFO  [2024-06-20 10:07:27,617] com.scienceminer.glutton.command.LoadPMIDCommand: PubMed lookup extra infos loaded in {pmid_doi2ids=946688, pmid_pmc2ids=850087, pmid_pmid2ids=1287533} records. 
INFO  [2024-06-20 10:07:27,618] com.scienceminer.glutton.command.LoadPMIDCommand: Cleaning downloaded resource files
INFO  [2024-06-20 10:07:27,620] com.scienceminer.glutton.command.LoadPMIDCommand: Finished in 45 s

Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

See https://docs.gradle.org/7.2/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1m 16s
3 actionable tasks: 1 executed, 2 up-to-date
(base) lfoppiano@grobid-eval:~/biblio-glutton$ 
kermitt2 commented 2 weeks ago

Hi @lfoppiano

It is likely that the download of https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt failed, contained errors, or was interrupted. I think that in case of failure the downloaded file is still under data/pmc/oa_file_list.txt; if that's the case, you can run tail on this file to see whether the last line is broken. I am using Apache FileUtils for the download, so something more robust might help.
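To check the whole file rather than only its tail, a minimal sketch like the one below lists every line with too few fields. It is not part of biblio-glutton: the class name `OaListCheck` and its method are hypothetical, and it assumes the fields are tab-separated and that at least three are needed (the exception above reports index 2 on an array of length 1).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of biblio-glutton.
class OaListCheck {
    // Assumption: fields are tab-separated and the loader reads index 2,
    // so a usable line needs at least 3 tokens.
    static final int MIN_TOKENS = 3;

    /** Returns "lineNumber: content" for every line with fewer than MIN_TOKENS fields. */
    static List<String> findShortLines(Iterable<String> lines) {
        List<String> problems = new ArrayList<>();
        int lineNumber = 0;
        for (String line : lines) {
            lineNumber++;
            // A limit of -1 keeps trailing empty fields, so they still count as tokens.
            String[] tokens = line.split("\t", -1);
            if (tokens.length < MIN_TOKENS) {
                problems.add(lineNumber + ": " + line);
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OaListCheck.java <path to oa_file_list.txt>");
            return;
        }
        // Stream the file lazily instead of loading it into memory at once.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            findShortLines(lines::iterator).forEach(System.out::println);
        }
    }
}
```

Running it on data/pmc/oa_file_list.txt would show whether the short line sits at the end of the file (a truncated download) or somewhere else.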

One solution: you can just rerun the command. It might work at some point if you have a good internet connection :)
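As an illustration of what "something more robust" could look like, here is a rough sketch of a download with automatic retries, using only the JDK. It is not the current biblio-glutton code: the class `RetryingDownload`, its methods, and the attempt and pause parameters are all hypothetical. It writes to a temporary `.part` file first, so a failed attempt never leaves a partial file at the final path. It does not by itself verify that the content is complete.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of biblio-glutton.
class RetryingDownload {
    @FunctionalInterface
    interface IoAction {
        void run() throws IOException;
    }

    /** Runs the action until it succeeds; returns the number of attempts used. */
    static int runWithRetries(IoAction action, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (IOException e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new UncheckedIOException("all " + maxAttempts + " attempts failed", last);
    }

    /** Downloads to "<target>.part", then moves it into place on success. */
    static void download(String url, Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".part");
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A call such as `RetryingDownload.runWithRetries(() -> RetryingDownload.download(url, path), 3, 5000L)` would then try up to three times, pausing five seconds between attempts.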