kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Error during import of gz files #78

Closed steppo83 closed 3 weeks ago

steppo83 commented 2 years ago

Hello, I downloaded via torrent all the gz files located at https://academictorrents.com/details/4dcfdf804775f2d92b7a030305fa0350ebef6f3e and tried to import them into the biblio-glutton database with Docker Compose, using this command at the end of the Dockerfile:

CMD java -jar lib/lookup-service-0.2-onejar.jar crossref --input /app/data/crossref-data/April2022 /app/config/config.yml

I get the same error for every file:

ERROR [2022-09-22 09:42:10,531] com.scienceminer.lookup.reader.CrossrefJsonlReader: Some serious error when deserialize the JSON object:
biblio-glutton-biblio-1 | },
biblio-glutton-biblio-1 | ! com.fasterxml.jackson.core.JsonParseException: Unexpected close marker '}': expected ']' (for root starting at [Source: (String)" },"; line: 1, column: 0])
biblio-glutton-biblio-1 | ! at [Source: (String)" },"; line: 1, column: 10]
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.base.ParserBase._reportMismatchedEndMarker(ParserBase.java:1016)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._closeScope(ReaderBasedJsonParser.java:2888)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4247)
biblio-glutton-biblio-1 | ! at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2720)
biblio-glutton-biblio-1 | ! at com.scienceminer.lookup.reader.CrossrefJsonlReader.fromJson(CrossrefJsonlReader.java:53)
biblio-glutton-biblio-1 | ! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:34)
biblio-glutton-biblio-1 | ! at java.util.Iterator.forEachRemaining(Iterator.java:116)

Can you help me? Thanks

kermitt2 commented 2 years ago

Hi @steppo83 !

It seems the torrent dump "envelope" format has changed. I tested the 2021 one, but not the 2022 one (I personally use the Metadata Plus dump). Could you share the first 50 lines of the new torrent dump?
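For anyone who wants to pull the first lines out of one of the gzipped dump files without unpacking it, a minimal Java sketch follows. The class name `GzHead` and the method `head` are made up for illustration and are not part of biblio-glutton; the only assumption about the data is that each file is gzip-compressed UTF-8 text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of biblio-glutton.
class GzHead {
    // Returns at most maxLines lines from the start of a gzip-compressed text file.
    static List<String> head(Path gzFile, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            // Stop as soon as enough lines are collected, so the rest is never decompressed.
            while (lines.size() < maxLines && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java GzHead <file.json.gz>");
            return;
        }
        // Print the first 50 lines, as requested in the comment above.
        for (String line : head(Paths.get(args[0]), 50)) {
            System.out.println(line);
        }
    }
}
```

Comparing this output for a 2021 file and a 2022 file should show at a glance whether the JSON is laid out on one line or spread over several.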

steppo83 commented 2 years ago

Hi @kermitt2, thanks for your answer! I can share the first 3 files in the folder, attached to this message (april2022.zip).

One question for my understanding: right now I'm importing this dump: https://archive.org/download/crossref_doi_dump_201909 . Is the dump I mentioned before, the one that is not working, a more recent alternative to the single file above that I'm importing right now? Long story short: the April 2022 Public Data File from Crossref is equivalent to crossref_doi_dump_201909 (the only difference being that it has more recent data), am I right? I'm a bit confused :)

Thanks!

lfoppiano commented 2 years ago

> I can share first 3 files that are inside the folder, attached to this message. One question for my understanding: now I'm importing this dump https://archive.org/download/crossref_doi_dump_201909 .

This is the old one, yes.

> The dump that I mention before that now is not working is an alternative (more recent) of the single file above that I'm importing right now? Long story short: April 2022 Public Data File from Crossref is equal to crossref_doi_dump_201909 (only difference is that one has more recent data) - am I right? I'm a bit confused :)

The format is different but for the overlapping part (up till 2019) it's the same data. Yeah it's confusing for everybody having so many formats...

steppo83 commented 2 years ago

OK, now it makes sense.

Thanks for the information! Hoping for an update of the code @kermitt2 :)

lfoppiano commented 2 years ago

OK, I think I found the problem... the format has not changed, but the formatting of the JSON is different.

So the trick we use to guess whether a file is a JSON array (CrossrefJsonReader.java:isJsonArray()) fails, because we check only the first line (to avoid parsing the whole JSON all over again).
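To illustrate the kind of failure described here, a minimal sketch follows. It is not the actual biblio-glutton code: the class `EnvelopeGuess`, both method names, and the assumption that the records sit in an array under an `items` key are hypothetical and chosen only for the example. It shows how a check limited to the first line gives different answers for the same JSON depending on whether it is compact or pretty-printed, and one possible alternative that looks at the first few non-whitespace characters instead.

```java
// Hypothetical illustration, not the biblio-glutton implementation.
class EnvelopeGuess {
    // Assumed envelope prefix: an object whose "items" key holds an array.
    private static final String PREFIX = "{\"items\":[";

    // First-line heuristic: only the text before the first line break is inspected.
    static boolean firstLineLooksLikeEnvelope(String content) {
        String firstLine = content.split("\n", 2)[0];
        return firstLine.replace(" ", "").startsWith(PREFIX);
    }

    // Alternative: collect the first 16 non-whitespace characters, wherever the
    // line breaks fall, and compare those. Still cheap, no full parse needed.
    static boolean prefixLooksLikeEnvelope(String content) {
        StringBuilder significant = new StringBuilder();
        for (int i = 0; i < content.length() && significant.length() < 16; i++) {
            char c = content.charAt(i);
            if (!Character.isWhitespace(c)) {
                significant.append(c);
            }
        }
        return significant.toString().startsWith(PREFIX);
    }

    public static void main(String[] args) {
        String compact = "{\"items\":[{\"DOI\":\"10.1/a\"},{\"DOI\":\"10.1/b\"}]}";
        String pretty = "{\n  \"items\": [\n    {\n      \"DOI\": \"10.1/a\"\n    },\n"
                + "    {\n      \"DOI\": \"10.1/b\"\n    }\n  ]\n}";

        System.out.println(firstLineLooksLikeEnvelope(compact)); // true
        System.out.println(firstLineLooksLikeEnvelope(pretty));  // false: first line is just "{"
        System.out.println(prefixLooksLikeEnvelope(compact));    // true
        System.out.println(prefixLooksLikeEnvelope(pretty));     // true
    }
}
```

When the first-line check returns false for a pretty-printed file, a reader that falls back to parsing each line as its own JSON document would then hit lines such as `},` on their own, which is consistent with the exception reported at the top of this issue.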

I should be able to push a PR quickly

steppo83 commented 2 years ago

@lfoppiano got it! Thanks for your work :)

lfoppiano commented 1 year ago

I've started having a look at the indexing, and there the changes are more difficult (mostly because I don't know JS well).

I'd suggest that we reduce the number of formats we support. How about deprecating the old 2019 and 2021 formats and keeping only the latest ones: the 2022 dump and the various gap / premium Crossref ones?

@kermitt2 what do you think?

steppo83 commented 1 year ago

Hello Luca! I think that would be good, since 2019/2020/2021 are old and we should take the latest one. Good for me!

Thanks, Stefano
