kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

public crossref dump compliance #54

Closed: lfoppiano closed this 3 years ago

lfoppiano commented 3 years ago

CC @Aazhar

This PR adds support for the public Crossref dump and for the older greenelab dump (from 2018). Originally derived from #52.

In addition:

lfoppiano commented 3 years ago

@achrafazharccsd I've run the import on the crossref file and I've got

Metadata Lookup Crossref size | "{crossref_Jsondoc=105687693}"

I expected it to be 120M instead...

The previous import (2018) resulted in

Metadata Lookup Crossref size | "{crossref_Jsondoc=96491709}"

The total doesn't really add up. How many documents did you count? (You can get it via GET /service/data.)
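For reference, a minimal sketch of querying that endpoint with java.net.http (Java 11+); only the /service/data path comes from the service, the host and port are assumptions about a local deployment:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: fetch the per-source record counts from the lookup service.
// "localhost:8080" is an assumption about where the service is running.
public class DataCountSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/service/data"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // JSON with the record counts
    }
}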

kermitt2 commented 3 years ago

We don't import/index "component DOIs", which are DOIs for figures, tables, etc. We keep only DOIs relevant at the document level.

Aazhar commented 3 years ago

@lfoppiano yes, we ignore records of type component, see: https://github.com/kermitt2/biblio-glutton/blob/master/lookup/src/main/java/com/scienceminer/lookup/reader/CrossrefJsonReader.java#L60
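For readers not following the link, a minimal sketch of that kind of check; the class and method names are illustrative, not the actual CrossrefJsonReader code:

import com.fasterxml.jackson.databind.JsonNode;

// Illustrative sketch of the filter applied while reading the dump:
// keep only records that carry a DOI and are not of type "component".
// Names and field access are assumptions, not the actual reader code.
public class RecordFilterSketch {
    public static boolean isIndexable(JsonNode record) {
        JsonNode doi = record.get("DOI");
        if (doi == null || doi.asText().isEmpty()) {
            return false;                      // no document-level DOI
        }
        JsonNode type = record.get("type");
        return type == null || !"component".equals(type.asText());
    }
}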

Using the Crossref snapshot from February, I got: crossref_Jsondoc=117417154

lfoppiano commented 3 years ago

OK. The numbers are still not adding up IMHO... How can it be that from 2018 to January 2021 the increase was from 96.5M to 105.7M (about 9M), while from January 2021 to February 2021 it is more than 10M? 🤷

Aazhar commented 3 years ago

I think it's because, from January 2021, Elsevier decided to open its references (I was notified of this), and there was an increase of about 15%.

kermitt2 commented 3 years ago

Mmm, references do not increase the number of DOI entries.

Let me try to summarize:

date             total DOIs    loaded
2018-01          107M          96.5M
2019-09          ?             104.5M
2020             112M          ?
2021-01          120M          105.7M
2021-02          122M          117M
now (not dump)   124M          ?

The number of component DOIs is around 5M, so 117M loaded records in February would indeed be correct, and 105.7M for January would be too low. In general, we were apparently loading too few DOIs, given that we only ignore components.

Could we have silent failures in the JSON deserialization? I really don't trust marshalling for robustness :) We had this issue repeatedly with Unpaywall, where the update import kept stupidly failing because a new attribute had been added, although it was not even used.

lfoppiano commented 3 years ago

If the JSON deserialisation fails, the exceptions are logged but not swallowed, so the process stops. The process also fails on unknown properties, but this can be changed through the ObjectMapper configuration:

mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);

(with false, Jackson silently ignores unknown properties instead). We could make this a parameter in the configuration... 😉
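A minimal sketch of what such a parameter could look like; the flag name is hypothetical, not an existing biblio-glutton configuration option:

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical sketch: build the mapper from a configuration flag instead of
// hard-coding the behaviour. "failOnUnknownProperties" is an illustrative name.
public class MapperFactorySketch {
    public static ObjectMapper build(boolean failOnUnknownProperties) {
        ObjectMapper mapper = new ObjectMapper();
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES,
                failOnUnknownProperties);
        return mapper;
    }
}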

I looked for other possible silent failures, and I added some logging when a record is ignored because it does not pass validation (e.g. no DOI, or type == component)...

lfoppiano commented 3 years ago

After several tests I added an additional meter that counts the records that are invalid or not added. This should give us some indication of how many records are rejected.

If a file is invalid the process stops; it's a bit too conservative, but at least we are sure not to make obvious mistakes... Of course, we might want to change this.
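A minimal sketch of such a meter with Dropwizard Metrics; the wiring and the onRecord method are illustrative, only the meter names match the reports shown later in this thread:

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// Minimal sketch of counting accepted vs. rejected records with Dropwizard
// Metrics. The surrounding method is illustrative, not the actual importer.
public class ImportMetricsSketch {
    private final MetricRegistry registry = new MetricRegistry();
    private final Meter loaded = registry.meter("crossrefLookup");
    private final Meter invalid = registry.meter("crossrefLookup_invalidRecords");

    public void onRecord(boolean valid) {
        if (valid) {
            loaded.mark();    // record accepted and stored
        } else {
            invalid.mark();   // record rejected (no DOI, type "component", ...)
        }
    }
}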

lfoppiano commented 3 years ago

I've run another import:

-- Meters ----------------------------------------------------------------------
crossrefLookup
             count = 105669693
         mean rate = 9032.29 events/second
     1-minute rate = 7818.76 events/second
     5-minute rate = 6660.71 events/second
    15-minute rate = 5613.19 events/second
crossrefLookup_invalidRecords
             count = 2633

Here is what the script counted; there is a small discrepancy with what is saved:

Crossref lookup size {crossref_Jsondoc=105687693} records.

.. and the result from the API

Crossref lookup size {crossref_Jsondoc=105687693} records.

However ... I'm not sure where the rest has gone...

I'm really confused 🤔

lfoppiano commented 3 years ago

OK, I found the problem, and finally the data adds up. 😄

The total is 112,400,118

-- Counters --------------------------------------------------------------------
crossrefLookup_invalidRecords
             count = 4000092

-- Meters ----------------------------------------------------------------------
crossrefLookup
             count = 108400026
         mean rate = 8989.00 events/second
     1-minute rate = 13651.89 events/second
     5-minute rate = 8448.64 events/second
    15-minute rate = 7776.67 events/second

And the counter is now synchronised with the number of records retrieved from the DB:

Crossref lookup size {crossref_Jsondoc=108502026} records.