kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Unrecognized field "journal_issn_l" in recent dump of Unpaywall #37

Closed cverluise closed 2 years ago

cverluise commented 5 years ago

Hello,

first, thanks for the truly awesome work!

Issue

I am building the embedded LMDB database and was trying to add the Unpaywall LookUp.

The program starts but keeps raising exceptions com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "journal_issn_l" (detailed error message below).

```shell
! com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "journal_issn_l" (class com.scienceminer.lookup.data.UnpayWallMetadata), not marked as ignorable (19 known properties: "journal_is_in_doaj", "genre", "oaStatus", "journal_issns", "is_oa", "openAccess", "oa_locations", "data_standard", "journal_name", "title", "updated", "publisher", "year", "doi", "journal_is_oa", "best_oa_location", "doi_url", "published_date", "oa_status"])
! at [Source: (String)"{"doi": "10.1007/bf03160334", "year": 1914, "genre": "journal-article", "is_oa": false, "title": "Barroisia und die Pharetronenfrage", "doi_url": "https://doi.org/10.1007/bf03160334", "updated": "2018-06-17T04:42:28.895386", "oa_status": "closed", "publisher": "Springer Nature", "z_authors": [{"given": "H.", "family": "Rauff"}], "journal_name": "Paläontologische Zeitschrift", "oa_locations": [], "data_standard": 2, "journal_is_oa": false, "journal_issns": "0031-0220", "journal_issn_l": "0031-022"[truncated 90 chars]; line: 1, column: 493] (through reference chain: com.scienceminer.lookup.data.UnpayWallMetadata["journal_issn_l"])
! at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61)
! at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:823)
! at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1153)
! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1589)
! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1567)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:294)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
! at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4013)
! at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3004)
! at com.scienceminer.lookup.reader.UnpayWallReader.fromJson(UnpayWallReader.java:58)
! at com.scienceminer.lookup.reader.UnpayWallReader.lambda$load$1(UnpayWallReader.java:42)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.UnpayWallReader.load(UnpayWallReader.java:41)
! at com.scienceminer.lookup.storage.lookup.OALookup.loadFromFile(OALookup.java:110)
! at com.scienceminer.lookup.command.LoadUnpayWallCommand.run(LoadUnpayWallCommand.java:66)
! at com.scienceminer.lookup.command.LoadUnpayWallCommand.run(LoadUnpayWallCommand.java:22)
! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)
! at io.dropwizard.cli.Cli.run(Cli.java:78)
! at io.dropwizard.Application.run(Application.java:93)
! at com.scienceminer.lookup.web.LookupServiceApplication.main(LookupServiceApplication.java:68)
ERROR [2019-08-21 01:31:36,510] com.scienceminer.lookup.reader.UnpayWallReader: The input line cannot be processed {"doi": "10.3886/icpsr02766", "year": null, "genre": "dataset", "is_oa": false, "title": "Project on Human Development in Chicago Neighborhoods: Community Survey, 1994-1995", "doi_url": "https://doi.org/10.3886/icpsr02766", "updated": "2018-06-18T23:27:05.481519", "oa_status": "closed", "publisher": "Inter-university Consortium for Political and Social Research (ICPSR)", "z_authors": [{"given": "Felton J.", "family": "Earls"}, {"given": "Jeanne", "family": "Brooks-Gunn"}, {"given": "Stephen W.", "family": "Raudenbush"}, {"given": "Robert J.", "family": "Sampson"}], "journal_name": "ICPSR Data Holdings", "oa_locations": [], "data_standard": 2, "journal_is_oa": false, "journal_issns": null, "journal_issn_l": null, "published_date": null, "best_oa_location": null, "journal_is_in_doaj": false, "has_repository_copy": false}
```

How to reproduce the behaviour

java -jar build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar unpaywall --input ~/data/unpaywall_snapshot_2019-08-16T155437.jsonl.gz data/config/config.yml

Note: as you can see, the Unpaywall dataset that I am using is more recent than the one used in the biblio-glutton demo.

Environment

lfoppiano commented 5 years ago

@cverluise thanks for reporting this issue. I now ignore the field journal_issn_l when unmarshalling the JSON file. Commit 768283990d02ff7ba355dc2a1ab4f41f8de97080

I haven't tested it. Could you please check whether the load works?

There might be other fields that need to be changed. If you want to fix it on the fly, you can just add the missing field to the @JsonIgnoreProperties annotation on top of the class where the failure occurs; in the case of Unpaywall, that would be UnpayWallMetadata. Alternatively, just reply here 😉
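As a rough illustration of the on-the-fly fix described above, here is a minimal sketch. The `Record` class is a hypothetical, cut-down stand-in for `UnpayWallMetadata` (the real class has many more properties), and it assumes Jackson is on the classpath. Only the `@JsonIgnoreProperties` annotation and the field names come from this thread.

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

public class IgnoreListedFieldsSketch {

    // Hypothetical stand-in for UnpayWallMetadata: every JSON field that has
    // no matching property must be listed here, otherwise Jackson throws
    // UnrecognizedPropertyException.
    @JsonIgnoreProperties({"z_authors", "journal_issn_l"})
    public static class Record {
        public String doi;
        public String journal_issns;
    }

    // Parses one line of the dump into the reduced Record class.
    public static Record parse(String jsonLine) throws IOException {
        return new ObjectMapper().readValue(jsonLine, Record.class);
    }

    public static void main(String[] args) throws IOException {
        // Fields taken from the record quoted in the error log above.
        String line = "{\"doi\": \"10.3886/icpsr02766\", "
                + "\"journal_issns\": null, \"journal_issn_l\": null}";
        System.out.println(parse(line).doi);
    }
}
```

The drawback, as the rest of the thread shows, is that each new field in a dump needs another entry in the list and a rebuild.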

cverluise commented 5 years ago

Hello @lfoppiano,

thanks for the quick answer.

I just pulled 7682839

```sh
~/biblio-glutton/lookup$ git log
commit 768283990d02ff7ba355dc2a1ab4f41f8de97080 (HEAD -> master, origin/master, origin/HEAD)
Author: lfoppiano
Date:   Wed Aug 21 12:43:07 2019 +0900

    disabling field `journal_issn_l` introduced in the latest unpaidwall dump #37
...
```

However, I still get the same Exception raised.

Q: Should I recompile something so that the new property is properly taken into account?

Thanks !

lfoppiano commented 5 years ago

@cverluise yes, you need to rebuild it

cd lookup
./gradlew clean build

kermitt2 commented 5 years ago

Hello Cyril!

Just wondering, a new snapshot in August has not been announced on the Unpaywall discussion list afaik (latest for me is April), did you get it via another channel ?

cverluise commented 5 years ago

Hello,

thanks!

I recompiled and it worked... until new unrecognized fields appeared: "has_repository_copy" and "repository_institution" (so far).

This is what my JsonIgnoreProperties looks like at the moment

@JsonIgnoreProperties({"z_authors", "x_reported_noncompliant_copies", "x_error", "journal_issn_l", "has_repository_copy", "repository_institution"})

What is strange is that adding "repository_institution" and recompiling (./gradlew clean build from lookup/) did not solve the issue. Note that during building, I get (see full message below):

> Task :compileJava
Note: Some input files use unchecked or unsafe operations.
```sh
~/biblio-glutton/lookup$ ./gradlew clean build --warning-mode all
The AbstractFileCollection.getBuildDependencies() method has been deprecated. This is scheduled to be removed in Gradle 5.0. com.github.jengelman.gradle.plugins.shadow.internal.DependencyFileCollection extends AbstractFileCollection. Do not extend AbstractFileCollection. Use Project.files() instead.

> Task :compileJava
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

> Task :shadowJar
Registering invalid inputs and outputs via TaskInputs and TaskOutputs methods has been deprecated. This is scheduled to be removed in Gradle 5.0. A problem was found with the configuration of task ':shadowJar'.
 - No value has been specified for property 'mainClassName'.

> Task :startShadowScripts
Using TaskInputs.file() with something that doesn't resolve to a File object has been deprecated. This is scheduled to be removed in Gradle 5.0. Use TaskInputs.files() instead.
```

I still have ! com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "repository_institution" ....

```sh
! com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "repository_institution" (class com.scienceminer.lookup.data.OALocation), not marked as ignorable (11 known properties: "license", "best", "evidence", "version", "is_best", "updated", "url_for_pdf", "url_for_landing_page", "url", "host_type", "pmh_id"])
! at [Source: (String)"{"doi": "10.1090/s0002-9939-1994-1211581-x", "year": 1994, "genre": "journal-article", "is_oa": true, "title": "A note on endomorphisms of irrational rotation $C\\sp *$-algebras", "doi_url": "https://doi.org/10.1090/s0002-9939-1994-1211581-x", "updated": "2019-04-22T03:16:09.289207", "oa_status": "bronze", "publisher": "American Mathematical Society (AMS)", "z_authors": [{"given": "Kazunori", "family": "Kodaka"}], "journal_name": "Proceedings of the American Mathematical Society", "oa_locations""[truncated 1277 chars]; line: 1, column: 1030] (through reference chain: com.scienceminer.lookup.data.UnpayWallMetadata["oa_locations"]->java.util.ArrayList[0]->com.scienceminer.lookup.data.OALocation["repository_institution"])
! at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:61)
! at com.fasterxml.jackson.databind.DeserializationContext.handleUnknownProperty(DeserializationContext.java:823)
! at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1153)
! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1589)
! at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownVanilla(BeanDeserializerBase.java:1567)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:294)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
! at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:286)
! at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:245)
! at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:27)
! at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:127)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
! at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
! at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4013)
! at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3004)
! at com.scienceminer.lookup.reader.UnpayWallReader.fromJson(UnpayWallReader.java:58)
! at com.scienceminer.lookup.reader.UnpayWallReader.lambda$load$1(UnpayWallReader.java:42)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.UnpayWallReader.load(UnpayWallReader.java:41)
! at com.scienceminer.lookup.storage.lookup.OALookup.loadFromFile(OALookup.java:110)
! at com.scienceminer.lookup.command.LoadUnpayWallCommand.run(LoadUnpayWallCommand.java:66)
! at com.scienceminer.lookup.command.LoadUnpayWallCommand.run(LoadUnpayWallCommand.java:22)
! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:87)
! at io.dropwizard.cli.Cli.run(Cli.java:78)
! at io.dropwizard.Application.run(Application.java:93)
! at com.scienceminer.lookup.web.LookupServiceApplication.main(LookupServiceApplication.java:68)
```

Any idea?

Thanks!

cverluise commented 5 years ago

Hello Patrice!

> Hello Cyril!
>
> Just wondering, a new snapshot in August has not been announced on the Unpaywall discussion list afaik (latest for me is April), did you get it via another channel ?

I just filled in the form on their website and downloaded the file from the AWS S3 address that Unpaywall sent back. Is that non-standard?

Thanks !

kermitt2 commented 5 years ago

> I just filled in the form on their website and downloaded the file from the AWS S3 address that Unpaywall sent back. Is that non-standard?

Yes, it's standard! Updates were usually announced on the mailing list together with the new S3 link; maybe they will do that in the next few days. Having the new dataset will help us update the parser in biblio-glutton because, as we can see, there might be several other changes.

lfoppiano commented 5 years ago

Dear @cverluise, the error also tells you the class. In the last example the class is different, OALocation:

com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "repository_institution" (class com.scienceminer.lookup.data.OALocation)

The principle is the same but the class is different ;-)
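To make the point about the class concrete, here is a small sketch of reading the offending class and field directly from the exception instead of from the log text. `Record` and `Location` are hypothetical stand-ins for `UnpayWallMetadata` and `OALocation`, and it assumes Jackson is on the classpath. `getReferringClass()` and `getPropertyName()` are accessors on Jackson's `UnrecognizedPropertyException`.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException;
import java.io.IOException;
import java.util.List;

public class WhichClassSketch {

    // Hypothetical stand-in for OALocation: "repository_institution" is unknown here.
    public static class Location {
        public String url;
    }

    // Hypothetical stand-in for UnpayWallMetadata.
    public static class Record {
        public String doi;
        public List<Location> oa_locations;
    }

    // Returns "ClassName.field" for the first unknown field, or null if the
    // line parses cleanly. The class name says where the annotation belongs.
    public static String firstUnknownField(String jsonLine) throws IOException {
        try {
            new ObjectMapper().readValue(jsonLine, Record.class);
            return null;
        } catch (UnrecognizedPropertyException e) {
            return e.getReferringClass().getSimpleName() + "." + e.getPropertyName();
        }
    }

    public static void main(String[] args) throws IOException {
        String line = "{\"doi\": \"10.1090/s0002-9939-1994-1211581-x\", "
                + "\"oa_locations\": [{\"url\": null, \"repository_institution\": null}]}";
        System.out.println(firstUnknownField(line));
    }
}
```

An ignore list on the outer class does not apply to the nested one, which is why adding "repository_institution" to the annotation on `UnpayWallMetadata` had no effect.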

lfoppiano commented 5 years ago

@cverluise I quickly pushed a fix in 3e0943953faa3590829f35c0106fec83ecc1f96f

Have a look. I didn't have time to test it, sorry.

kermitt2 commented 5 years ago

The documentation of the data schema has not been updated for the new snapshot apparently (see http://unpaywall.org/data-format), so we would need one or two examples to see what are the new fields and see what to do with them - ignoring them might not always be the right way to cope with them!

cverluise commented 5 years ago

> The documentation of the data schema has not been updated for the new snapshot apparently (see http://unpaywall.org/data-format), so we would need one or two examples to see what are the new fields and see what to do with them - ignoring them might not always be the right way to cope with them!

Some examples

```json
{"doi": "10.1007/s12414-017-0259-1", "year": 2017, "genre": "journal-article", "is_oa": false, "title": "Just what the doctor ordered*", "issn_l": "0168-9428", "doi_url": "https://doi.org/10.1007/s12414-017-0259-1", "updated": "2018-06-14T21:20:48.756545", "oa_status": "closed", "publisher": "Springer Nature", "z_authors": [{"given": "Cara", "family": "Valk"}], "journal_name": "Bijblijven", "oa_locations": [], "data_standard": 2, "journal_is_oa": false, "journal_issns": "0168-9428,1876-4916", "published_date": "2017-09-13", "best_oa_location": null, "journal_is_in_doaj": false}
```
```json
{"doi": "10.3886/icpsr02766", "year": null, "genre": "dataset", "is_oa": false, "title": "Project on Human Development in Chicago Neighborhoods: Community Survey, 1994-1995", "doi_url": "https://doi.org/10.3886/icpsr02766", "updated": "2018-06-18T23:27:05.481519", "oa_status": "closed", "publisher": "Inter-university Consortium for Political and Social Research (ICPSR)", "z_authors": [{"given": "Felton J.", "family": "Earls"}, {"given": "Jeanne", "family": "Brooks-Gunn"}, {"given": "Stephen W.", "family": "Raudenbush"}, {"given": "Robert J.", "family": "Sampson"}], "journal_name": "ICPSR Data Holdings", "oa_locations": [], "data_standard": 2, "journal_is_oa": false, "journal_issns": null, "journal_issn_l": null, "published_date": null, "best_oa_location": null, "journal_is_in_doaj": false, "has_repository_copy": false}
```
```json
{"doi": "10.1090/s0002-9939-1994-1211581-x", "year": 1994, "genre": "journal-article", "is_oa": true, "title": "A note on endomorphisms of irrational rotation $C\\sp *$-algebras", "doi_url": "https://doi.org/10.1090/s0002-9939-1994-1211581-x", "updated": "2019-04-22T03:16:09.289207", "oa_status": "bronze", "publisher": "American Mathematical Society (AMS)", "z_authors": [{"given": "Kazunori", "family": "Kodaka"}], "journal_name": "Proceedings of the American Mathematical Society", "oa_locations""[truncated 1277 chars]; line: 1, column: 1030] (through reference chain: com.scienceminer.lookup.data.UnpayWallMetadata["oa_locations"]->java.util.ArrayList[0]->com.scienceminer.lookup.data.OALocation["repository_institution"])
```

cverluise commented 5 years ago

Working Lookup configuration with the August Unpaywall Snapshot

Note: will be edited as errors occur

...
@JsonIgnoreProperties({"endpoint_id", "repository_institution"})
...
...
@JsonIgnoreProperties({"z_authors", "x_reported_noncompliant_copies", "x_error", "journal_issn_l", "has_repository_copy", "issn_l"})
...

lfoppiano commented 5 years ago

> The documentation of the data schema has not been updated for the new snapshot apparently (see http://unpaywall.org/data-format), so we would need one or two examples to see what are the new fields and see what to do with them - ignoring them might not always be the right way to cope with them!

I was planning to leave this issue open until the new fields were integrated (if they rename something, we would lose information). I am reverting the change.

@kermitt2 I suggest that we make a stable release and we use master for development (I'm also fine to develop on a separate branch, though)

Aazhar commented 4 years ago

Hello @kermitt2 and @lfoppiano, I'm facing the same problem described here. I filled in the Unpaywall form to get the download link for the snapshot.

kermitt2 commented 4 years ago

Hello @Aazhar ! Which version of the Unpaywall data dump?

Aazhar commented 4 years ago

After filling in the form, I got this dump: unpaywall_snapshot_2019-11-22T074546

kermitt2 commented 4 years ago

gasp this is a new dump ! I didn't see it on the Unpaywall discussion group.

@lfoppiano I have the impression that the Jackson JSON marshalling is way too rigid: any unexpected/new attribute breaks the JSON parsing, while normally JSON is good at being schema-less! Maybe we should simply write a stupid JSON tree parser?
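For illustration, a tree-based reader along the lines suggested above might look like the following. This is only a sketch under assumptions, not what biblio-glutton ended up doing: the class and method names are hypothetical, and it uses Jackson's own tree model (`ObjectMapper.readTree` and `JsonNode`) rather than a hand-written parser.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

public class TreeReadSketch {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Reads only the fields we care about from one line of the dump.
    // Any other attribute in the line is never bound to a class, so a new
    // field cannot trigger UnrecognizedPropertyException.
    public static String[] doiAndStatus(String jsonLine) throws IOException {
        JsonNode root = MAPPER.readTree(jsonLine);
        // path() yields a "missing" node for absent fields; asText(null)
        // then returns null instead of throwing.
        String doi = root.path("doi").asText(null);
        String oaStatus = root.path("oa_status").asText(null);
        return new String[] {doi, oaStatus};
    }

    public static void main(String[] args) throws IOException {
        String line = "{\"doi\": \"10.3886/icpsr02766\", \"oa_status\": \"closed\", "
                + "\"has_repository_copy\": false}";
        String[] fields = doiAndStatus(line);
        System.out.println(fields[0] + " " + fields[1]);
    }
}
```

The trade-off is that renamed or removed fields silently come back as null, so a reader like this would need its own checks for the fields it depends on.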

kermitt2 commented 4 years ago

@Aazhar I pushed a quick fix with 6cece257be034c8334dacc47f34bfab1386aea6b to support this dump version

However, it will surely break again with the next dump, because we can expect new JSON attributes continuously. Let's leave this issue open until the JSON reader becomes robust.

Aazhar commented 4 years ago

great thanks @kermitt2

kermitt2 commented 3 years ago

In the new version, unknown new JSON fields in the Unpaywall dump are ignored by default to avoid this kind of issue.
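The thread does not say how this default was implemented. One common way to get that behaviour with Jackson is to disable `DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES` on the mapper (or, equivalently per class, `@JsonIgnoreProperties(ignoreUnknown = true)`). A minimal sketch, with a hypothetical `Record` class standing in for the real data classes:

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

public class LenientMapperSketch {

    // Hypothetical, cut-down data class; no ignore list is needed.
    public static class Record {
        public String doi;
        public String oa_status;
    }

    // A mapper that skips unknown JSON fields instead of failing on them.
    private static final ObjectMapper MAPPER = new ObjectMapper()
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

    public static Record parse(String jsonLine) throws IOException {
        return MAPPER.readValue(jsonLine, Record.class);
    }

    public static void main(String[] args) throws IOException {
        String line = "{\"doi\": \"10.3886/icpsr02766\", \"oa_status\": \"closed\", "
                + "\"journal_issn_l\": null, \"has_repository_copy\": false}";
        System.out.println(parse(line).oa_status);
    }
}
```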

kermitt2 commented 2 years ago

I am closing this issue since we now allow new unknown fields in unpaywall to avoid this kind of regular ingestion breaking.