kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Fix import for Crossref format 2022 #83

Closed: lfoppiano closed 7 months ago

lfoppiano commented 2 years ago

This PR attempts to fix the import of the 2022 Crossref file:

karatekaneen commented 1 year ago

@lfoppiano I can have a go at fixing the JS for you. My download is not finished yet, but can you describe what the problem is so I can try to fix it?

lfoppiano commented 1 year ago

@karatekaneen basically, supporting the 2022 format. Unlike the 2021 dump, it is provided "pretty printed", so it cannot be parsed line by line as was done before (see #78). Please don't quote me on that: I don't remember the details 100%, because the JS and Java parsers worked differently.

karatekaneen commented 1 year ago

OK, I'll have a look when my download has finished and I find some time.

karatekaneen commented 1 year ago

@lfoppiano I started looking at the JavaScript app but then quickly gave up 😆. I decided to rewrite it from scratch in Go to make it easier to read, more performant, and easier to test.

I added tests to make sure that both the 2021 and 2022 dumps work, as well as the incremental dumps currently in use. I don't have access to any of the Crossref premium files, so I could not add tests for them, but I'm happy to do so if a sample can be provided.

There's also currently a problem (no issue has been opened for it, as far as I can see) where the relative path to the indexing folder differs depending on whether you run it as a regular application or in a Docker container. By switching to a static binary we can get rid of all the Node dependencies and completely remove the issue with relative paths, as long as we put the indexing app inside $PATH.
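As a rough illustration of the $PATH idea, a caller could resolve the indexer by name instead of by a path relative to the working directory. This is only a sketch: the binary name `crossrefindexer` and the helper `findIndexer` are assumptions made for the example, not something defined in this thread.

```go
package main

import (
	"fmt"
	"os/exec"
)

// indexerBinary is a hypothetical executable name used for illustration only.
const indexerBinary = "crossrefindexer"

// findIndexer resolves name through $PATH using exec.LookPath, so the
// caller does not depend on the current working directory.
func findIndexer(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("indexer %q not found on $PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := findIndexer(indexerBinary)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("would run:", path)
}
```

The lookup behaves the same way inside and outside a container, provided the binary is installed in a directory listed in $PATH in both environments.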

Have a look here (https://github.com/karatekaneen/crossrefindexer) and let me know if anything more is wanted. Otherwise I'll try to switch out Node in the next couple of days and test it in our environment before opening a PR to replace it.

lfoppiano commented 7 months ago

Closing, as this is already integrated in #92.