kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Fixed import for 2022 & 2023 + Rewrote indexing application to Go #90

Closed karatekaneen closed 2 months ago

karatekaneen commented 1 year ago

This builds on top of the changes by @lfoppiano in #83. In addition to his changes, I added a small check in the Java code to detect whether the input is JSON or NDJSON; this was needed for the 2023 dump, where the data has a space before the : in each field. I also fixed some paths in the Dockerfile that were a bit off; I think a version change caused that problem.

Also, instead of updating the Node.js-based application used for indexing, I decided to rewrite it in Go to make it easier to run indexing concurrently and to reduce the number of dependencies needed in the final image. However, if you are not running the application in a container, you need to install Go instead of Node.js and then install the application from my repo.

Adding the application to $PATH avoids the issue with relative paths here when running in a container. I have not done any performance comparison between the old and the new way of indexing, but it is currently chewing through about 10k/s on a server with 4 GB RAM and 2 CPU cores (while using only 330 MB of memory). Removing Node as a dependency also reduced our final image size by approx. 150 MB.