metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

Switch OAI-PMH library in metafacture-biblio #360

Closed (fsteeg closed this 1 year ago)

fsteeg commented 3 years ago

In metafacture-biblio, we depend on org.dspace:oclc-harvester2:0.1.12 (see details).

It's the only version of the OCLC harvester published to Central (see https://mvnrepository.com/artifact/org.dspace/oclc-harvester2). There is a GitHub repo at https://github.com/OCLC-Research/oaiharvester2 which contains a slightly newer version, but that version is not published to Central.

We came across an issue in the library while using it from OERSI, caused by a call in HarvesterVerb, resulting in duplicate logging output (see workaround). With our current setup, we have no way to properly fix issues like this. We should either depend on the OCLC harvester in a way that allows us to make changes to the code, or switch to a new library.

The OCLC harvester is used in a lot of projects on GitHub, many of which incorporate the code into their repos. The newest maintained version of the original OCLC code seems to be in the oai-harvest-manager repo: https://github.com/clarin-eric/oai-harvest-manager/tree/master/src/main/java/ORG/oclc/oai/harvester2/verb. That repo, however, is not published to Central either.

One option would be to set up a fork of the original OCLC repo that publishes to Central via GitHub Actions. This would already give us the possibility to make changes to the code. We could also ask the oai-harvest-manager folks to contribute their version to that repo.

Another option would be to switch to a different library, like XOAI, which is published to Central.

Discussed with @dr0i: as a first step, we should have a look at XOAI to see if that works for us.

fsteeg commented 1 year ago

It's the only version of the OCLC harvester published to Central (see https://mvnrepository.com/artifact/org.dspace/oclc-harvester2).

When revisiting this, I saw there is now a 1.0.0 release, published on Aug 5, 2022 from https://github.com/DSpace/oclc-harvester2. Yay, thanks @tdonohue! I'll update the dependency and assign @dr0i in the PR.
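
For reference, the dependency bump could look roughly like the sketch below. Only the coordinates (org.dspace:oclc-harvester2, versions 0.1.12 and 1.0.0) come from this thread; that the dependency is declared in a Gradle build file for metafacture-biblio, and that it uses the `implementation` configuration, are assumptions.

```groovy
// Sketch of a build.gradle fragment for metafacture-biblio.
// Assumption: the harvester is declared with the 'implementation' configuration.
dependencies {
    // before: implementation 'org.dspace:oclc-harvester2:0.1.12'
    implementation 'org.dspace:oclc-harvester2:1.0.0'
}
```

Since this is a jump from 0.1.12 to 1.0.0, the harvesting code in metafacture-biblio should be checked against the new release before merging.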