miku / metha

Command line OAI-PMH harvester and client with built-in cache.
https://lab.ub.uni-leipzig.de/en/metha/
GNU General Public License v3.0

Selective Harvesting and metha-cat #34

Open tobiasschweizer opened 1 year ago

tobiasschweizer commented 1 year ago

Hi @miku,

We are adding more and more OAI-PMH endpoints and metha does a great job!

I have a question about selective harvesting and metha-cat. I have automated harvesting via crontab. After an initial harvest that gets all records from the earliest day on, we do one selective harvest a week:

metha-sync -T 5m -r 20 -base-dir /mydir -format marcxml https://zenodo.org/oai2d

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume this also includes updates of a record (i.e., the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Thanks and kind regards,

Tobias

miku commented 11 months ago

Sorry for the overly delayed reply.

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Yes.

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume this also includes updates of a record (i.e., the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

Yes.

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Yes, I understand. metha does not do much except cache responses so that subsequent invocations are faster (that's something I haven't seen a lot in other tools). To be on the safe side with respect to updates, one can always delete the cache for a particular endpoint and start anew:

$ rm -rf $(metha-sync -dir http://my.server.org)

That of course requires some tolerance for possibly stale records, depending on the requirements.

tobiasschweizer commented 11 months ago

No problem and thanks for your response. I'll have a closer look at an endpoint's cache where I assume that a lot of updated records flow in.

Otherwise, metha works nicely and is stable :-) it has been part of our automated workflow for a couple of months now.

tobiasschweizer commented 1 week ago

Hi @miku,

We have run into some redundancy trouble related to caching. Deleting the cache obviously gets rid of the issue, but it is not a very good option when dealing with large amounts of data.

Would there be a way to merge updated records into one with metha-cat? So if a record has been published and then updated, could metha-cat simply return the latest version of the record?

I do not know Golang but maybe you could point me to the code where this could possibly happen.

EDIT I now know a little tiny bit of Go ... I think this is where the magic happens:

https://github.com/miku/metha/blob/master/render.go#L65-L84

It just iterates over the list of records in each compressed .xml.gz file and ignores records whose datestamp does not match if from and/or until are set.

Once all the records have been collected, could they be matched by record identifier, taking the latest record if there are several for one identifier?

EDIT 2 I think this is difficult since there is no step that collects all records in memory before writing them to stdout ... On the other hand, collecting everything is probably a bad idea anyway, as there could be several gigabytes of data. Not sure how best to approach this. Some kind of postprocessing, as sketched below?
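EDIT 3 To make the postprocessing idea concrete, here is a rough two-pass sketch over metha-cat's output that avoids holding whole records in memory: the first pass only remembers the latest datestamp per identifier, the second pass re-emits matching records. The <Records> wrapper element and the exact shape of the virtual XML are my assumptions, and deleted records (status="deleted") are not handled:

package main

import (
        "encoding/xml"
        "fmt"
        "io"
        "log"
        "os"
)

// record captures only what we need from an OAI-PMH record: the header
// fields for deduplication and the raw inner XML for re-emitting.
type record struct {
        Header struct {
                Identifier string `xml:"identifier"`
                Datestamp  string `xml:"datestamp"`
        } `xml:"header"`
        Inner []byte `xml:",innerxml"`
}

// scan calls fn for every <record> element found in r, whatever the
// surrounding wrapper elements look like.
func scan(r io.Reader, fn func(rec record) error) error {
        dec := xml.NewDecoder(r)
        for {
                tok, err := dec.Token()
                if err == io.EOF {
                        return nil
                }
                if err != nil {
                        return err
                }
                if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "record" {
                        var rec record
                        if err := dec.DecodeElement(&rec, &se); err != nil {
                                return err
                        }
                        if err := fn(rec); err != nil {
                                return err
                        }
                }
        }
}

func main() {
        name := os.Args[1] // file containing metha-cat output

        // Pass 1: remember the latest datestamp per identifier. OAI-PMH
        // datestamps are ISO 8601 UTC, so string comparison orders them
        // correctly as long as the endpoint uses a single granularity.
        latest := make(map[string]string)
        f, err := os.Open(name)
        if err != nil {
                log.Fatal(err)
        }
        if err := scan(f, func(rec record) error {
                if rec.Header.Datestamp > latest[rec.Header.Identifier] {
                        latest[rec.Header.Identifier] = rec.Header.Datestamp
                }
                return nil
        }); err != nil {
                log.Fatal(err)
        }
        f.Close()

        // Pass 2: re-emit only the latest version of each record.
        f, err = os.Open(name)
        if err != nil {
                log.Fatal(err)
        }
        defer f.Close()
        seen := make(map[string]bool)
        fmt.Println("<Records>")
        if err := scan(f, func(rec record) error {
                id := rec.Header.Identifier
                if rec.Header.Datestamp == latest[id] && !seen[id] {
                        seen[id] = true
                        fmt.Printf("<record>%s</record>\n", rec.Inner)
                }
                return nil
        }); err != nil {
                log.Fatal(err)
        }
        fmt.Println("</Records>")
}

So one could run something like metha-cat http://my.server.org > all.xml and then go run dedup.go all.xml > latest.xml. Only the identifier-to-datestamp map lives in memory, which should stay manageable even for millions of records.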

miku commented 1 week ago

This is a tradeoff: because we store multiple records per file, it is hard to overwrite a particular record. Originally, I opted for the time-"windowed" approach, because requesting single records from an endpoint that emits e.g. a few million records would result in the same number of HTTP requests, and that is somewhat stressful for the server.

One way it could be addressed would be to request many records (in a time window) at once, but then store them individually on disk, so that a record could be overwritten if a new version is found. The next question then would be whether one file per record is the right approach.
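To make that concrete, a minimal sketch of what one-file-per-record storage could look like (the hashing, sharding and directory layout are placeholders, not anything metha does today):

package main

import (
        "crypto/sha1"
        "fmt"
        "os"
        "path/filepath"
)

// recordPath maps a record identifier to a stable file path. Sharding by
// hash prefix keeps individual directories from growing too large.
func recordPath(baseDir, identifier string) string {
        sum := fmt.Sprintf("%x", sha1.Sum([]byte(identifier)))
        return filepath.Join(baseDir, sum[:2], sum[2:4], sum+".xml")
}

// storeRecord writes one record payload. os.WriteFile truncates an existing
// file, so a newer version of a record simply replaces the older one.
func storeRecord(baseDir, identifier string, payload []byte) error {
        p := recordPath(baseDir, identifier)
        if err := os.MkdirAll(filepath.Dir(p), 0755); err != nil {
                return err
        }
        return os.WriteFile(p, payload, 0644)
}

func main() {
        id := "oai:zenodo.org:12345" // hypothetical identifier
        _ = storeRecord("cache", id, []byte("<record>version 1</record>"))
        _ = storeRecord("cache", id, []byte("<record>version 2</record>"))
        fmt.Println(recordPath("cache", id)) // only version 2 remains on disk
}

Whether millions of small files are acceptable is exactly the open question, but once the path is derived from the identifier alone, overwriting an updated record becomes trivial.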

For the time being, rerunning from scratch is probably the simplest, albeit crude, approach.

tobiasschweizer commented 3 days ago

Thanks for the explanation.

One way it could be addressed would be to request many records (in a time window) at once, but then store them individually on disk, so that a record could be overwritten if a new version is found. The next question then would be whether one file per record is the right approach.

So metha-sync would make sure that exactly one representation / file is stored per record (even if this record has been updated). Then the virtual XML produced by metha-cat would already be free of duplicates.

Looking at the current behaviour, metha-sync creates gzip files with multiple records in them, organised by publication date. So working on that would affect both metha-sync and metha-cat (render.go). Obviously, the new version would not be compatible with old caches unless there were some kind of migration assistant.

What else would be affected? I could offer my support in working on that. I do not know Golang but at least I managed to run it from the CLI ...

My motivation: I think incremental harvesting is the one thing OAI-PMH is great at, and it would be a pity to give that up.