vufind-org / vufindharvest

VuFind Harvest Tools
GNU General Public License v2.0
21 stars 8 forks source link

Add stopAfter setting to limit number of records harvested #13

Closed tom-dea closed 1 year ago

tom-dea commented 1 year ago

this is more a sketch. it works for me so far. i added the testing parameter to have the advantage of not having to harvest whole sets in test situations.

demiankatz commented 1 year ago

Thanks, @tom-dea, this is a great start! I think there are two next steps:

1.) We should add test coverage to confirm that the flag does what is expected. It might be possible to copy the existing test case for handling resumption tokens (https://github.com/vufind-org/vufindharvest/blob/dev/tests/unit-tests/src/VuFindTest/OaiPmh/HarvesterTest.php#L305-L339) and adjust the expectations to confirm that the second page does NOT get requested when the flag is set.

2.) It may be worth some discussion/brainstorming about the actual name of the setting. I wonder if something more descriptive would be better, like "ignoreResumptionTokens." I think describing what the setting does, rather than what it is for, might be more clear -- "harvestTestData" could potentially mean a lot of different things, but "ignoreResumptionTokens" is very specific. I'm open to other ideas or further discussion, of course!

tom-dea commented 1 year ago

Thanks, @tom-dea, this is a great start! I think there are two next steps:

1.) We should add test coverage to confirm that the flag does what is expected. It might be possible to copy the existing test case for handling resumption tokens (https://github.com/vufind-org/vufindharvest/blob/dev/tests/unit-tests/src/VuFindTest/OaiPmh/HarvesterTest.php#L305-L339) and adjust the expectations to confirm that the second page does NOT get requested when the flag is set.

2.) It may be worth some discussion/brainstorming about the actual name of the setting. I wonder if something more descriptive would be better, like "ignoreResumptionTokens." I think describing what the setting does, rather than what it is for, might be more clear -- "harvestTestData" could potentially mean a lot of different things, but "ignoreResumptionTokens" is very specific. I'm open to other ideas or further discussion, of course!

1) i fully agree. will do that in one of the next steps. 2) currently the behaviour is not exactly ignoring resumptionTokens. i want to improve that, rename the flag to "ignoreResumptionTokens". good idea. a much clearer name ;-)

i will contact you or comment here as soon as i think that i am done with that

demiankatz commented 1 year ago

Sounds great -- thanks, @tom-dea!

tom-dea commented 1 year ago

i just added a test case by adaption a copy of testListRecordsWithResumption() in https://github.com/vufind-org/vufindharvest/blob/dev/tests/unit-tests/src/VuFindTest/OaiPmh/HarvesterTest.php#L305-L339 the meodld is called testListRecordsWithIgnoreResumptionTokenOption() and is just below. it passes the test run.

i am not totally sure if i understand the testing methods right.

tom-dea commented 1 year ago

@demiankatz just added some sentences to documentation:

; ignoreResumptionTokens may be set to true in order to harvest just the first list ; response from the OAI repository. the resumption tokens will be ignored. this option ; can be used for testing purposes. it allows to harvest smaller data sets.

demiankatz commented 1 year ago

Thanks again, @tom-dea! I've made some minor stylistic edits to your new comment, and I think this is looking good now. I've requested a review from @EreMaijala in case he has any concerns or alternate ideas from his work on RecordManager.

I did think of one possible scenario where something different might be needed for testing: sometimes an OAI source spits out many pages of deleted records before providing any "real" records. In that scenario, this setting would not be helpful. However, that's probably a fairly rare edge case and probably doesn't prevent this setting from being useful. I could just imagine there might be a situation where we would want a "getFirstNonDeletedRecord" setting or something like that. (I'm not asking you to change anything, just brainstorming/thinking ahead).

tom-dea commented 1 year ago

thank you @EreMaijala for your suggestions. this sounds reasonable to me. i will have a look at this. btw i already missed an overall counter of harvested records in the logging of vufind harvester since a long time. maybe this number could also show up in logs.

i just have a look at how to implement your idea and give feedback here...

tom-dea commented 1 year ago

just tried to implement the stopAfter option

demiankatz commented 1 year ago

Thanks, @tom-dea, this looks good to me! I just made a few small edits in commit 4f7c397 since the stopAfter property is no longer nullable, so the type can be plain int (and I figured I'd make a note about the meaning of -1 while I was in the area).

Now the next step is to get the tests passing again -- it looks like the refactoring has broken something. Please let me know if you need any help figuring it out!