sul-dlss / wasapi-downloader

Java application to download WARCs from WASAPI

remove crawlStartBefore setting - not implemented #85

Closed jmartin-sul closed 7 years ago

jmartin-sul commented 7 years ago

turns out we don't actually have selection logic to deal with that, seems easier to just remove it (alternative would be to implement in the crawl selector, but that seems like more work, and crawlStartAfter is more important, and already implemented).

ndushay commented 7 years ago

Okay, so in discussion with Nicholas, I realized something:

We use command line args to get a set of (candidate) files. These command line args include crawl-start-after, crawl-start-before, collection, crawl, etc.

We then, for reasons such as weird assumptions in Naomi's head, use the webCrawlSelector ... to further select from the FileSet. Except ... why do we need to do that?

At the moment, we have a "get all crawls after crawl-id" implemented in webCrawlSelector, based on a file list. However, the date args ... they are already used in the FileSet request!

Nicholas isn't convinced that he's likely to be concerned much with crawl ids for this purpose -- he will use dates.

And crawl-start-before is important for embargoing purposes -- when we want to download the crawls that are no longer embargoed ... and not the most recent crawls.
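
To make the embargo case concrete, here is a minimal sketch of passing both date args straight through to the FileSet request as query parameters. The parameter names are the ones listed above; the class name, method name, and the sample values are hypothetical and not taken from the wasapi-downloader code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of wasapi-downloader: shows the date args
// being sent with the FileSet request instead of filtering afterwards.
public class FileSetRequestSketch {

    // Builds the query string for a FileSet request; null values are skipped.
    public static String buildQuery(String collection, String crawlStartAfter, String crawlStartBefore) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("collection", collection);
        params.put("crawl-start-after", crawlStartAfter);
        params.put("crawl-start-before", crawlStartBefore);

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (entry.getValue() != null) {
                // URL-encode values so characters such as ':' in timestamps are safe.
                query.add(entry.getKey() + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
            }
        }
        return query.toString();
    }
}
```

With a crawl-start-before value set to the embargo cutoff, the endpoint itself leaves out the most recent crawls, so no client-side date selection is needed.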

SO: do we take the select-by-crawl-date code out of the CrawlSelector class? Do we remove the CrawlSelector class? Its other use is organizing warc files by crawl id ... arguably useful.
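
For reference, the "organizing warc files by crawl id" part could look roughly like the sketch below. The `WarcFile` type, its fields, and the integer crawl id are assumptions made for illustration; the real CrawlSelector may be shaped quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of grouping WARC files by crawl id; not the actual CrawlSelector.
public class CrawlGroupingSketch {

    // Minimal stand-in for a file entry in a FileSet response (assumed fields).
    public static class WarcFile {
        final String filename;
        final int crawlId;

        public WarcFile(String filename, int crawlId) {
            this.filename = filename;
            this.crawlId = crawlId;
        }
    }

    // Returns filenames grouped by crawl id, with crawl ids in ascending order.
    public static Map<Integer, List<String>> byCrawlId(List<WarcFile> files) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (WarcFile file : files) {
            grouped.computeIfAbsent(file.crawlId, id -> new ArrayList<>()).add(file.filename);
        }
        return grouped;
    }
}
```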

@jmartin-sul @tingulfsen

jmartin-sul commented 7 years ago

i think anything in CrawlSelector that could be implemented by instead just passing the param value to the endpoint should be done that way, unless it'd be a lot of work to rip out the existing selection code. the other stuff does sound useful, and so i'd be inclined to keep the class, so as to keep that stuff. also, like the settings class, it can provide a useful layer of indirection. e.g., we were talking about restricting date validation further, but maybe the thing to do is to take an ISO 8601 date, parse it into a calendar object (which actually happens and gets thrown away for validation), and then output that calendar object in a string format acceptable to the endpoint (e.g., YYYY-MM-DDTHH:MM:SSZ). this would provide arbitrary-ish temporal granularity, at the resolution acceptable by the endpoint.
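
a rough sketch of that parse-then-reformat idea, using java.time rather than a calendar object (a swapped-in choice, not what the settings class currently does). the class and method names are hypothetical, and it assumes the endpoint wants UTC timestamps in the format mentioned above.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical sketch: normalize an ISO 8601 input to one endpoint-friendly format.
public class EndpointDateSketch {

    // Assumed endpoint format: UTC timestamp with a literal trailing 'Z'.
    private static final DateTimeFormatter ENDPOINT_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    // Accepts a date-time with an offset (or 'Z'), or a plain date, which is
    // treated as midnight UTC. Anything else throws DateTimeParseException.
    public static String toEndpointFormat(String iso8601) {
        Instant instant;
        try {
            instant = OffsetDateTime.parse(iso8601).toInstant();
        } catch (DateTimeParseException notADateTime) {
            instant = LocalDate.parse(iso8601).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
        return ENDPOINT_FORMAT.format(instant);
    }
}
```

this keeps validation and formatting in one place: whatever granularity the user gives, the endpoint always sees the same string shape.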

ndushay commented 7 years ago

I just checked - and we conceived of this "selection" happening after asking for the FileSet from our first whiteboarding architecture discussion. It's actually sorta bogus ... because we use the dates to winnow when we request the FileSet. :-P Let's discuss F2F after lunch on Thurs.

ndushay commented 7 years ago

CrawlSelection no longer uses date args, because date filtering happens at FileSet request time. The date args should be left in the command line.

I updated #86 to remove the date arg version of selecting crawls. I think that closes this ticket.