mjordan / riprap

A PREMIS-compliant fixity checking microservice.

What is the best way to "crawl" a repository? #6

Open mjordan opened 6 years ago

mjordan commented 6 years ago

The "fetchresourcelist" plugin gets a list of resource URLs to check. This is just a demo. In production, we'll need a way to check every object in the repository. What's the best way to do this?

Riprap's plugin architecture will allow us to support any of these approaches. What are people's preferences?

whikloj commented 6 years ago

Probably a good idea to move along a pre-configured path. So for Fedora, that could mean following ldp:contains from the root down through the repo.
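
For illustration, a minimal sketch of that crawl in Python (not Riprap's actual plugin code), assuming an LDP server that returns containment triples in Turtle; the Accept header and root URL are placeholders, and binary resources (which would need their RDF descriptions fetched instead) are glossed over:

```python
import requests
from rdflib import Graph, URIRef

LDP_CONTAINS = URIRef("http://www.w3.org/ns/ldp#contains")

def crawl(root_url):
    """Breadth-first walk of the repo, following ldp:contains from the root."""
    queue = [root_url]
    seen = set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        yield url
        resp = requests.get(url, headers={"Accept": "text/turtle"})
        resp.raise_for_status()
        graph = Graph()
        graph.parse(data=resp.text, format="turtle")
        # Enqueue every child this container says it contains.
        for _, _, child in graph.triples((URIRef(url), LDP_CONTAINS, None)):
            queue.append(str(child))
```

One caveat: some servers describe a resource under a slightly different subject URI than the one requested (trailing slashes, etc.), so a production version would need to normalize URIs before the lookup.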

ajs6f commented 6 years ago

That would work for any LDP impl, although some may not guarantee that the whole repo graph is connected (Trellis does not). It is also excellently parallelizable.

whikloj commented 6 years ago

Can you have resources in the repo graph that are not contained? I am learning new things all the time.

But yes, this is definitely an LDP-specific suggestion. Based on the plugin architecture, I think your source plugin could define the method.

So if your source is an AWS filesystem, perhaps there is a REST service you call. If it's a regular filesystem, you scan the directory and generate SHA-1 sums, etc.
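
For the regular-filesystem case, a minimal sketch of what such a source plugin might do (in Python rather than Riprap's own plugin code, with the directory walk and hashing spelled out):

```python
import hashlib
import os

def sha1_of_file(path, chunk_size=8192):
    """Stream the file through SHA-1 so large files never load fully into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scan(root_dir):
    """Walk the directory tree, yielding (path, sha1) for every file found."""
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield path, sha1_of_file(path)
```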

ajs6f commented 6 years ago

Yep, what lets you parallelize fully is the crawling bit; the actual mechanism by which you get the next resource is totally an impl detail. I'm not saying that crawling is the only way to keep fully parallel, just that it's pretty convenient, easy to understand, and there are a variety of off-the-shelf crawlers and web spiders out there for reuse.
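
To make the parallelism point concrete, a sketch of fanning per-resource checks out over a thread pool, assuming you already have a list of URLs from whatever source plugin is in use (the hash-the-body check is a stand-in, not Riprap's actual fixity logic):

```python
import concurrent.futures
import hashlib

import requests

def check_one(url):
    """Fetch a single resource and hash its body (stand-in for a real fixity check)."""
    resp = requests.get(url)
    resp.raise_for_status()
    return url, hashlib.sha1(resp.content).hexdigest()

def check_all(urls, workers=8):
    """Run the per-resource checks across a thread pool and collect the results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check_one, urls))
```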

mjordan commented 6 years ago

As I was saying during the last CLAW call, I had imagined paging through the repository, fixity-checking a page of resources (say 1000) per scheduled job. That's the way 7.x's Checksum Checker works now, and it allows the repo administrator to spread the load of the checks over time. What options are there for paging through the LDP container, e.g. offset/limit pages? Also, in this comment @ajs6f mentions that the community implementation of Fedora currently suffers from the many-members problem. If we page through the container, do we mitigate that issue?
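
A rough sketch of that "one page per scheduled job" pattern, assuming a plain list of resource URLs and a hypothetical state file to persist the offset between runs (this is illustrative, not how Riprap actually schedules anything):

```python
import json
import os

STATE_FILE = "riprap_offset.json"  # hypothetical: remembers where the last run stopped
PAGE_SIZE = 1000

def next_page(resource_urls):
    """Return the next page of resources, advancing a persisted offset each run."""
    offset = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            offset = json.load(f).get("offset", 0)
    page = resource_urls[offset:offset + PAGE_SIZE]
    offset += PAGE_SIZE
    if offset >= len(resource_urls):
        offset = 0  # wrap around; the next scheduled run starts a fresh pass
    with open(STATE_FILE, "w") as f:
        json.dump({"offset": offset}, f)
    return page
```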

@whikloj yes, that's why I built in the plugin architecture: admins can enable the 'fetchresourcelist' plugin that best suits their repo and resources. So we're not tied to one way of doing this.

ajs6f commented 6 years ago

@mjordan Paging. Oh, goodness, gracious, paging. It's hard enough that the LDP Working Group could not get working impls to confirm a recommendation as part of their effort and instead offered a note. I honestly do not know how well implemented it is today. @acoburn, do we have that for Trellis?

Besides paging, though, for this particular application an HTTP response that streams well might be good enough. That's a fairly low bar to set for the backend.
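
For example, consuming a streamed containment response line by line, with no paging at all (a sketch; the container URL is a placeholder, and it assumes the server streams n-triples and honors the standard LDP PreferContainment header):

```python
import requests

headers = {
    "Accept": "application/n-triples",
    "Prefer": 'return=representation; include="http://www.w3.org/ns/ldp#PreferContainment"',
}

with requests.get("http://localhost:8080/container/", headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)  # each n-triples line is one containment triple
```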

mjordan commented 6 years ago

Yes, agreed, paging is a headache. The current plugin is just a proof of concept (it reads resource URLs from a file, and those URLs exist in the bundled mock Fedora-spec-compliant repository), but I'd advocate for making the first production plugin as configure-and-forget as possible for early CLAW implementers. I'm not sure what that plugin would look like, but please keep the ideas coming....

acoburn commented 6 years ago

Trellis does not do paging (nor will it). However... if you'd like to simulate paging, here's what you'd do:

1. Execute a GET request at the root: request application/n-triples and use Prefer headers to fetch only containment triples. Save the output directly to a local file.
2. The client then reads as many lines of that file as desired (e.g. 10 lines or 1,000 lines).
3. Iterate over that file, parsing each triple, and fetch each child resource as above. Save the output to a second local file. Continue to append to this second file.
4. Once the first file has been exhausted, start reading from the second file. Again, as child resources are identified, save those resources to yet another file. Continue this process until there are no more resources.

The big issue with paging is that, for a server to support it, it needs to support it in the "general" case, which is considerably harder than supporting it in the "particular" case. The "general" case needs to handle the possibility of blank nodes, while the "particular" case (described above) can side-step that issue. Plus, by just streaming n-triples to a file (assuming the server streams n-triples, as Trellis does), the queries are lightweight and fast. That's how I would recommend approaching this issue, even when there are very large containers.
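
A minimal Python sketch of the procedure above, assuming the server streams n-triples and honors the LDP PreferContainment header; the "read only N lines at a time" throttling is left out for brevity, and the naive triple parsing works here only because containment objects are always URIs:

```python
import requests

NTRIPLES = "application/n-triples"
PREFER_CONTAINMENT = 'return=representation; include="http://www.w3.org/ns/ldp#PreferContainment"'

def dump_containment(url, out_path):
    """GET one resource's containment triples as n-triples, streamed straight to disk."""
    headers = {"Accept": NTRIPLES, "Prefer": PREFER_CONTAINMENT}
    with requests.get(url, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "w") as out:
            for line in resp.iter_lines(decode_unicode=True):
                if line:
                    out.write(line + "\n")

def children_of(triples_path):
    """Read the ldp:contains objects back out of a dump file, one URL at a time."""
    with open(triples_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3 and "ldp#contains" in parts[1]:
                yield parts[2].strip("<>")

def walk(root_url):
    """Exhaust one level's dump files while writing the next level's, per the steps above."""
    frontier = [root_url]
    level = 0
    while frontier:
        next_frontier = []
        for i, url in enumerate(frontier):
            dump_path = f"level-{level}-{i}.nt"
            dump_containment(url, dump_path)
            next_frontier.extend(children_of(dump_path))
            yield url
        frontier = next_frontier
        level += 1
```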

mjordan commented 6 years ago

@acoburn thanks for the detailed overview of how to simulate paging. Very useful.