gleanerio / gleaner

Gleaner: JSON-LD and structured data on the web harvesting
https://gleaner.io
Apache License 2.0

Gleaner does not fully implement incremental indexing correctly at this time. #44

Open fils opened 3 years ago

fils commented 3 years ago

At present Gleaner is not doing incremental indexing (--mode diff) correctly.

While Gleaner, in this mode, does know not to index URLs previously retrieved (summoned), it currently does not know whether the associated object has since been updated; an updated resource produces a new object named by its updated SHA hash. This can result in extraneous objects in the object store when multiple full indexes are done. These could then be synchronized to the triplestore or other indexes, resulting in extra or older resources in the indexes alongside the newer ones.
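For illustration, a minimal sketch of why an updated resource leaves the older object behind, assuming the object name is derived from a SHA-256 of the retrieved content (the exact hash and naming scheme in Gleaner may differ, and the URL and @id values here are hypothetical):

package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Two versions of the same resource, retrieved from the same URL
	// on two different full index runs.
	v1 := []byte(`{"@id": "https://example.org/ds/1", "name": "Dataset v1"}`)
	v2 := []byte(`{"@id": "https://example.org/ds/1", "name": "Dataset v2"}`)

	// Content-derived object names differ, so the first object is never
	// overwritten and remains in the object store as an extraneous copy.
	fmt.Printf("object name v1: %x\n", sha256.Sum256(v1))
	fmt.Printf("object name v2: %x\n", sha256.Sum256(v2))
}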

Also, the sitegraph workflow, unlike the sitemap workflow, is not yet (I believe) leveraging the --mode diff (i.e., incremental) indexing, so this will need to be added.

One or both of these points is likely the cause of this extra object getting in.

(Screenshot: Screen Shot 2021-11-05 at 10 29 34 AM)

Previous work

Previously, I was leveraging the generated prov records to provide this function via S3Select calls. This had the benefit of using existing tooling, so no additional technical debt. Unfortunately, this approach is far too slow as the resource count begins to grow. For some emerging work with a couple of communities, where the scale will be approaching a million records and more, this was never going to scale well.

Bolt KV store

To address this I have already started to integrate a KV store to hold a record of the previously visited resources. This can then be used to check and skip such records. At this time, that capacity has been merged into the dev (and master) branches and should be working, fully replacing the S3Select based approach.

Note that this does generate a gleaner.db file during the run, so this file should be accounted for in the development of the Docker files.

Note that if this file is lost or removed, all that is lost is the record that supports incremental indexing; no data or other information is affected. One would simply have to do a full index again to rebuild it and restore the capacity for incremental indexing.
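A minimal sketch of the check-and-skip pattern against the Bolt KV store, using go.etcd.io/bbolt; the bucket name and value layout here are assumptions for illustration, not necessarily what the merged code uses:

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// Illustrative bucket name; Gleaner's actual bucket layout may differ.
var summoned = []byte("summoned")

// seen reports whether a sitemap URL has already been summoned, and
// records it (with the object SHA as the value) if it has not.
func seen(db *bolt.DB, url, sha string) (bool, error) {
	found := false
	err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(summoned)
		if err != nil {
			return err
		}
		if b.Get([]byte(url)) != nil {
			found = true
			return nil // already visited: skip in --mode diff
		}
		return b.Put([]byte(url), []byte(sha))
	})
	return found, err
}

func main() {
	// The run produces the gleaner.db file noted above.
	db, err := bolt.Open("gleaner.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	skip, err := seen(db, "https://example.org/sitemap-entry-1", "deadbeef")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("skip:", skip)
}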

Existing limitation to be addressed

The second major issue right now is that the code does not take into account that URLs might be removed. So a "prune" option needs to be added that will look for URLs in the warehouse that are not in the domain's sitemap and remove them, along with the associated object downloaded from that, now deleted, URL.

Regarding sitegraphs

The sitegraph concept can be implemented in the same manner as a sitemap. However, since it is one large file, any change will result in a new hash, so this is a case where the sitegraph approach has operational implications that a more fine-grained approach like sitemaps does not suffer from.

Pattern thoughts

Just a few notes, more for myself, on the steps the code needs to take when running in the full, incremental, and prune approaches.

(inc)
if sitemap url in kv:
    skip
else:
    index
    put sitemap url in kv

(full)
if sitemap url in kv:
    index
    delete old object (previous sha stored in kv)
    replace kv value (new object sha)
else:
    index
    put url in kv

(prune)
if kvurl in sitemap:
    skip
else:
    delete associated object
    delete kvurl entry
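A rough Go sketch of the prune pattern above against the same Bolt KV store (again via go.etcd.io/bbolt); the bucket name, value layout, and the object-store deletion hook are assumptions for illustration, and the sitemap URLs are hypothetical:

package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

// prune removes KV entries whose URLs no longer appear in the domain's
// sitemap; the associated object in the object store would be deleted at
// the marked point (object-store client call omitted here).
func prune(db *bolt.DB, sitemapURLs map[string]bool) error {
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("summoned"))
		if b == nil {
			return nil // nothing recorded yet, nothing to prune
		}
		var stale [][]byte
		c := b.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			if sitemapURLs[string(k)] {
				continue // URL still present in the sitemap: keep it
			}
			// The stored value (v) holds the object SHA; the object it
			// names would be removed from the object store here before
			// dropping the KV entry.
			_ = v
			stale = append(stale, append([]byte(nil), k...)) // copy key
		}
		for _, k := range stale {
			if err := b.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}

func main() {
	db, err := bolt.Open("gleaner.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// URLs currently listed in the domain's sitemap.
	current := map[string]bool{
		"https://example.org/ds/1": true,
		"https://example.org/ds/2": true,
	}
	if err := prune(db, current); err != nil {
		log.Fatal(err)
	}
}

The full mode would follow the same shape: look up the stored SHA for the URL, delete the old object it names, then Put the new SHA as the replacement value.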
fils commented 3 years ago

Created https://github.com/gleanerio/gleaner/tree/df--dev_issue44 for this.