At present Gleaner is not doing incremental indexing (--mode diff) correctly.
While Gleaner does know not to index URLs previously retrieved (summoned) in this mode, it currently does not know if the associated object has been updated at the source and hence would get a new object based on an updated SHA hash. This can result in extraneous objects in the object store when multiple full indexes are done. These could then get synchronized to the triplestore or other indexes, resulting in extra or older resources in the indexes alongside the newer ones.
Also, the sitegraph workflow, unlike the sitemap workflow, is not yet (I believe) leveraging the --mode diff (i.e., incremental) indexing, so this will need to be added.
One or both of these points is likely the cause of these extra objects getting in.
Previous work
Previously, I was leveraging the generated prov records to address this function via S3Select calls. This had the benefit of using existing tooling, so no additional technical debt. Unfortunately, this approach is far too slow as the resource count begins to grow. For some emerging work with a couple of communities, where the scale will be approaching a million records and more, it was never going to scale well.
Bolt KV store
To address this, I have already started to integrate a KV store to hold a record of the previously visited resources. This can then be used to check for and skip such records. At this time, that capacity has been merged into the dev (and master) branches and should be working, fully replacing the S3Select based approach.
Note that this does generate a gleaner.db file during the run, so this file should be accounted for in the development of the Docker files.
Note that if this file is lost or removed, all that is lost is the record that supports incremental indexing; no data or other information is lost. One would simply have to do a full index again to rebuild it and restore the capacity for incremental indexing.
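As a rough illustration of the KV usage (a minimal sketch only, assuming a bbolt bucket named "summoned" keyed by URL with the object SHA as the value; this is not the actual Gleaner code):

    package main

    import (
        "fmt"
        "log"

        bolt "go.etcd.io/bbolt"
    )

    // checkAndRecord reports whether the URL can be skipped (already summoned
    // and the object SHA is unchanged). Otherwise it records or updates the
    // entry so the resource is (re)indexed. Bucket and key layout are assumptions.
    func checkAndRecord(db *bolt.DB, url, sha string) (skip bool, err error) {
        err = db.Update(func(tx *bolt.Tx) error {
            b, err := tx.CreateBucketIfNotExists([]byte("summoned"))
            if err != nil {
                return err
            }
            if prev := b.Get([]byte(url)); prev != nil && string(prev) == sha {
                skip = true // previously summoned and object unchanged
                return nil
            }
            // New URL, or the object SHA changed: store the new SHA and index.
            return b.Put([]byte(url), []byte(sha))
        })
        return skip, err
    }

    func main() {
        db, err := bolt.Open("gleaner.db", 0600, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        skip, err := checkAndRecord(db, "https://example.org/dataset/1.json", "abc123")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("skip:", skip)
    }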
Existing limitation to be addressed
The second major issue right now is that the code does not take into account that URLs might be removed. So a "prune" option needs to be added that will look for URLs in the warehouse that are not in the domain's sitemap and remove them, along with the associated object that was downloaded from that, now deleted, URL.
Code to do this prune operation is the next development work, and several support functions for it are already done.
Regarding sitegraphs
The sitegraph concept can be implemented in the same manner as a sitemap. However, since it is one large file, any change will result in a new hash, and so this is a case where the sitegraph approach does have some operational implications that a more fine-grained approach like sitemaps does not suffer from.
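To illustrate the point (a small sketch, not Gleaner code): hashing the whole sitegraph file means a single edited record changes the digest for the entire graph, so a file-level diff sees the whole graph as updated, whereas a sitemap lets each resource carry its own hash.

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    func main() {
        graphA := []byte(`{"@graph": [{"@id": "ex:1"}, {"@id": "ex:2"}]}`)
        graphB := []byte(`{"@graph": [{"@id": "ex:1"}, {"@id": "ex:2-updated"}]}`)

        // One edited record changes the digest of the whole sitegraph file.
        fmt.Printf("%x\n", sha256.Sum256(graphA))
        fmt.Printf("%x\n", sha256.Sum256(graphB))
    }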
Pattern thoughts
Just a few notes, more for myself, on the steps the code needs to take when running in the full, incremental, and prune approaches.
(inc)
  if sitemap url in kv:
    skip
  else:
    index
    put sitemap url in kv

(full)
  if sitemap url in kv:
    index
    delete old kv value (object sha)
    replace kv value (object sha)
  else:
    index
    put url in kv

(prune)
  if kv url in sitemap:
    skip
  else:
    delete associated object
    delete kv url entry
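A rough Go sketch of the prune pass, assuming the gleaner.db bbolt file and "summoned" bucket used above; the object-store delete is only a placeholder, since the actual removal call depends on how the Gleaner MinIO client is wired in:

    package main

    import (
        "fmt"
        "log"

        bolt "go.etcd.io/bbolt"
    )

    // prune walks the KV record of summoned URLs and removes any entry that is
    // no longer present in the current sitemap, along with its associated object.
    func prune(db *bolt.DB, sitemap map[string]bool) error {
        var stale []string

        // First pass: collect KV URLs that the sitemap no longer lists.
        err := db.View(func(tx *bolt.Tx) error {
            b := tx.Bucket([]byte("summoned"))
            if b == nil {
                return nil // nothing recorded yet
            }
            return b.ForEach(func(k, _ []byte) error {
                if !sitemap[string(k)] {
                    stale = append(stale, string(k))
                }
                return nil
            })
        })
        if err != nil {
            return err
        }

        // Second pass: delete the associated object and the KV entry.
        return db.Update(func(tx *bolt.Tx) error {
            b := tx.Bucket([]byte("summoned"))
            for _, url := range stale {
                deleteObject(url) // placeholder for the object store removal
                if err := b.Delete([]byte(url)); err != nil {
                    return err
                }
            }
            return nil
        })
    }

    func deleteObject(url string) {
        // Placeholder: here the object downloaded from this URL would be
        // removed from the object store (e.g. via the MinIO client).
        fmt.Println("would remove object for", url)
    }

    func main() {
        db, err := bolt.Open("gleaner.db", 0600, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        sitemap := map[string]bool{"https://example.org/dataset/1.json": true}
        if err := prune(db, sitemap); err != nil {
            log.Fatal(err)
        }
    }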