anatoly-scherbakov opened 4 years ago
This is a fair thing to consider. So far, we were thinking along the lines of this problem of batch replication being solved independently by the IPFS community (and there has been some progress in this direction), instead of every application inventing the same solution. That said, perhaps a replay-time CLI flag can be introduced to spin a detached thread which can traverse all the index records and pull/pin records locally, if they are not present already. Another option would be to add UI elements in the Admin interface to perform selective or batch pinning of records. Most of our tests so far were performed on data present in the local IPFS store, but there are tickets to allow attaching the replay to any IPFS node or simply relying on the global resolver (the latter is only suitable for replay, not for indexing).
> a replay-time CLI flag can be introduced to spin a detached thread which can traverse all the index records and pull/pin records locally
I would suggest a separate CLI command, something like `ipwb pin myfile.cdxj` or `ipwb pin QmReQCtRpmEhdWZVLhoE3e8bqreD8G3avGpVfcLD7r4K6W`. This command does not need to launch Flask. It only has to fetch the CDXJ file from a local or remote source and pin every hash from it.
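A minimal sketch of what the core of such a subcommand could look like, assuming the ipwb CDXJ convention of one JSON block per line whose `locator` field has the form `urn:ipfs/<header-digest>/<payload-digest>`, and assuming the `ipfshttpclient` library for talking to the local daemon (all names here are illustrative, not an existing ipwb API):

```python
import json

import ipfshttpclient  # assumed IPFS client library


def pin_cdxj(path):
    """Pin every IPFS hash referenced by a CDXJ index file."""
    with ipfshttpclient.connect() as client:  # defaults to the local daemon
        with open(path) as index:
            for line in index:
                if line.startswith('!') or not line.strip():
                    continue  # skip metadata and blank lines
                _surt, _datetime, json_block = line.split(' ', 2)
                locator = json.loads(json_block).get('locator', '')
                if locator.startswith('urn:ipfs/'):
                    # Each locator carries two digests: header and payload.
                    for digest in locator[len('urn:ipfs/'):].split('/'):
                        client.pin.add(digest)
```

Invocation could then be as simple as `ipwb pin myfile.cdxj`; the Qm-hash variant would first fetch the CDXJ from IPFS before iterating over it.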
I can imagine running this command on a VPS which I create specifically to broadcast my archives to the network at any moment, regardless of whether my home machine is available.
That was one of my motivations to propose factoring out of `index.py` the routines that work with the CDXJ file itself. It could provide a Pythonic interface to those files that every other component of the system could rely upon, for the sake of DRY.
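One possible shape for such an interface (a sketch only; these names are not existing ipwb API):

```python
import json
from dataclasses import dataclass
from typing import Iterator


@dataclass
class CdxjRecord:
    surt_uri: str   # canonicalized (SURT) original URI
    datetime: str   # 14-digit capture timestamp
    data: dict      # the JSON block (locator, mime_type, status_code, ...)

    @classmethod
    def from_line(cls, line: str) -> 'CdxjRecord':
        surt_uri, datetime, json_block = line.strip().split(' ', 2)
        return cls(surt_uri, datetime, json.loads(json_block))


def iter_records(path: str) -> Iterator[CdxjRecord]:
    """Yield index records, skipping '!' metadata lines."""
    with open(path) as index:
        for line in index:
            if line.startswith('!') or not line.strip():
                continue
            yield CdxjRecord.from_line(line)
```

With something like this in place, the `pin` subcommand above reduces to a loop over `iter_records`, and no other component needs to re-implement its own line parsing.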
In that case, we really do not need any additional `ipwb` command for this. We can simply use something like `awk` to parse the index file and spit out all the IPFS hashes on STDOUT, which can then be supplied to an IPFS client CLI. This problem can simply be solved by adding instructions in the documentation.
That may be the case, but:
- I am used to writing bash, Python, or GNU make command-line utilities to automate these tasks, and I can assure you this helps a lot and saves a lot of time and effort. The mental overhead should be solved once and for all, and it is much easier to write `ipwb pin QmHash` than to copy-paste a lengthy command, especially if you're a newbie user, and especially when you want to run this in an automated environment.
- When you have Redis or some other backend instead of a CDXJ file sitting on your disk (which is imaginable in production), parsing CDXJ files with `awk` can no longer work. `ipwb pin`, however, can, because from the core code's point of view it makes no difference whether the data is in a CDXJ file on local disk or has been downloaded into Redis from a file sitting on IPFS (see the sketch after this list).
- And especially when your server is running in an environment like AWS Lambda and the data itself is in a shared Redis cache or some other database. I am not sure `awk` is even shipped in that environment.
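To make the backend argument concrete, here is a minimal sketch (hypothetical class names, and an assumed layout where index lines sit in a Redis list) of how `ipwb pin` could stay agnostic of where the index lives, using redis-py for the second source:

```python
import redis  # redis-py; the key layout below is hypothetical


class FileIndexSource:
    """Serve CDXJ lines from a file on local disk."""

    def __init__(self, path):
        self.path = path

    def records(self):
        with open(self.path) as index:
            yield from index


class RedisIndexSource:
    """Serve CDXJ lines from a Redis list; same interface as above."""

    def __init__(self, url, key):
        self.client = redis.Redis.from_url(url)
        self.key = key

    def records(self):
        for raw in self.client.lrange(self.key, 0, -1):
            yield raw.decode('utf-8')
```

The pinning loop sketched earlier would then accept any object with a `records()` method, so an `awk`-style pipeline is replaced by code that does not care whether the lines come from disk, Redis, or a CDXJ fetched from IPFS.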
Sure, there are good reasons to add a utility sub-command, but I was suggesting a quick workaround even before we add any additional code to the repo.
I can do that if we are to create an issue for the purpose (or use this one); I would need pinning for my purposes anyway.
I like this idea, and involving ipwb in the workflow might make this a simple, intuitive process for users who don't want to futz with long command-line arguments. ipwb, for the most part, does not explicitly interact with pinning payloads in IPFS, just the implicit pinning that comes with adding content to IPFS, as exhibited by the indexer.
This could provide an interesting demo of supplementing one's local "archive" with the payloads from hashes of another. For example, if userA has a CDXJ index of their captures and userB happens to have an index referencing an embedded resource that gives a more temporally coherent composite representation to userA's captures, userA adding references to userB's captures following the explicit pinning procedure would allow userA to observe a different composite memento when replaying their own index. This use case would hinge on userB's CDXJ (as pinned through `ipwb pin userB.cdxj`) being retained and used within userA's replay.
Following userA pinning userB's CDXJ index, should userB's CDXJ somehow be retained for use per above? If not, it would result in metadata being lost and userA pinning data that is unusable in their local ipwb replay system.
> would allow userA to observe a different composite memento when replaying their own index
I must confess I do not understand this point.
> it would result in metadata being lost

But the CDXJ indexes are separate files pinned under separate hashes, as are the resources themselves. How can the data be lost if all of these are immutable?
> If the resources are identical
I did not mean to imply that the resources are identical, just that they may have the same original URI (URI-R). Thus, Bob's representation of the HTTP response for the URI would have a different IPFS hash than Alice's representation of the HTTP response.
For example, Alice's index has an entry for the URI https://example.com (U1), which, when dereferenced, contains `<img src="https://example.com/photo.jpg">` (the src value is U2). U1 and U2 were likely not captured at the same time, so there exists a temporal difference of ΔtA between these two URIs, even within Alice's CDXJ. ΔtA could be the result of the archival crawler prioritizing other resources to preserve. How temporally close the captures that make up the composite memento (the archived web page with all of its embedded resources displayed) are to one another constitutes its temporal coherence.
When all resources are dereferenced, because of ΔtA and all of the Δt's between the base representation (the HTML page) and each respective embedded resource, it can occur that Bob's representation of U2 has a Δt < ΔtA.
Hm. So there are two timelines of archiving.
alice.cdxj:
- 01:00 UTC | https://example.com
- 01:10 UTC | https://example.com/photo.jpg

bob.cdxj:
- 01:10 UTC | https://example.com
- 01:15 UTC | https://example.com/photo.jpg
Now, do you mean that when Bob replays https://example.com looking at the moment of 01:10 UTC, the timelines will be merged and he will see Alice's fetched image instead of the image he would have expected to see, because Alice's time of archiving the image is closer to the moment Bob is replaying?
Does this happen because usually people are replaying a whole directory of CDXJ files instead of one file, and a normal practice is to merge CDXJ files?
Crawling happens per host, I assume. Can Alice and Bob both create a random unique ID, like a UUID4, every time they set out to archive https://example.com, and use that ID to tag every record in the index? The replay system would then know to prefer records tagged with the same crawl ID over those tagged with a different one.
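A sketch of what such a preference could look like at resolution time, assuming a hypothetical `crawl_id` field in each record's JSON block and the `CdxjRecord` shape sketched earlier:

```python
def pick_record(candidates, root_crawl_id, target_datetime):
    """Prefer captures from the same crawl as the root page; otherwise
    fall back to the temporally closest capture among all candidates."""
    same_crawl = [c for c in candidates
                  if c.data.get('crawl_id') == root_crawl_id]
    pool = same_crawl or candidates
    # Naive closeness on 14-digit timestamps; fine for illustration only.
    return min(pool, key=lambda c: abs(int(c.datetime) - int(target_datetime)))
```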
Yes, were they merged, the current logic for resolution is temporal without regard to source, so Bob/Alice viewing https://example.com at 01:10 UTC would get Bob's HTML and Alice's embedded JPG.
The practice of most web archives (I believe) is to use solely their own set of WARCs for replay, with an index they themselves generate. I am unaware of other systems (beyond ipwb) that work on a set of indexes (e.g., a collection of CDXJ files) to replay WARCs beyond their own control.
The crawl policy can vary. For example, if crawling is configured with priority to breadth, a crawler might grab all the HTML and other HTML pages before fetching embedded resources. This might result in more coverage (viewable HTML pages), with the downside that the embedded resources might have changed or no longer be present by the time they move from the frontier (the list of URIs to be preserved) to the horizon (the point at which a URI is preserved). So, example.com might link to foo.com, which links to bar.com. A crawler might try to capture these URIs in order: example.com, foo.com, bar.com, example.com/photo.jpg. I am unaware of whether this is typical, but old captures of HTML pages without preserved images mean that this behavior (or similar) has been exhibited in practice.
Metadata like crawl source can be present in the WARC, but that is a feature of the crawler itself and not guaranteed. I am unsure whether most crawlers give a unique identifier to the crawl instance/source. We could attribute metadata to the source of indexing, and thus to the archive, in the JSON block within the CDXJ, which would allow the client to, for example, give precedence to their own captures when replaying composite mementos, but that can quickly get complicated.
@machawk1 the potential temporal inconsistency issue you are describing is a result of index merging, not of IPFS record pinning. The replay system will request embedded resources based on the closest matching records in the index, irrespective of whether the corresponding data is pinned/cached in the local/primary IPFS store. If an IPFS record is locally available, it does not automatically jump into the replay until requested, and if a record is locally missing, that does not turn it into do-not-replay-and-fall-back-to-another-match; the record will instead be looked up from peers, or it will fail to resolve.
I see. These strategies can probably be made configurable. One possible way to do it would be to add a hash of the source WARC file into every line of the CDXJ index (just like you mentioned) when `ipwb index`'ing it.
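For illustration, such an entry might look like this (the `warc_source` field name is hypothetical, and the digests are placeholders):

```
com,example)/ 20200101011000 {"locator": "urn:ipfs/<header-digest>/<payload-digest>", "mime_type": "text/html", "status_code": "200", "warc_source": "sha256:<hash-of-source-warc>"}
```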
@ibnesayeed raises a valid point; pinning only means that a certain set of files will be served from your local IPFS node and will not be deleted by the IPFS garbage collector. That ensures persistence; otherwise, the files can easily be lost.
Source locking would lose many advantages of merging collections from different sources to enrich the archive and patch pages with missing embedded resources. That said, I remember discussing a new model of archival replay system built on a resource dependency graph index, which would allow replay of pages with prespecified versions of embedded resources unless explicitly updated. That model could be implemented in IPWB someday, after giving it more thought, but it would be a big change in the system.
> Source locking would lose many advantages of merging collections from different sources
I agree with this. The origin of the payload ought to be agnostic of the method or creator, though there could be a case for maintaining provenance to ensure that the composite memento you view is made up solely of your own captures. There could be a need for this, or a desire to be liberal and assemble a composite memento solely from the temporally closest embedded resources.
> I remember discussing a new model of archival replay system
This sounds related to your Web Bundling work, @ibnesayeed.
> pinning only means that a certain set of files will be served from your local IPFS node and will not be deleted by the IPFS garbage collector
Right, but if a user explicitly pins the captures from an external CDXJ but does not have references to them in their own CDXJ, it seems that their local IPFS node could accumulate a lot of garbage without any basis for its significance, despite the payloads still being accessible.
> There could be a need for this, or a desire to be liberal and assemble a composite memento solely from the temporally closest embedded resources.
That's the kind of gap that a transactional, on-demand, per-page archiving model serves well; archive.today would be a good example. Crawler-based archives, on the other hand, capture atomic resources with the help of a frontier queue and a recently-seen list to minimize repeated downloading of shared resources. To leverage crawl-based archiving while supporting more coherent and fixity-preserving replay, I proposed the dependency-graph-style indexing mentioned above.
> This sounds related to your Web Bundling work,
Yes, that was the context when I discussed it with @phonedude, but the model can be devised to work in a non-bundled environment too.
> Right, but if a user explicitly pins the captures from an external CDXJ but does not have references to them in their own CDXJ, it seems that their local IPFS node could accumulate a lot of garbage without any basis for its significance, despite the payloads still being accessible.
Resources in an IPFS store are unaware of their application. Pin management is something one can perform independently by identifying the resources they do not want, so those can be unpinned and garbage collected. Even at the time of pinning, you may want to add optional filters so that you do not pin every hash in every index you have, if that is a concern. However, the simplified assumption would be to pin everything that is present in any of your indexes, in case it is needed. If you have something in your index, it means you are willing to replay it.
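Such a filter could be layered onto the pinning sketch above without complicating the default pin-everything behavior. A hedged sketch, assuming the same `locator` convention and a `mime_type` field in the JSON block:

```python
import json


def pin_cdxj_filtered(path, client, mime_prefix=None):
    """Pin records from a CDXJ index, optionally restricted to MIME types
    matching a prefix (e.g. 'image/'); pin everything by default."""
    with open(path) as index:
        for line in index:
            if line.startswith('!') or not line.strip():
                continue  # skip metadata and blank lines
            fields = json.loads(line.split(' ', 2)[2])
            if mime_prefix and not fields.get('mime_type', '').startswith(mime_prefix):
                continue
            locator = fields.get('locator', '')
            if locator.startswith('urn:ipfs/'):
                for digest in locator[len('urn:ipfs/'):].split('/'):
                    client.pin.add(digest)
```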
I believe #637 is a prerequisite to implement this.
To ensure constant availability of every file loaded into IPFS from a WARC archive, I would like to pin those files. I can see this can be rather straightforward: I only have to parse the CDXJ file and pin every hash from it, but that seems tedious and requires extra code.
Would it be possible instead to add not one file but a whole directory with all the files from the archive plus, say, `index.cdxj` to provide navigation and metadata? Thus, every node that wishes to provide persistence to the data in question would only have to pin the directory itself.
Would you mind sharing your view on this?
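For reference, a sketch of this directory-based approach with `ipfshttpclient`, assuming a hypothetical `archive/` directory that holds the extracted records alongside `index.cdxj`:

```python
import ipfshttpclient  # assumed IPFS client library

with ipfshttpclient.connect() as client:
    # Adding recursively wraps the directory in a single DAG; the directory
    # entry itself comes last in the returned list.
    entries = client.add('archive', recursive=True)
    root = entries[-1]['Hash']
    # One pin on the root transitively protects every file beneath it.
    client.pin.add(root)
    print(f'Any node can persist the archive via: ipfs pin add {root}')
```

Any other node could then keep the whole archive alive by pinning just that one root hash.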