Closed by LexLuthr 11 months ago
CC: @davidd8 @TorfinnOlsen for visibility
Can you point me to a description of what needs to be present in the logs, or the API you want emulated?
Is this something that the index-provider can provide, or does the ingestion need to be verified against one or more indexers?
Can this be queried using the ipni-cli utility? All information on the provider's ad chain, as well as the indexer's ingestion of it, is available through ipni-cli.
When sync was done using Graphsync (the default), it would emit events such as new request, progress, and complete. We could save these events in a DB and display them as logs in the UI. Screenshot: https://github-production-user-asset-6210df.s3.amazonaws.com/88259624/264365716-47f22eae-0b0b-46c9-aab1-84a6e592d476.png
After switching to IPNI sync, these events no longer exist, so it becomes hard for an SP to tell whether their Boost is syncing with indexers. We want an API that we can query to get similar events. We don't necessarily need to use the same event emitter.
I am not sure that ipni-sync events will exactly translate to what graphsync was providing (or if that was even that accurate when used with IPNI).
For example, ipni makes a separate HTTP request for each advertisement in the chain of ads, until it gets to one it has already processed. After the ads are synced, ipni makes a separate request for each entry chunk in each ad, if that ad has multihash chunks. If ipni already has the ad in its CAR mirror, then a request for the ad's entry chunks is not sent to the provider. So, if trying to show ingestion in progress, the provider may not have a great picture overall.
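To make the request pattern described above concrete, here is a minimal Python sketch that lists the requests a provider would see for one sync. It is a simulation only, not ipni code: `Ad`, `plan_sync`, and the `processed` and `mirrored` sets are hypothetical names, and the order in which unprocessed ads have their entries fetched (oldest first) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Ad:
    cid: str
    previous: Optional[str]  # CID of the previous ad in the chain, None at the start
    entry_chunks: List[str] = field(default_factory=list)  # CIDs of multihash entry chunks


def plan_sync(
    head: str,
    ads: Dict[str, Ad],
    processed: Set[str],
    mirrored: Set[str],
) -> List[Tuple[str, str]]:
    """Return the (kind, cid) requests the provider would receive for one sync."""
    requests: List[Tuple[str, str]] = []
    unprocessed: List[Ad] = []

    # Phase 1: one request per ad, walking back until an already processed ad.
    cur: Optional[str] = head
    while cur is not None and cur not in processed:
        requests.append(("ad", cur))
        ad = ads[cur]
        unprocessed.append(ad)
        cur = ad.previous

    # Phase 2: one request per entry chunk, oldest ad first (assumed order).
    for ad in reversed(unprocessed):
        if ad.cid in mirrored:
            continue  # entries come from the CAR mirror, the provider sees nothing
        for chunk in ad.entry_chunks:
            requests.append(("entries", chunk))
    return requests
```

In this model the provider sees ad requests for every unprocessed ad but no entries requests at all for a mirrored ad, which is why provider-side logs alone can understate ingestion progress.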
To observe ingestion of the data in a single advertisement, the index-provider could log an event for each multihash block. To make that useful, the index-provider would need to know 1) which ad each block is associated with, and 2) the total number of blocks in the ad, so that an indication of progress could be given.
Point 1) may be difficult, since the requests for multihash blocks are completely separate from requests for ads and may come at completely separate times. That means the index-provider would need to keep some database of block CIDs mapped to their ad CID. Point 2) may not be practical because it requires reading the advertisement's entries blocks to count them, and it depends on 1).
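The bookkeeping that points 1) and 2) would require can be sketched as below. This is an in-memory illustration in Python, not index-provider code: `EntryProgress` and its methods are hypothetical names, and a real implementation would presumably need a persistent store rather than dictionaries.

```python
from typing import Dict, List, Optional, Set, Tuple


class EntryProgress:
    """Tracks which entries blocks of each ad have been requested so far."""

    def __init__(self) -> None:
        self._block_to_ad: Dict[str, str] = {}   # point 1): block CID -> ad CID
        self._total: Dict[str, int] = {}         # point 2): ad CID -> block count
        self._served: Dict[str, Set[str]] = {}   # ad CID -> blocks requested so far

    def register_ad(self, ad_cid: str, block_cids: List[str]) -> None:
        # Requires reading all of the ad's entries blocks up front to count them.
        self._total[ad_cid] = len(block_cids)
        self._served[ad_cid] = set()
        for block_cid in block_cids:
            self._block_to_ad[block_cid] = ad_cid

    def record_request(self, block_cid: str) -> Optional[Tuple[str, int, int]]:
        """Return (ad_cid, blocks_requested, blocks_total), or None if unknown."""
        ad_cid = self._block_to_ad.get(block_cid)
        if ad_cid is None:
            return None
        self._served[ad_cid].add(block_cid)  # a set, so repeat requests count once
        return ad_cid, len(self._served[ad_cid]), self._total[ad_cid]
```

The sketch shows the cost being discussed: `register_ad` has to enumerate every block before any progress can be reported, and the block-to-ad map grows with the total number of entries blocks published.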
Would it be useful to log the time that a request was received for an advertisement or entries data? If so, how will the index-provider know whether the requester is an indexer, or just some other utility crawling the provider's ad chain? Does the index-provider know what the source address for the indexer is? If this is coming over libp2p then we could probably look at the peerID, but not if this is over plain HTTP.
If the goal is just to show when an indexer is making requests, that may be easy to do, but probably the best indication of indexing activity is the provider information that the GUI is already pulling from /providers/<provider_id>. Watching for changes in the LastAdvertisement field is obviously useful. Also, the Lag field tells whether indexing is actively occurring, and the number indicates how many ads are left to process. The LastError and LastErrorTime fields may be useful to indicate if and when there was a problem that is blocking indexing.
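As a rough illustration of how those fields could be interpreted, the Python sketch below classifies two successive provider-info snapshots. The field names are the ones mentioned above; everything else is an assumption: the snapshots are taken to be already-decoded JSON objects, `LastAdvertisement` is assumed to be either a plain string or a `{"/": cid}` link, a non-empty `LastError` is assumed to mean indexing is blocked, and `indexing_state` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def ad_cid(value: Any) -> Optional[str]:
    """Normalize LastAdvertisement, assumed to be a string or a {"/": cid} link."""
    if isinstance(value, dict):
        return value.get("/")
    return value


def indexing_state(prev: Optional[Dict[str, Any]], cur: Dict[str, Any]) -> str:
    """Classify the current snapshot relative to the previous poll (if any)."""
    if cur.get("LastError"):
        return "error"     # assumption: a non-empty LastError means indexing is blocked
    if int(cur.get("Lag") or 0) > 0:
        return "indexing"  # ads are still left to process
    if prev is not None and ad_cid(prev.get("LastAdvertisement")) != ad_cid(
        cur.get("LastAdvertisement")
    ):
        return "advanced"  # the indexer moved to a newer ad since the last poll
    return "idle"
```

A UI could poll the provider information periodically and feed consecutive snapshots through a function like this to decide which status to show.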
I think we can get last error and time from indexer side. It should be fairly easy to display. For the sync itself and tracking the lag, I have some questions.
ipni-cli does exactly this when getting the provider information with the --distance flag. I think it would be better to implement this internally within Boost and show the lag based on the latest value we get from cid.contact. The final form would look something like:
Latest ad on indexer: baga.... (0 ads behind)
Last sync error on indexer: ""/err (x time ago)
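A minimal Python sketch of rendering that status text follows. It is illustrative only and not Boost code: `format_status` and `ago` are hypothetical helpers, and the inputs (latest ad CID, lag, last error, and last error time) are assumed to have been fetched from the indexer already.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def ago(delta: timedelta) -> str:
    """Render a duration as a coarse 'x time ago' string."""
    seconds = max(int(delta.total_seconds()), 0)
    for unit, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds >= size:
            return f"{seconds // size}{unit} ago"
    return f"{seconds}s ago"


def format_status(
    last_ad: str,
    lag: int,
    last_error: str,
    last_error_time: Optional[datetime],
    now: datetime,
) -> List[str]:
    """Build the two status lines proposed above."""
    noun = "ad" if lag == 1 else "ads"
    lines = [f"Latest ad on indexer: {last_ad} ({lag} {noun} behind)"]
    if not last_error:
        lines.append('Last sync error on indexer: ""')
    elif last_error_time is None:
        lines.append(f"Last sync error on indexer: {last_error}")
    else:
        lines.append(
            f"Last sync error on indexer: {last_error} ({ago(now - last_error_time)})"
        )
    return lines
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the formatting deterministic and easy to test.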
Not needed. Implementation will be within boost, and will rely on information from ipni-cli and cid.contact.
We are replacing Graphsync with HTTP-libp2p. Boost tracks all the retrievals for IPNI and presents them in a UI page. This helps SPs get an overview of their IPNI sync. We need something similar for the new HTTP-libp2p sync to avoid a feature regression in Boost.