ipni / index-provider

📢 Index Provider
Other
35 stars 16 forks source link

Intermittent failure to retrieve advertised content #49

Closed gammazero closed 2 years ago

gammazero commented 2 years ago

Sometimes after publishing an advertisement, when the indexer asks the reference provider for the content of that advertisement, the reference provider cannot find the content. It seems like there is some delay before it is available from the linksystem cache.

This can be forced to work by re-importing the car file and making the reference provider re-publish its advertisement, which triggers the indexer to ask for content again… and it usually works after one or two retries. Of course, this is not acceptable.

This problem appears to be on the reference provider side, because it appears that the indexer is asking for exactly the same content, in exactly the same way/order for both the success and failure case.

gammazero commented 2 years ago

To provide more context, here are reference provider logs for the not working and working cases:

Not Working log:

11:01:29.333    INFO    adminserver     http/connect_handler.go:45      Connected successfully to peer
11:01:32.643    INFO    adminserver     http/importcar_handler.go:17    Received import CAR request
11:01:32.643    INFO    adminserver     http/importcar_handler.go:38    Storing CAR and generating key
11:01:32.653    INFO    provider/engine engine/engine.go:236    Generating linked list of CIDs for advertisement
11:01:32.660    INFO    provider/engine engine/linksystem.go:203        Generated linked chunks of CIDs {"cids": 1048, "chunks": 11}
11:01:32.663    INFO    provider/engine engine/engine.go:125    Storing advertisement locally   {"cid": "baguqeeqq26crh3rkcekcmkzohgfdcdbihe"}
11:01:32.664    INFO    provider/engine engine/engine.go:142    Publishing advertisement in pubsub channel      {"cid": "baguqeeqq26crh3rkcekcmkzohgfdcdbihe"}
11:01:32.664    INFO    adminserver     http/importcar_handler.go:50    Stored CAR      {"path": "testdata/sample-wrapped-v2.car", "key": "dGVzdGRhdGEvc2FtcGxlLXdyYXBwZWQtdjIuY2Fy47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="}
11:01:32.667    INFO    dt-impl impl/events.go:298      received new channel request from 12D3KooWD1XypSuBmhebQcvq7Sf1XJZ1hKSfYCED4w6eyxhzwqnV
11:01:32.671    INFO    provider/engine engine/linksystem.go:50 Retrieved advertisement from datastore  {"cid": "baguqeeqq26crh3rkcekcmkzohgfdcdbihe", "size": 756}
11:01:32.674    INFO    provider/engine engine/linksystem.go:90 Entry for CID is not cached, generating chunks  {"cid": "baguqeeqqei3djgxzg5urb5kjq77r27vboe"}
11:01:32.683    INFO    provider/engine engine/linksystem.go:203        Generated linked chunks of CIDs {"cids": 1049, "chunks": 11}
11:01:32.683    ERROR   provider/engine engine/linksystem.go:133        No object has been found in linksystem for CID (baguqeeqqei3djgxzg5urb5kjq77r27vboe)
11:01:32.683    ERROR   gs-traversal    runtraversal/runtraversal.go:43 failed to load link=baguqeeqqei3djgxzg5urb5kjq77r27vboe, nBlocksRead=1, err=datastore: key not found
11:01:32.683    WARN    dt-impl impl/events.go:278      data transfer channel 12D3KooWD1XypSuBmhebQcvq7Sf1XJZ1hKSfYCED4w6eyxhzwqnV-12D3KooWFg8NQzaVPf6Q1gTfas2aDMRwxXZjhzRkB2GEYNuw1Q2n-1633629686867138746 failed to transfer data: graphsync response to peer 12D3KooWD1XypSuBmhebQcvq7Sf1XJZ1hK
SfYCED4w6eyxhzwqnV did not complete: response status code RequestCompletedPartial

Working log:

11:06:22.794    INFO    adminserver     http/connect_handler.go:45      Connected successfully to peer
11:06:26.436    INFO    adminserver     http/importcar_handler.go:17    Received import CAR request
11:06:26.436    INFO    adminserver     http/importcar_handler.go:38    Storing CAR and generating key
11:06:26.446    INFO    provider/engine engine/engine.go:236    Generating linked list of CIDs for advertisement
11:06:26.453    INFO    provider/engine engine/linksystem.go:203        Generated linked chunks of CIDs {"cids": 1048, "chunks": 11}
11:06:26.457    INFO    provider/engine engine/engine.go:125    Storing advertisement locally   {"cid": "baguqeeqqgn26gypmqhyw7lohhbgscgpdr4"}
11:06:26.458    INFO    provider/engine engine/engine.go:142    Publishing advertisement in pubsub channel      {"cid": "baguqeeqqgn26gypmqhyw7lohhbgscgpdr4"}
11:06:26.459    INFO    adminserver     http/importcar_handler.go:50    Stored CAR      {"path": "testdata/sample-wrapped-v2.car", "key": "dGVzdGRhdGEvc2FtcGxlLXdyYXBwZWQtdjIuY2Fy47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="}
11:06:26.462    INFO    dt-impl impl/events.go:298      received new channel request from 12D3KooWD1XypSuBmhebQcvq7Sf1XJZ1hKSfYCED4w6eyxhzwqnV
11:06:26.465    INFO    provider/engine engine/linksystem.go:50 Retrieved advertisement from datastore  {"cid": "baguqeeqqgn26gypmqhyw7lohhbgscgpdr4", "size": 756}
11:06:26.467    INFO    provider/engine engine/linksystem.go:90 Entry for CID is not cached, generating chunks  {"cid": "baguqeeqqmsri646icaeuflhiibylzipply"}
11:06:26.474    INFO    provider/engine engine/linksystem.go:203        Generated linked chunks of CIDs {"cids": 1048, "chunks": 11}
11:06:26.477    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqxam5jxkb3a6av3i7qtss7hbk2y"}
11:06:26.482    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqdmlv2xq5jruhfsc27duxdum2pm"}
11:06:26.488    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqq4mfgm3grdzli42dyj23txli3fe"}
11:06:26.494    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqivt7xo73f5rue5e4w26fh7f3tq"}
11:06:26.499    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqalndlorhxthvwmqd2ozzl3iqpy"}
11:06:26.504    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqpdxyhiuwcyvrqzivye5ce67zdm"}
11:06:26.509    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqgditcfpyw2hzg77rc7feiizpdu"}
11:06:26.514    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqdeuud3pco53wphpafmrhhux5gq"}
11:06:26.519    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqpehvzcslaz4whgiglbtdgdxy7a"}
11:06:26.524    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqq7ygeyjv3adgjwkobmlzydhxyr4"}
11:06:26.529    INFO    provider/engine engine/linksystem.go:50 Retrieved advertisement from datastore  {"cid": "baguqeeqqupgqzbuxjb3riyh66hqs5wobhy", "size": 671}
11:06:26.535    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqmsri646icaeuflhiibylzipply"}
11:06:26.535    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqxam5jxkb3a6av3i7qtss7hbk2y"}
11:06:26.537    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqdmlv2xq5jruhfsc27duxdum2pm"}
11:06:26.538    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqq4mfgm3grdzli42dyj23txli3fe"}
11:06:26.539    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqivt7xo73f5rue5e4w26fh7f3tq"}
11:06:26.540    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqalndlorhxthvwmqd2ozzl3iqpy"}
11:06:26.541    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqpdxyhiuwcyvrqzivye5ce67zdm"}
11:06:26.542    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqgditcfpyw2hzg77rc7feiizpdu"}
11:06:26.543    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqdeuud3pco53wphpafmrhhux5gq"}
11:06:26.543    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqqpehvzcslaz4whgiglbtdgdxy7a"}
11:06:26.544    INFO    provider/engine engine/linksystem.go:120        Found cache entry for CID       {"cid": "baguqeeqq7ygeyjv3adgjwkobmlzydhxyr4"}
11:06:26.545    INFO    dt-impl impl/events.go:249      channel 12D3KooWD1XypSuBmhebQcvq7Sf1XJZ1hKSfYCED4w6eyxhzwqnV-12D3KooWFg8NQzaVPf6Q1gTfas2aDMRwxXZjhzRkB2GEYNuw1Q2n-1633629979902082980: sending completion message to initiator
gammazero commented 2 years ago

I can put a retry loop in the StorageReadOpener function returned by mkLinkSystem(), here, and eventually, it will always work after a seemingly random number of retries. The retry loop MUST include generateChunks or it never works.

The retries are not affected by time. In other words, having a delay between retries does not change the behavior. To make it work, simply retrying regardless of any delay is what is needed.

gammazero commented 2 years ago

With a retry loop, it works as soon as a different CID is tried:

14:51:03.907    INFO    provider/engine engine/linksystem.go:52 Retrieved advertisement from datastore  {"cid": "baguqeeqqal4zylyvyfcjpuakzpbprydzei", "size": 671}
14:51:03.909    DEBUG   provider/engine engine/linksystem.go:76 Checking cache for datacidbaguqeeqqp2qzk5xwnl6zf6cxb6nno4amri
14:51:03.909    INFO    provider/engine engine/linksystem.go:95 Entry for CID is not cached, generating chunks  {"cid": "baguqeeqqp2qzk5xwnl6zf6cxb6nno4amri"}
14:51:03.916    INFO    provider/engine engine/linksystem.go:216        Generated linked chunks of multihashes  {"multihashes": 1048, "chunks": 11}
14:51:03.916    ERROR   provider/engine engine/linksystem.go:142        No obj in linksystem, retrying
14:51:08.917    DEBUG   provider/engine engine/linksystem.go:76 Checking cache for datacidbaguqeeqqp2qzk5xwnl6zf6cxb6nno4amri
14:51:08.917    INFO    provider/engine engine/linksystem.go:95 Entry for CID is not cached, generating chunks  {"cid": "baguqeeqqp2qzk5xwnl6zf6cxb6nno4amri"}
14:51:08.925    INFO    provider/engine engine/linksystem.go:216        Generated linked chunks of multihashes  {"multihashes": 1048, "chunks": 11}
14:51:08.925    ERROR   provider/engine engine/linksystem.go:142        No obj in linksystem, retrying
14:51:13.925    DEBUG   provider/engine engine/linksystem.go:76 Checking cache for datacidbaguqeeqqp2qzk5xwnl6zf6cxb6nno4amri
14:51:13.925    INFO    provider/engine engine/linksystem.go:95 Entry for CID is not cached, generating chunks  {"cid": "baguqeeqqp2qzk5xwnl6zf6cxb6nno4amri"}
14:51:13.933    INFO    provider/engine engine/linksystem.go:216        Generated linked chunks of multihashes  {"multihashes": 1049, "chunks": 11}
14:51:13.936    DEBUG   provider/engine engine/linksystem.go:76 Checking cache for datacidbaguqeeqqxam5jxkb3a6av3i7qtss7hbk2y
14:51:13.936    INFO    provider/engine engine/linksystem.go:127        Found cache entry for CID       {"cid": "baguqeeqqxam5jxkb3a6av3i7qtss7hbk2y"}

Notice as soon as it stops looking for datacidbaguqeeqqp2qzk5xwnl6zf6cxb6nno4amri and starts looking for datacidbaguqeeqqxam5jxkb3a6av3i7qtss7hbk2y it works.

gammazero commented 2 years ago

With some more debug logging in the chunk creation, we can see where that other CID comes from:

07:07.206    DEBUG   provider/engine engine/linksystem.go:200  Created chunk link "baguqeeqq7ygeyjv3adgjwkobmlzydhxyr4"
07:07.206    DEBUG   provider/engine engine/linksystem.go:200  Created chunk link "baguqeeqqpehvzcslaz4whgiglbtdgdxy7a"
...
07:07.208    DEBUG   provider/engine engine/linksystem.go:200  Created chunk link "baguqeeqqxam5jxkb3a6av3i7qtss7hbk2y"
07:07.208    DEBUG   provider/engine engine/linksystem.go:215  Created chunk link "baguqeeqq4525wyma2ntqtehyarcmnitp3q"
07:07.208    INFO    provider/engine engine/linksystem.go:218  Generated linked chunks of multihashes  {"multihashes": 1049, "chunks": 11}
07:07.220    INFO    provider/engine engine/linksystem.go:76   Checking cache for datacidbaguqeeqq4525wyma2ntqtehyarcmnitp3q
07:07.220    INFO    provider/engine engine/linksystem.go:95   Entry for CID is not cached, generating chunks  {"cid": "baguqeeqq4525wyma2ntqtehyarcmnitp3q"}

The above repeats some random number of times.

... Then at some point after link generation, the cache is checked for the second to the last link, and everything appears to work.

07:07.222    DEBUG   provider/engine engine/linksystem.go:200  Created chunk link "baguqeeqq7ygeyjv3adgjwkobmlzydhxyr4"
07:07.222    DEBUG   provider/engine engine/linksystem.go:200  Created chunk link "baguqeeqqpehvzcslaz4whgiglbtdgdxy7a"

07:07.226    DEBUG   provider/engine engine/linksystem.go:200  Created chunk link "baguqeeqqxam5jxkb3a6av3i7qtss7hbk2y"
07:07.226    DEBUG   provider/engine engine/linksystem.go:215  Created chunk link "baguqeeqq4525wyma2ntqtehyarcmnitp3q"
07:07.226    INFO    provider/engine engine/linksystem.go:218  Generated linked chunks of multihashes  {"multihashes": 1049, "chunks": 11}
07:07.229    INFO    provider/engine engine/linksystem.go:76   Checking cache for datacidbaguqeeqqxam5jxkb3a6av3i7qtss7hbk2y
07:07.229    INFO    provider/engine engine/linksystem.go:127  Found cache entry for CID       {"cid": "baguqeeqqxam5jxkb3a6av3i7qtss7hbk2y"}

This seems like two problems:

  1. The last link is not saved in the IPLD linksystem datastore.
  2. The advertisement does not consistently get the last link.
gammazero commented 2 years ago

What is happening is that sometimes, but not always, the CAR iterator, used by the car_suppier is returning multihashes in a different order for the same car file.

I put a debug print here

fmt.Println("mh from CAR:", mh.B58String())

Two separate reads of the same CAR file generated a slightly different order of the last handful of multihashes. The different order causes the chunk CID to be different, resulting in different runs of CAR chunking having different IPLD links. Here is what the difference looks like; only showing the last 7 multihashes since everything else (1000+ multihashes) is the same: Iteration 1:

mh from CAR: 2DrjgbMSZJkSSM3pNf9ZzS5NCWSTsYHLn8b4A7QjRJEy54hqjn
mh from CAR: 15vj7gjjx5PBq2VFzahy
mh from CAR: 122uRfmEvYNy1J7SXwR
mh from CAR: 13aabCA6uJRydQkM
mh from CAR: 13aabCA6uJS8Vz4K
mh from CAR: 1DUhBwk6syC9SnaVX
mh from CAR: 1DUhBwk6syCVEL1wo

Iteration 2:

mh from CAR: 2DrjgbMSZJkSSM3pNf9ZzS5NCWSTsYHLn8b4A7QjRJEy54hqjn
mh from CAR: 13aabCA6uJRydQkM
mh from CAR: 13aabCA6uJS8Vz4K
mh from CAR: 1DUhBwk6syC9SnaVX
mh from CAR: 1DUhBwk6syCVEL1wo
mh from CAR: 15vj7gjjx5PBq2VFzahy
mh from CAR: 122uRfmEvYNy1J7SXwR

The solution is for the CAR iteration results to be sorted.

willscott commented 2 years ago

it's because of https://github.com/ipld/go-car/blob/master/v2/index/mhindexsorted.go#L20 https://github.com/ipld/go-car/blob/master/v2/index/mhindexsorted.go#L160

iterating over a golang map isn't guaranteed stable. fix should be straight forward cc @masih

willscott commented 2 years ago

https://github.com/ipld/go-car/pull/258

gammazero commented 2 years ago

Fixed by ipld/go-car#258 and https://github.com/filecoin-project/indexer-reference-provider/commits/main