ipfs / go-graphsync

Initial Implementation Of GraphSync Wire Protocol
Other
100 stars 38 forks source link

On resume graphsync occasionally enqueues the same outgoing block twice for a single file #158

Open dirkmc opened 3 years ago

dirkmc commented 3 years ago

While running go-data-transfer tests I've noticed that graphsync seems to occasionally enqueue the same outgoing block twice for the same file. This appears to happen when the transfer is paused and then resumed.

This causes an issue with go-data-transfer because the provider calculates the queued bytes by summing the size of all queued blocks. However when a duplicate block is queued, the block doesn't actually get sent across the wire. So the provider may expect payment for a block that hasn't been sent.

The provider uses RegisterOutgoingBlockHook() to watch for enqueued blocks.

I've created a test in go-data-transfer to demonstrate the issue: https://github.com/filecoin-project/go-data-transfer/pull/152

In this case the transfer is paused after the second block is sent. When the transfer is resumed the second block is sent again. See the example below:

$ go test ./impl -run TestGSDoubleSend -count=10 -v
=== RUN   TestGSDoubleSend
=== RUN   TestGSDoubleSend/fast_unseal,_payment_channel_ready
0: QmcJoNHgRhpowes6X3YHxxhhi9Vu1MqYfLKoDGtCWy7eae
1: bafkreiegr5x3na4gsjokgc3oyp6evqr4brnvuuamqv7qf5j55f46afbwa4
2: bafkreifaglzaff5gxfehb7o37c3gpq2yzvcvzi7to4lam6h726p2w5u46y
3: bafkreiblnz4nfgic5qyh7l5tgfx2676fk7ve6vkbhjfoicsqsjo7jbdt5m
4: bafkreicbrli2logxql3h36sgh6wfqouxibsiuzymnwvi2cknxdoc7wtknu
5: bafkreigzpidqubh2afgmapairjdtwhcmh332klpaouasyuurdhwu2txxju
6: bafkreibqsy6pn3y5oy6m2dnhacbimqfx4l6352gz27kikbvtf6kyvjc7ly
7: bafkreifda52jtpkh2bavgdnlit5yeoa3xlj42i5evjgsgefyh6r3mull7m
8: bafkreialq7brtimc3zyrgg753ndukywv5qphu2lkyv63twp5bwncq77z74
9: bafkreig6sewnzawa2yhqfpf54dex7f3wo2m6royoldvwiydrcm2qot233i
10: bafkreihujcyc5tqemqys7wxrgo7n6bsfkswfc6yxw4x6npgb3kxmezb4pe
11: bafkreiahyrgpq52w6ijimwoecbrpn7egx6fxn3wavevb47r7cgayadx7sq
12: bafkreibogblledlbehxp3z25sl7ixlweqg5cirawe6ah5htfj6exiv4rhm
13: bafkreib5kvazffowsdcyqtzifpyq4jaztatodlxmsl7si5622ad2hm4isi
14: bafkreib6bvpds3jyylmfe6b77qd73llqs53enwmjlcedwtpjpr24h4i4qm
15: bafkreififpy27chexnldi2xlpn2ciygbt6oervjuqrvoxsowjfrwsaqxuq
16: bafkreiaqymh43iwxge5lcnv65kj5oxczspwxmp5bude7qqhletvfuufgdu
17: bafkreibuygwyfnws63k6cc4krmi7lhn5zzpqqarx3bj65cruubmwl6p2r4
18: bafkreigblr6bvqcppjsdy4qcuer7o67ol57b7jec642pxgljuemfqzmd64
19: bafkreia3lhfluhxwhnfjmfzvsmjyxdhjb466pep3njg6zcoy57tavgtjvi
20: bafkreign7cs2fgqrufywaniauz43ktvxjgi6t66curvhsfzcguqn2gapoe
  provider completed
  client completed
--- PASS: TestGSDoubleSend (0.02s)
    --- PASS: TestGSDoubleSend/fast_unseal,_payment_channel_ready (0.02s)

=== RUN   TestGSDoubleSend
=== RUN   TestGSDoubleSend/fast_unseal,_payment_channel_ready
0: QmcJoNHgRhpowes6X3YHxxhhi9Vu1MqYfLKoDGtCWy7eae
1: bafkreiegr5x3na4gsjokgc3oyp6evqr4brnvuuamqv7qf5j55f46afbwa4
2: bafkreiegr5x3na4gsjokgc3oyp6evqr4brnvuuamqv7qf5j55f46afbwa4
    Already saw this CID - it was sent in position 1
FAIL    github.com/filecoin-project/go-data-transfer/impl   1.470s
hannahhoward commented 3 years ago

@dirkmc I believe the root cause of this is pause/resume on the requestor side being implemented in a piss poor and unreliable way -- completely my fault, but I'm hoping you might do something for me to confirm: in https://github.com/filecoin-project/go-data-transfer/pull/152, can you remove the code that pauses the client, so it's just the provider pausing, and see if the problem goes away?

If so, we should have a conversation about the best way to implement pause resume on the requestor as its not obvious.

dirkmc commented 3 years ago

@hannahhoward I removed the pausing on the client side, and I was able to run the test 1,000 times in a row with no failures.

Could you explain what the issue is and what you're thinking in terms of a solution?