ipfs / go-graphsync

Initial Implementation Of GraphSync Wire Protocol

GraphSync: MQ attemptSendAndRecovery retries have malformed length #90

Closed danny-wu closed 3 years ago

danny-wu commented 4 years ago

Issue

I am encountering stuck Filecoin deal transfers, as well as I/O timeouts from dealbot. Upon further investigation, it looks like the resends from GraphSync's attemptSendAndRecovery have malformed lengths and continuously fail.

See below for the miner log. Note the Graphsync ReceiveError: stream reset error, followed by exactly 20 Graphsync ReceiveError: short buffer errors.

2020-09-01T10:16:50.552Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventOpen", "proposal CID": "bafyreihib4js3vrpfrifibg2r76jkc5q4zjof4iyj2lwgxig2a5ihuvmau", "state": "StorageDealValidating", "message": ""}
2020-09-01T10:16:50.555Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreihib4js3vrpfrifibg2r76jkc5q4zjof4iyj2lwgxig2a5ihuvmau", "state": "StorageDealAcceptWait", "message": ""}
2020-09-01T10:16:50.587Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDataRequested", "proposal CID": "bafyreihib4js3vrpfrifibg2r76jkc5q4zjof4iyj2lwgxig2a5ihuvmau", "state": "StorageDealWaitingForData", "message": ""}
2020-09-01T10:16:50.828Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreihib4js3vrpfrifibg2r76jkc5q4zjof4iyj2lwgxig2a5ihuvmau", "state": "StorageDealTransferring", "message": ""}
2020-09-01T10:17:35.394Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: stream reset
2020-09-01T10:17:48.289Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:15.300Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:16.048Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:29.050Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:30.301Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:39.540Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:40.028Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:48.125Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:52.574Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:18:56.703Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:02.134Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:02.479Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:06.699Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:07.052Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:09.380Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:11.110Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:11.720Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:13.617Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:13.963Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer
2020-09-01T10:19:15.838Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: short buffer

Given that sendMessage will issue a reset and then attempt to retry 10 times (and 20 is a multiple of 10), it looks like some condition has caused incorrect buffer lengths to be repeatedly sent to the receiver for both messages. The retries all fail with this error, and the connection is terminated.
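
For illustration only, here is a rough sketch of the reset-and-retry pattern I am describing; the names are mine, not the actual go-graphsync identifiers. If the underlying fault persists, each retry surfaces as another ReceiveError on the remote peer:

    package sketch

    import (
        "context"
        "fmt"
        "io"
    )

    // sendWithRecovery is an illustrative stand-in for attemptSendAndRecovery: on
    // a failed write it closes/resets the stream, reopens it, and tries again, up
    // to maxRetries times. Ten retries for each of two queued messages would
    // account for the 20 "short buffer" lines in the log above.
    func sendWithRecovery(ctx context.Context, openStream func(context.Context) (io.WriteCloser, error), msg []byte) error {
        const maxRetries = 10
        var lastErr error
        for attempt := 0; attempt < maxRetries; attempt++ {
            s, err := openStream(ctx) // reopen a fresh stream after each failure
            if err != nil {
                lastErr = err
                continue
            }
            if _, werr := s.Write(msg); werr != nil {
                lastErr = werr
                s.Close() // the real code resets the stream here
                continue
            }
            return s.Close()
        }
        return fmt.Errorf("send failed after %d attempts: %w", maxRetries, lastErr)
    }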

Please note the timestamps -- the errors span 1 minute and 40 seconds.

Ruling out my internet connection

I have set up a remote connectivity tester, which opens a socket to my miner's port every minute. Logs are available here: http://139.180.162.246/readlog.php
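
For reference, the check it performs each minute amounts to something like this Go sketch (the address is a placeholder):

    package sketch

    import (
        "log"
        "net"
        "time"
    )

    // probe dials a plain TCP socket to the miner's public port once a minute and
    // logs whether the connection could be established.
    func probe(addr string) {
        for range time.Tick(time.Minute) {
            conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
            if err != nil {
                log.Printf("dial %s failed: %v", addr, err)
                continue
            }
            conn.Close()
            log.Printf("dial %s ok", addr)
        }
    }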

Given that I am able to connect every minute and the short buffer errors span more than 1 minute, I think I can rule out my internet connection being down or messing up for the entire period of 1 min 40 seconds.

Furthermore, while anecdotal, I have no issues transferring terabyte files over the public internet via SCP, over days, without any interruptions, broken pipes, or resets.

My continuous SSH connections, from my laptop to the miner machine, have not been broken for more than 72 hours since I opened them.

welcome[bot] commented 4 years ago

Thank you for submitting your first issue to this repository! A maintainer will be here shortly to triage and review. In the meantime, please double-check that you have provided all the necessary information to make this process easy! Any information that can help save additional round trips is useful! We currently aim to give initial feedback within two business days. If this does not happen, feel free to leave a comment. Please keep an eye on how this issue will be labeled, as labels give an overview of priorities, assignments and additional actions requested by the maintainers.

Finally, remember to use https://discuss.ipfs.io if you just need general support.

hannahhoward commented 4 years ago

So one thing that is confusing is that the message you are seeing is not coming from the send-message code; it's coming from ReceiveError, which is triggered in [network/libp2p_impl.go](https://github.com/ipfs/go-graphsync/blob/master/network/libp2p_impl.go#L145), which is part of the code that handles an INCOMING connection rather than an outgoing message. HandleNewStream should not be triggered when you open a connection on your side -- only when someone opens a connection to you (correct, @raulk / @aarshkshah1992?)

Can I get some more background on where this came up? Were you receiving or sending piece data? I have in fact seen this issue come up before -- it happens when the incoming message is larger than network.MaxMessageSize -- however, I thought we'd resolved it with the code we now have in response sending (https://github.com/ipfs/go-graphsync/blob/master/responsemanager/peerresponsemanager/peerresponsesender.go#L368). One way this could get triggered is if the block data is particularly large -- but that probably shouldn't happen with the default chunking strategies Lotus uses for filesystem data.

danny-wu commented 4 years ago

I have received this error while the Dealbot was trying to make a storage deal with me.

Basically I was at the receiving end of the storage deal, and I think I was being sent the 128 MiB pieces. So receiving.

I, along with other users, have noticed a weird tendency for this to happen around 5120 'size' (bytes I presume?) received.

danny-wu commented 4 years ago

They are all receiving channels:

miner@miner0:~$ lotus-miner data-transfers list
Sending Channels

ID  Status  Sending To  Root Cid  Initiated?  Transferred  Voucher  

Receiving Channels

ID     Status   Receiving From  Root Cid     Initiated?  Transferred  Voucher                                   
23579  Ongoing  ...774A4Hyd     ...lkvf4sfo  N           2102362      ...r76jkc5q4zjof4iyj2lwgxig2a5ihuvmau"}}  
24128  Ongoing  ...774A4Hyd     ...rets7dse  N           5210         ...qzywc56y7i3mfqb6kwz6ifhmcwzdnqqzzq"}}  
24685  Ongoing  ...774A4Hyd     ...vddz375c  N           2102362      ...blvfrvzveqyb2ovfkmpmxb72sqkaidrgui"}}  
25648  Ongoing  ...774A4Hyd     ...pfmkukvo  N           12588122     ...ldtnj6pdsy4ykhbizpao3vd3sddmkn2ala"}}  
26783  Ongoing  ...774A4Hyd     ...5myhxrpk  N           5210         ...6iqmn6ibkw2dum6iaq7otyeoeq5ejtcxuu"}}  
27922  Ongoing  ...774A4Hyd     ...mkjjsvh4  N           5210         ...chqwvfkwr4wtyjn4re5dnsh57gg37pstju"}}  
28501  Ongoing  ...774A4Hyd     ...t4ax4qoi  N           22025306     ...ygjcprcw2a4bozwx2b7b2y3etdrr3zlpbq"}}  
29091  Ongoing  ...774A4Hyd     ...tww3u3je  N           2102362      ...gcbuocdvrr542bztpkczydyht7irflzjy4"}}  

All of them are stuck.

hannahhoward commented 4 years ago

Thanks. OK, so it seems like somehow the stream buffer is full when send/recovery happens.

hannahhoward commented 4 years ago

follow-up: I'm a bit stumped

I wrote a benchmark to try to replicate the conditions you're encountering with the bot and I'm finding it hard to get whatever is happening to happen. https://github.com/ipfs/go-graphsync/pull/91

This sets up 20 simultaneous transfers of 128MB files split into 1MB chunks, similar to Filecoin. We don't have the ability to test a real network connection, but just to be safe we can bring the whole mock connection down, which is what I do in the middle of the test. It seems to survive that and complete successfully... leading me to believe we're now dealing with a problem that only manifests on a real network connection.
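
For reference, the scenario parameters are roughly (illustrative constants, not the exact code from the PR):

    package sketch

    // Shape of the benchmark scenario in the PR above (constants are illustrative).
    const (
        numTransfers = 20        // simultaneous GraphSync transfers
        fileSize     = 128 << 20 // 128MB per file
        chunkSize    = 1 << 20   // 1MB unixfs-style chunks, similar to Filecoin
    )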

hannahhoward commented 4 years ago

@raulk if you or anyone else on libp2p has any thoughts on what could be happening here, that'd be cool.

Last time I saw this issue, it was because we were constructing messages that were too large -- 4MB or greater. There should be code in graphsync now to immediately start a new message if your total size goes above 512K. Reading an incoming message over 4MB will fail with a ShortBuffer error code from gogo-protobuf.
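
For context, the receive path boils down to something like this (a simplified sketch, not the actual reader in message/message.go):

    package sketch

    import (
        "bufio"
        "encoding/binary"
        "io"
    )

    const maxMessageSize = 4 << 20 // 4MB cap on a single incoming message

    // readDelimited reads one varint-length-prefixed message. If the advertised
    // length exceeds the cap, it bails out with io.ErrShortBuffer, the same error
    // gogo-protobuf's delimited reader returns, and what surfaces in the logs as
    // "Graphsync ReceiveError: short buffer".
    func readDelimited(r *bufio.Reader) ([]byte, error) {
        length, err := binary.ReadUvarint(r)
        if err != nil {
            return nil, err
        }
        if length > maxMessageSize {
            return nil, io.ErrShortBuffer
        }
        buf := make([]byte, length)
        if _, err := io.ReadFull(r, buf); err != nil {
            return nil, err
        }
        return buf, nil
    }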

Does Libp2p have an internal message size limit? Filecoin chunks up unixFS files into fairly large blocks -- is there a scenario in which a 1MB block could break a message send? Should we consider lowering this? Is there any reason if the stream is reset and we reopen a connection that old data would still be in the stream?

Also, to review: to my understanding this isn't happening to all miners, but somewhat randomly to some.

danny-wu commented 4 years ago

@hannahhoward The only thing that comes to mind is that I have jumbo frames enabled (an MTU of 8192, specifically). My switch supports jumbo frames and this happens at the IP level, so I doubt it's causing anything here, but that's how my networking differs from stock.

Would it help for me to try and capture a tcpdump of this happening?

danny-wu commented 4 years ago

I'm going to try to get a complete tcpdump of a stream reset and malformed-length error happening when dealbot is back online. I'll post an update in this ticket.

raulk commented 4 years ago

Just looking at this now. Thanks for the detailed report, @danny-wu.

raulk commented 4 years ago

@danny-wu a tcpdump won't help much here because all traffic is encoded. My guess is that the stream is being reset, but for some reason we keep using it. One hypothesis is that multiple threads could be writing to the same stream concurrently. That could definitely cause issues like that. Let me look.
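
To illustrate the hypothesis (a sketch, not the actual code): if two goroutines write length-delimited messages to the same stream without synchronisation, the length prefix of one message and the payload of another can interleave, and the reader then parses arbitrary payload bytes as a length:

    package sketch

    import (
        "encoding/binary"
        "io"
    )

    // writeDelimited writes <varint length><payload> as two separate Write calls.
    // If two goroutines call this on the same stream concurrently, e.g.
    //
    //     go writeDelimited(stream, msgA)
    //     go writeDelimited(stream, msgB)
    //
    // the frames can interleave and the receiver sees a malformed length.
    func writeDelimited(w io.Writer, payload []byte) error {
        var lenBuf [binary.MaxVarintLen64]byte
        n := binary.PutUvarint(lenBuf[:], uint64(len(payload)))
        if _, err := w.Write(lenBuf[:n]); err != nil {
            return err
        }
        _, err := w.Write(payload)
        return err
    }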

danny-wu commented 4 years ago

Thanks so much @raulk; hope you can find the cause 🙏

Something interesting I find is that the stream resets tend to happen approximately 1 minute after the storage transfer begins. For example:

2020-09-06T10:33:53.664Z        INFO storagemarket_impl      impl/provider.go:199    Handling storage deal proposal!
2020-09-06T10:33:53.904Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventOpen", "proposal CID": "bafyreigoez74stgjzdkznnpoufhmceui2yjvqbpkpeknvr446oi3wlo3w4", "state": "StorageDealValidating", "message": ""}
2020-09-06T10:33:53.930Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreigoez74stgjzdkznnpoufhmceui2yjvqbpkpeknvr446oi3wlo3w4", "state": "StorageDealAcceptWait", "message": ""}
2020-09-06T10:33:53.968Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDataRequested", "proposal CID": "bafyreigoez74stgjzdkznnpoufhmceui2yjvqbpkpeknvr446oi3wlo3w4", "state": "StorageDealWaitingForData", "message": ""}
2020-09-06T10:33:54.216Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreigoez74stgjzdkznnpoufhmceui2yjvqbpkpeknvr446oi3wlo3w4", "state": "StorageDealTransferring", "message": ""}
2020-09-06T10:34:57.815Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: stream reset
...
2020-09-06T13:35:06.885Z        INFO storagemarket_impl      impl/provider.go:199    Handling storage deal proposal!
2020-09-06T13:35:07.128Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventOpen", "proposal CID": "bafyreif7o6bvtdkob6ntkm32r54thehlhghag2576xihfmhdushsua4aqm", "state": "StorageDealValidating", "message": ""}
2020-09-06T13:35:07.146Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreif7o6bvtdkob6ntkm32r54thehlhghag2576xihfmhdushsua4aqm", "state": "StorageDealAcceptWait", "message": ""}
2020-09-06T13:35:07.185Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDataRequested", "proposal CID": "bafyreif7o6bvtdkob6ntkm32r54thehlhghag2576xihfmhdushsua4aqm", "state": "StorageDealWaitingForData", "message": ""}
2020-09-06T13:35:07.722Z        INFO markets loggers/loggers.go:19   storage event   {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreif7o6bvtdkob6ntkm32r54thehlhghag2576xihfmhdushsua4aqm", "state": "StorageDealTransferring", "message": ""}
2020-09-06T13:35:48.560Z        INFO graphsync       impl/graphsync.go:244   Graphsync ReceiveError: stream reset
jennijuju commented 4 years ago

@hannahhoward Not sure if it's related or not, but one of the miners with the following config is having the same error:


[Storage]
  ParallelFetchLimit = 8
  AllowAddPiece = false
  AllowPreCommit1 = false
  AllowPreCommit2 = false
  AllowCommit = false
  AllowUnseal = true
hannahhoward commented 4 years ago

@raulk The whole messagequeue class should prevent multiple messages from being written by multiple threads. Essentially, all outgoing messages to a single peer pipe through the message queue, which sends them sequentially in a single, isolated thread.
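
Schematically (a simplified sketch, not the actual messagequeue code):

    package sketch

    import (
        "context"
        "io"
    )

    // peerMessageQueue: one queue per peer, drained by a single goroutine, so
    // only one outgoing message is ever being written to the stream at a time.
    type peerMessageQueue struct {
        outgoing chan []byte
    }

    func (mq *peerMessageQueue) run(ctx context.Context, stream io.Writer) {
        for {
            select {
            case msg := <-mq.outgoing:
                if _, err := stream.Write(msg); err != nil {
                    // the real code attempts reconnect/resend here
                }
            case <-ctx.Done():
                return
            }
        }
    }

    // Callers on other goroutines only ever enqueue; they never touch the stream.
    func (mq *peerMessageQueue) enqueue(msg []byte) { mq.outgoing <- msg }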

vyzo commented 4 years ago

this might be caused by oversize messages; we use a 1MB limit pretty much everywhere in libp2p, so this might explain the issue.

raulk commented 4 years ago

Judging by where the first sign of an error appears (Graphsync ReceiveError: stream reset), I don't think this is related to oversized protocol messages. We are initializing a gogo length-delimited protobuf reader with a buffer of 4mb, on the receiver side. But this first fails on the receiver side with a "stream reset", not a "short buffer" error, which is what we'd expect to see if the root cause was an oversized message.

If the first error we observe is a "stream reset", it's probably a yamux stream RST sent by the other end. I'm digging into the places where we explicitly reset the stream on the sender side, to try to get to the bottom of this.

raulk commented 4 years ago

@danny-wu I'd like to take a look at the logs on the dealbot side (sender). Mind posting your miner ID, or slacking it to me via DM on the Filecoin slack?

raulk commented 4 years ago

@hannahhoward I think I may have found the culprit here. I can't claim I understand the nitty-gritty details of go-graphsync, so pardon me if I get something wrong here.

It looks like chunking is happening on a higher level, at the "response builder" level.

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/responsemanager/peerresponsemanager/peerresponsesender.go#L353

If the sum of the block size of the previous response and the incoming one is greater than 512kb, we start another response builder, instead of appending the response.
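
In pseudocode, that decision is roughly (my paraphrase of the linked lines, with illustrative names):

    package sketch

    type responseBuilder struct{ size uint64 }

    type responseSender struct{ builders []*responseBuilder }

    const maxBuilderBlockSize = 512 * 1024 // ~512KB per response builder

    // builderFor appends to the current builder until the next block would push
    // it past the limit, at which point a fresh builder is started.
    func (prs *responseSender) builderFor(nextBlockSize uint64) *responseBuilder {
        if len(prs.builders) == 0 {
            prs.builders = append(prs.builders, &responseBuilder{})
        }
        last := prs.builders[len(prs.builders)-1]
        if last.size+nextBlockSize > maxBuilderBlockSize {
            last = &responseBuilder{}
            prs.builders = append(prs.builders, last)
        }
        last.size += nextBlockSize
        return last
    }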

At one point, we call sendResponseMessages, which ends up looping over all response builders and calling SendResponse on each:

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/responsemanager/peerresponsemanager/peerresponsesender.go#L395-L411

This goes into the PeerMessageManager:

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/peermanager/peermessagemanager.go#L42-L47

Which calls AddResponses for each individual response.

Here is where the issue is:

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/messagequeue/messagequeue.go#L67-L80

Each call to AddResponse ends up coalescing the response into a single serializable response, which probably blows up the protobuf message over 4mb.
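
The shape is roughly this (again my paraphrase, not the exact code):

    package sketch

    import "sync"

    type block struct{ data []byte }
    type response struct{}

    // messageQueueSketch: AddResponses merges everything into the single pending
    // message; nothing here bounds the serialized size, so several ~512KB chunks
    // queued while a send is in flight can coalesce into one protobuf well past
    // the receiver's 4MB limit.
    type messageQueueSketch struct {
        lk               sync.Mutex
        pendingResponses []response
        pendingBlocks    []block
        pendingSize      uint64
    }

    func (mq *messageQueueSketch) AddResponses(responses []response, blocks []block) {
        mq.lk.Lock()
        defer mq.lk.Unlock()
        mq.pendingResponses = append(mq.pendingResponses, responses...)
        mq.pendingBlocks = append(mq.pendingBlocks, blocks...)
        for _, b := range blocks {
            mq.pendingSize += uint64(len(b.data)) // grows without an upper bound
        }
    }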

When the dealbot tries to write the message on the wire, it succeeds. But as soon as @danny-wu's miner receives the length-prefix of the protobuf message, it notices it's over 4mb (the configured limit for the gogo protobuf reader), and it resets the stream.

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/message/message.go#L262-L265

I think this situation is likely to happen when congestion control has kicked in on the receiver side (@danny-wu's miner; have you been monitoring your bandwidth utilisation?). So a message is being actively written on the wire, but a bunch of queries are coming in and queuing up more data, all of which eventually ends up in one big fat protobuf that the receiver then rejects.


P.S.: There's quite a bit of indirection and abstraction in go-graphsync, which we may be able to simplify going forward; I found this codebase very hard to follow!

hannahhoward commented 4 years ago

@raulk Oy. You're right, what a fail. Will fix tomorrow.

hannahhoward commented 4 years ago

@raulk follow-up: I'm not so sure.

If you look at the loop here:

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/responsemanager/peerresponsemanager/peerresponsesender.go#L395-L411

note the last 4 lines:

    // wait for message to be processed 
    select { 
    case <-done: 
    case <-prs.ctx.Done(): 
    } 

I now remember writing this part specifically to prevent the problem you're talking about. The receive on done won't complete until the actual message has been sent on the wire.

Obviously, experimentally, something is wrong here, so maybe we need to investigate further.

And I agree, the message coalescing part of the message queue at this point really isn't doing much other than coalescing requests and probably could be simplified considerably.

raulk commented 4 years ago

@hannahhoward Right! There's something weird here, though. I think the assumption that this channel fires when the message is written on the wire is incorrect.

Take a look at these lines. We close the channel as soon as we pick the message to send, not when it is actually sent on the wire (that happens later).

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/messagequeue/messagequeue.go#L137-L143

The loop you point to will iterate before the message is written on the wire. I don't know whether this is the problem, but it's worth digging into further.
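
To make the ordering concrete, here is a sketch of the two variants (not the real code); what I'm describing above corresponds to variant A:

    package sketch

    import "io"

    // sendPicked writes one extracted message to the wire. Variant A releases the
    // waiter (by closing its done channel) as soon as the message is picked,
    // before the write; variant B would release it only after the bytes are on
    // the wire.
    func sendPicked(msgBytes []byte, done chan struct{}, wire io.Writer) error {
        close(done) // variant A: the response loop's <-done returns here
        _, err := wire.Write(msgBytes)
        // close(done) // variant B: would only release the waiter after the write
        return err
    }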

hannahhoward commented 4 years ago

Sorry, it's not so much that the message is on the wire as that it has been removed from the queue, so nothing more can be added to it. See the immediately preceding lines:

https://github.com/ipfs/go-graphsync/blob/bd2d62f57b0f1590dd657ff8b1f3e08848f3f8a4/messagequeue/messagequeue.go#L135-137

danny-wu commented 4 years ago

Hi Raul, my ID is t014365

I have 5 machines. One runs the lotus node and the lotus miner; the rest are workers. These are connected over a gigabit TL-SG108 switch, which is then plugged into my router.

During mining, sometimes large (~500GB) PC1 files get transferred from worker to worker. Thus, the switch is maxed out, and I can imagine congestion control kicks in.

Anecdotally, recent stream resets have seemed to coincide with when a transfer is happening, and network I/O is being maxed out.

I also wonder if ping/latency is a component here. Anecdotally, I've noticed reports of this from Oceania and China more, and we tend to have relatively high latencies to where dealbot is hosted.

hannahhoward commented 3 years ago

Closing as out of date; I believe this has been addressed by several improvements made since it was first filed.