storj / uplink

Storj network Go library
MIT License

Bad upload performance / no long tail cancelation in retry situation like too many offline nodes #164

Open · littleskunk opened this issue 6 months ago

littleskunk commented 6 months ago

We noticed a performance issue in the Storj Select network. One of the storage node providers had issues, and we believe a good number of nodes were offline for an hour or so. During that time the upload performance was impacted. We didn't have a good explanation why, so Igor and I reproduced it on the QA satellite. In total, the QA satellite has 146 nodes available for uploads. I took 40 nodes offline to create a situation similar to the one in the Storj Select network. Igor uploaded a file with an old and a new uplink version. The behavior of both uplink versions was the same.

Here is an example log: olduplink.txt

The upload took a surprisingly long time, the same issue we observed in production. Here is the timeline:

- Round 1: Uplink connects to 110 nodes. 33 nodes are offline, and it takes just 1 second to notice that. 76 uploads succeed after just 10 seconds. The real problem is one slow node that takes a minute before the uplink errors out. -> In a retry situation there is no long tail cancelation, so we wait for the slowest node to finish or fail.
- Round 2: 34 pieces to upload. 8 connection errors after 1 second. 4 successful uploads. The upload finished with 80 pieces. This all again took just a few seconds.

The gap between our expectation (the upload should finish in seconds) and the current implementation (it takes at least a minute, if not more) is that the offline nodes error out fast, but the retry is gated by the slowest node in the mix. So a single slow node in combination with too many offline nodes can destroy performance.
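To make the failure mode concrete, here is a minimal Go sketch of the pattern as we understand it (made-up names, not the actual uplink code): if the retry round can only start after every upload from the previous round has returned, one slow node gates the whole transfer even though the offline nodes failed fast.

```go
// Hypothetical sketch: the retry round is blocked on wg.Wait(),
// so it starts only after the slowest node finishes or fails.
package main

import (
	"fmt"
	"sync"
	"time"
)

// uploadPiece simulates a piece transfer that takes delay and
// then returns err (nil on success).
func uploadPiece(delay time.Duration, err error) error {
	time.Sleep(delay)
	return err
}

func main() {
	start := time.Now()

	var wg sync.WaitGroup
	jobs := []struct {
		node  string
		delay time.Duration
		err   error
	}{
		// Offline nodes error out after ~1 second.
		{"offline-1", time.Second, fmt.Errorf("connection refused")},
		{"offline-2", time.Second, fmt.Errorf("connection refused")},
		// One healthy but very slow node.
		{"slow-1", 60 * time.Second, nil},
	}
	for _, j := range jobs {
		wg.Add(1)
		go func(node string, d time.Duration, e error) {
			defer wg.Done()
			if err := uploadPiece(d, e); err != nil {
				fmt.Printf("%s failed after %v: %v\n", node, time.Since(start), err)
			}
		}(j.node, j.delay, j.err)
	}

	// Retry is gated here: even though the failures were known
	// after 1 second, we block ~60s for slow-1.
	wg.Wait()
	fmt.Printf("retry can start only after %v\n", time.Since(start))
}
```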

Would it be possible to kick off the retry for the offline nodes more or less right away, without waiting for the slow node? Or a more aggressive long tail cancelation that also triggers once successful + failed uploads > 80 (plus a safety threshold of 10 or so to avoid false positives)? In this situation it would have canceled the slow node after 10 seconds instead of waiting more than a minute.
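A minimal sketch of that suggested trigger, assuming hypothetical names (`tailCanceler`, `optimalPieces`, `safetyThreshold`) rather than the uplink's actual upload code: cancel the long tail as soon as so many uploads have terminated, either way, that the stragglers can no longer change the round's outcome.

```go
// Hypothetical sketch of the proposed cancelation condition; the
// names and the safety margin are assumptions, not the uplink API.
package main

import (
	"context"
	"fmt"
)

const (
	optimalPieces   = 80 // pieces needed to finish the upload
	safetyThreshold = 10 // margin against false positives
)

// tailCanceler counts terminal outcomes and cancels the shared
// upload context once the remaining in-flight pieces cannot matter.
type tailCanceler struct {
	cancel    context.CancelFunc
	successes int
	failures  int
}

func (t *tailCanceler) record(err error) {
	if err != nil {
		t.failures++
	} else {
		t.successes++
	}
	// Classic long tail cancelation: enough pieces are stored.
	enough := t.successes >= optimalPieces
	// Proposed extra trigger: so many uploads have finished (either
	// way) that waiting on the slow remainder cannot change this
	// round's result; cancel it and kick off the retry immediately.
	settled := t.successes+t.failures >= optimalPieces+safetyThreshold
	if enough || settled {
		t.cancel()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	c := &tailCanceler{cancel: cancel}

	// Replay the numbers from the log: 33 fast connection errors
	// and 76 fast successes out of 110 started uploads. The one
	// slow node left in flight gets canceled by the settled
	// trigger instead of holding up the retry for a minute.
	for i := 0; i < 33; i++ {
		c.record(fmt.Errorf("connection refused"))
	}
	for i := 0; i < 76; i++ {
		c.record(nil)
	}
	fmt.Println("long tail canceled:", ctx.Err() != nil)
}
```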

iglesiasbrandon commented 3 months ago

@littleskunk we are not sure if this issue is still relevant or not. We think we implemented some code changes that might have resolved it. Can you take a look and let us know?