Open MichaelMure opened 4 years ago
Btw, this is with go-ipfs 0.4.22, so v0.0.1 of this package, but that's the same `github.com/whyrusleeping/chunker` as master.
@MichaelMure are you by chance able to share an input (publicly or privately) that reliably reproduces this? Or is the problem intermittent, where the same input might hang a daemon once and then pass through just fine on a rerun?
Unfortunately, that's all the information I have. It seems to be fairly rare, but to be fair, not many people play with the Rabin chunker.
@MichaelMure understood. It definitely sounds like a missing underflow protection, but it will take me another couple of days to get my test setup working well enough to pin this down. I should have something for you Thu-ish.
Note: our current rabin implementation is terrible. I wouldn't waste too much time trying to fix it.
The buzhash implementation in go-ipfs master should work much better.
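Since buzhash keeps coming up: the technique itself is easy to sketch. It keeps a rolling hash over a small byte window and cuts a chunk whenever the hash matches a bit mask, clamped between a min and max size. A self-contained toy illustration follows; the window, mask, and size constants are picked for the demo and are not go-ipfs's actual parameters:

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	windowSize = 16   // rolling-hash window, in bytes
	mask       = 0x3F // cut when the low 6 bits are zero (~64 B average here)
	minChunk   = 32   // toy sizes for the demo; go-ipfs uses much larger ones
	maxChunk   = 256
)

// table maps each byte value to a fixed random word, as buzhash does.
var table [256]uint32

func init() {
	rng := rand.New(rand.NewSource(1))
	for i := range table {
		table[i] = rng.Uint32()
	}
}

func rotl(x uint32, n uint) uint32 { return x<<n | x>>(32-n) }

// chunk splits data at content-defined boundaries: the hash depends only
// on the last windowSize bytes, so identical content produces the same
// cut points regardless of where it sits in the stream.
func chunk(data []byte) [][]byte {
	var out [][]byte
	start := 0
	var h uint32
	for i := range data {
		n := i - start
		if n < windowSize {
			h = rotl(h, 1) ^ table[data[i]] // window still filling
		} else {
			// slide the window: age out the departing byte, mix in the new one
			h = rotl(h, 1) ^ rotl(table[data[i-windowSize]], windowSize) ^ table[data[i]]
		}
		if n+1 >= minChunk && (h&mask == 0 || n+1 >= maxChunk) {
			out = append(out, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		out = append(out, data[start:]) // trailing partial chunk
	}
	return out
}

func main() {
	data := make([]byte, 2000)
	rand.New(rand.NewSource(42)).Read(data)
	chunks := chunk(data)
	total := 0
	for _, c := range chunks {
		total += len(c)
	}
	fmt.Println(len(chunks) > 1, total == len(data)) // true true
}
```

Because the hash depends only on the window contents, an insertion near the start of a file shifts chunk boundaries only locally, which is the property that makes content-defined chunking dedup-friendly.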
Will it get deprecated/dropped? Because if someone can take out our nodes with a bad request, that's not so fun.
Probably not? It's just a low-priority issue given that we support an additional chunker and are working on yet another chunker.
If you need it fixed now, I'm happy to accept a patch. However, the HTTP API is not designed to be a public API so this issue is significantly less critical for the IPFS team.
I understand the low priority. My question is more if you plan to retire Rabin once Buzhash is there for good.
@MichaelMure there's a sustained effort over the next ~month to better quantify what exactly we want from a CDC (content-defined chunker), and where our actual bottlenecks are. Please track the main text of https://github.com/ipfs/specs/issues/227#issue-532891529 for further updates.
@MichaelMure just a quick update on this ticket: this is almost certainly (though not 100% proven) a bug in the way the rabin chunker implementation reads from the stream, resulting in an unexpected under-read and thus a loop waiting for data that never comes.
Since this issue got opened originally, we have developed a different subsystem providing a unified read-interface for the rest of the ingestion pipeline, namely https://pkg.go.dev/github.com/ipfs/go-qringbuf@v0.0.0-20200430132832-6a3bc585b0d9?tab=doc#pkg-overview
This bug will be solved in the process of migrating to that system.
Awesome thanks for the update :)
At Infura we've had some of our nodes end up in an infinite loop in the Rabin chunker, with the CPU at 100% until something bad happens (the context doesn't reach that code, so the loop keeps spinning even though the request is long gone).
Here is a profiling:
Original requests are `add` with the following parameters: `chunker=rabin-262144-524288-1048576,encoding=json,hash=sha2-256,inline-limit=32,pin=false,progress=true,stream-channels=true`. Some also have `trickle=true`, but that doesn't seem to matter. Profiling led to an infinite loop in https://github.com/whyrusleeping/chunker/blob/master/chunker.go#L209-L336 but it's unclear to me what is happening.
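As I read the chunker spec format, the three numbers in `rabin-262144-524288-1048576` are the minimum, average, and maximum chunk sizes in bytes (256 KiB / 512 KiB / 1 MiB). A throwaway parser, just to make the request parameters above concrete (this is illustrative only, not go-ipfs's actual parsing code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRabin splits a chunker spec like "rabin-262144-524288-1048576"
// into its three size parameters. The min/avg/max interpretation is my
// reading of the spec format, restated here for illustration.
func parseRabin(spec string) (min, avg, max int, err error) {
	parts := strings.Split(spec, "-")
	if len(parts) != 4 || parts[0] != "rabin" {
		return 0, 0, 0, fmt.Errorf("not a rabin spec: %q", spec)
	}
	vals := make([]int, 3)
	for i, p := range parts[1:] {
		if vals[i], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, err
		}
	}
	return vals[0], vals[1], vals[2], nil
}

func main() {
	min, avg, max, err := parseRabin("rabin-262144-524288-1048576")
	fmt.Println(min, avg, max, err) // 262144 524288 1048576 <nil>
}
```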
`if err == io.ErrUnexpectedEOF { err = nil }` is suspicious, but that kinda flies over my head.