maoueh opened 11 months ago
I have the same issue with my substreams tier2 instances. They would consume all available tcp_mem until the system crashed.
I've been poking around trying to figure this out for a while, and I think I've found the issue: the reader.Body is not closed here: https://github.com/streamingfast/dstore/blob/3924b3b36c778c14ef73ce108d965c774fb27fff/s3store.go#L335
I added a reader.Body.Close()
and that seems to have fixed the issue; tcp_mem usage is down to nothing.
Or am I wrong because it is closed later?
It's closed later indeed, that's the point of OpenObject
. But your investigation points to something in that vein: there is definitely something not being closed properly while the streaming is happening.
We had a "sample" binary that called OpenObject
in a loop and then tried to read the merged blocks. I'll try to dig it up and set up the infra to reproduce the error more easily.
It appears our streaming code for S3 (or maybe the S3 library itself) has a bug that leads to a weird file issue where the content is not fully read.
The bug seems to happen non-systematically, which makes me think it could be incorrect error handling when the stream closes unexpectedly.
This problem has been reported a few times over the past 1-2 years, against Ceph, S3 directly, and SeaweedFS. The current workaround is to set
DSTORE_S3_BUFFERED_READ=true
which reads everything in one shot into memory and then acts as an io.Reader
. This however creates memory pressure, as the full file is held in memory before being streamed. See https://github.com/streamingfast/firehose-core/issues/15 for some details, and some logs from SeaweedFS. We can see there that SeaweedFS sees internal failures, but those later lead to Firehose trying to read corrupted blocks:
Which means that somehow the "consumer" saw an end of stream, but the actual reading code failed due to some missing bytes.