ipfs / boxo

A set of reference libraries for building IPFS applications and implementations in Go.
https://github.com/ipfs/boxo#readme

Chunker Memory leak #647

Open Xib1uvXi opened 1 month ago

Xib1uvXi commented 1 month ago

When I add a folder containing many small files to kubo, memory usage always spikes abnormally.

I located the problem by analyzing a heap profile.


// NewSizeSplitter returns a new size-based Splitter with the given block size.
func NewSizeSplitter(r io.Reader, size int64) Splitter {
    return &sizeSplitterv2{
        r:    r,
        size: uint32(size),
    }
}

// NextBytes produces a new chunk.
func (ss *sizeSplitterv2) NextBytes() ([]byte, error) {
    if ss.err != nil {
        return nil, ss.err
    }

    full := pool.Get(int(ss.size))
    n, err := io.ReadFull(ss.r, full)
    switch err {
    case io.ErrUnexpectedEOF:
        ss.err = io.EOF
        small := make([]byte, n)
        copy(small, full) 
        pool.Put(full)  
        return small, nil
    case nil:
        return full, nil
    default:
        pool.Put(full)
        return nil, err
    }
}

// Reader returns the io.Reader associated to this Splitter.
func (ss *sizeSplitterv2) Reader() io.Reader {
    return ss.r
}

Q1: Will using a pool result in abnormal memory consumption when the file size is much smaller than the chunk size?

welcome[bot] commented 1 month ago

Thank you for submitting your first issue to this repository! A maintainer will be here shortly to triage and review. In the meantime, please double-check that you have provided all the necessary information to make this process easy! Any information that can help save additional round trips is useful! We currently aim to give initial feedback within two business days. If this does not happen, feel free to leave a comment. Please keep an eye on how this issue will be labeled, as labels give an overview of priorities, assignments and additional actions requested by the maintainers.

Finally, remember to use https://discuss.ipfs.io if you just need general support.

gammazero commented 1 month ago

Will using a pool result in abnormal memory consumption when the file size is much smaller than the chunk size?

The pool tries to allocate buffers sized to the power of 2 closest to the chunk size. A small file causes a new allocation, and the pool buffer is returned to the pool.

        small := make([]byte, n)
        copy(small, full) 
        pool.Put(full)  

Returning the buffer to the pool does not necessarily free it; the pool may retain it. With a large number of files, this extra allocation happens once per file, when the final partial chunk is read. So I do not think it has to do with small files so much as the number of files.

It may be better to keep the pool-allocated buffer, and avoid the extra allocation and copy:

        small := full[:n]

or use the pool to allocate the smaller buffer:

        small := pool.Get(n)
        copy(small, full)
        pool.Put(full)

Having fewer distinct buffer sizes might also help with GC. I will test to see if this makes any difference.

gammazero commented 1 month ago

Accidental close - reopened

Xib1uvXi commented 1 month ago

I have reimplemented a Splitter and modified some kubo related code. Under scenarios with a large number of files, memory usage has stabilized and no abnormal growth has been observed.

chunk

func NewAutoSplitter(r io.Reader, size int64, fileSize int64) chunk.Splitter {
    ss := &AutoSplitter{
        r:        r,
        size:     size,
        usePool:  true,
        fileSize: fileSize,
    }

    if fileSize >= size {
        ss.mode = normal
    }

    if fileSize < size {
        ss.mode = small
        ss.buffer = make([]byte, fileSize)
    }

    return ss
}

kubo, core/coreapi/unixfs.go

// Constructs a node from reader's data, and adds it. Doesn't pin.
func (adder *Adder) add(reader io.Reader, fileSize int64) (ipld.Node, error) {
    chnk := chunker.NewAutoSplitter(reader, chunkSize, fileSize)

    params := ihelper.DagBuilderParams{
        Dagserv:    adder.bufferedDS,
        RawLeaves:  adder.RawLeaves,
        Maxlinks:   ihelper.DefaultLinksPerBlock,
        NoCopy:     adder.NoCopy,
        CidBuilder: adder.CidBuilder,
    }

    db, err := params.New(chnk)
    if err != nil {
        return nil, err
    }
    var nd ipld.Node
    if adder.Trickle {
        nd, err = trickle.Layout(db)
    } else {
        nd, err = balanced.Layout(db)
    }
    if err != nil {
        return nil, err
    }

    return nd, adder.bufferedDS.Commit()
}

gammazero commented 1 month ago

@Xib1uvXi I would be very interested to see what this looks like for you after running a while: https://github.com/ipfs/boxo/pull/649

Otherwise, avoiding the pool entirely may end up being more efficient. I think that may be the case for many files, whether large or small, since the extra allocation is caused by the partially filled chunk at the end of each file.

That PR includes a benchmark comparing a chunker that uses the go-buffer-pool pool against one that does not. The benchmark simulates your use case of 10000 files with sizes varying between 20K and 60K bytes.