segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Question: Too much memory to write parquet files #412

Open ycyang-26 opened 1 year ago

ycyang-26 commented 1 year ago

When I write a new parquet file (about 100 MB of data before compression), about 1 GB of memory is requested. I'm wondering why it takes so much memory. I use the following method to write the file, and the struct of the data is as below:

err := parquet.Write[*model.x](data, rawParquetData, compressionType)

type x struct {
    x int      `parquet:"x,delta"`
    x string   `parquet:"x,dict"`
    x string   `parquet:"x,dict"`
    x string   `parquet:"x"`
    x int      `parquet:"x"`
    x int      `parquet:"x"`
    x string   `parquet:"x,dict"`
    x string   `parquet:"x,dict"`
    x string   `parquet:"x"`
    x int      `parquet:"x,dict"`
    x []string `parquet:"x,list"`
    x string   `parquet:"x,dict"`
    x []string `parquet:"x,list"`
}
kevinburkesegment commented 1 year ago

Are you familiar with using pprof to profile a Go program? The first thing we would do is try to reproduce the results you're seeing and then analyze the quantity and size of memory allocations, but given you can reproduce it consistently, it may be easier for you to generate a profile. https://pkg.go.dev/runtime/pprof
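
For reference, a minimal sketch of snapshotting the heap around the write with runtime/pprof (the heap.out file name is arbitrary); the resulting profile can then be inspected with go tool pprof -alloc_space heap.out:

package main

import (
    "log"
    "os"
    "runtime"
    "runtime/pprof"
)

// writeHeapProfile snapshots the current heap to path.
func writeHeapProfile(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    // Run a GC first so the profile reflects up-to-date allocation statistics.
    runtime.GC()
    return pprof.WriteHeapProfile(f)
}

func main() {
    // ... run the parquet write here, then snapshot the heap ...
    if err := writeHeapProfile("heap.out"); err != nil {
        log.Fatal(err)
    }
}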

ycyang-26 commented 1 year ago

> Are you familiar with using pprof to profile a Go program? The first thing we would do is try to reproduce the results you're seeing and then analyze the quantity and size of memory allocations, but given you can reproduce it consistently, it may be easier for you to generate a profile. https://pkg.go.dev/runtime/pprof

Actually, I have already used pprof to analyze the program. The allocation profile of parquet.Write is shown below; the node highlighted in red accounts for 700 MB of allocated space.

[image: pprof alloc_space graph for parquet.Write]
vbmithr commented 1 year ago

I had similar issues and was never able to determine where the memory went, even with those traces. I never managed to fix them either. See #118

Indeed, playing with the GOGC environment variable did seem to help a bit, which may indicate that the problem lies in how the Go runtime handles garbage collection rather than in this library. I'm definitely not an expert in this. But ultimately, the amount of RAM needed was on the order of magnitude of the combined size of all the data (uncompressed!) that had to go into the file, whereas I thought you could theoretically write a parquet file using very little memory by flushing often.
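
For illustration, a minimal sketch of the same GC tuning done in-process via runtime/debug; calling SetGCPercent(20) is equivalent to running the program with GOGC=20:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // GOGC defaults to 100: a collection starts once the heap grows to twice
    // the live data of the previous cycle. A smaller value collects more
    // often, trading CPU time for a lower peak heap.
    previous := debug.SetGCPercent(20)
    fmt.Println("previous GC percent:", previous)
}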

kevinburkesegment commented 1 year ago

Thank you, that's really helpful.

achille-roussel commented 1 year ago

I believe the issue may come from using append, which stops exponentially increasing the slice capacity for large slices (around 1MiB if I remember correctly). This results in reallocating memory buffers that grow very slowly, greatly increasing the memory footprint.

We could try modifying the plain.AppendByteArrayString method to always grow the slice capacity by 2x manually, which would better amortize the cost of reallocation.
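
As an illustration of that idea, a hypothetical grow helper (not the library's actual code) that always doubles the capacity on reallocation, keeping the total bytes copied proportional to the final size:

package main

import "fmt"

// grow returns buf with room for at least n more bytes, doubling the
// capacity whenever a reallocation is needed (amortized O(n) copying).
func grow(buf []byte, n int) []byte {
    if cap(buf)-len(buf) >= n {
        return buf // enough headroom already
    }
    newCap := 2 * cap(buf)
    if newCap < len(buf)+n {
        newCap = len(buf) + n
    }
    tmp := make([]byte, len(buf), newCap)
    copy(tmp, buf)
    return tmp
}

func main() {
    var buf []byte
    for i := 0; i < 1_000_000; i++ {
        buf = grow(buf, 1)
        buf = append(buf, byte(i))
    }
    fmt.Println(len(buf), cap(buf))
}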

I'm also curious whether you are calling parquet.Write repeatedly in your application (e.g. to produce multiple parquet files). If that's the case, you might be able to gain much greater memory efficiency by reusing a parquet.GenericWriter instead.
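
A minimal sketch of that reuse pattern, assuming a hypothetical Row schema; it relies on the writer's Reset method to retarget each new file while keeping the writer's internal buffers:

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/segmentio/parquet-go"
)

// Row stands in for the application's real schema.
type Row struct {
    Name string `parquet:"name,dict"`
    N    int64  `parquet:"n"`
}

func main() {
    var writer *parquet.GenericWriter[Row]

    for i := 0; i < 3; i++ {
        f, err := os.Create(fmt.Sprintf("out-%d.parquet", i))
        if err != nil {
            log.Fatal(err)
        }
        if writer == nil {
            writer = parquet.NewGenericWriter[Row](f)
        } else {
            // Reuse the writer's internal buffers instead of allocating
            // fresh ones for every file.
            writer.Reset(f)
        }
        if _, err := writer.Write([]Row{{Name: "a", N: int64(i)}}); err != nil {
            log.Fatal(err)
        }
        if err := writer.Close(); err != nil {
            log.Fatal(err)
        }
        if err := f.Close(); err != nil {
            log.Fatal(err)
        }
    }
}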