Poor performance when opening raw data with (wrong) gzip encoding in header

AAAArcus commented 5 years ago

Manually putting a field 'encoding'='raw' in the header overrides any compression_level>0 input to nrrd.write(), which I guess is the way it was intended to be. And when setting 'encoding'='gzip' and writing a file with compression_level=0, the output is uncompressed but the header is unmodified. This would not be an issue if it wasn't for the fact that reading a raw data file with 'encoding'='gzip' in the header (which it should not have of course) seems to take much longer. In my example with a 600MB raw data, writing with compression_level=0 and 'encoding'='raw' in the header gives one 600MB file, and writing again, just changing to 'encoding'='gzip' in the header, gives another 600MB file. However, reading the latter with nrrd.read() takes 300(!) times longer than reading the first.

There is probably and explanation if you look into the details, but it feels like if compression_level=0, the encoding should be set to 'raw' in the header automatically on write?

addisonElliott commented 5 years ago

Hm, this is an interesting effect.

I'm a bit hesitant to include special functionality to intercept when the compression_level is 1 and set the encoding to raw. However, maybe if we dig deeper into why the compression_level=0 takes longer then I could be swayed to include custom functionality.

This seems like a special use case. Is there a reason why you need to have a compression_level=0?

If you have a special use case where this is deemed necessary or advantageous, then I'm fine with what you suggest to do. However, it could be worthwhile to investigate where the bottleneck is from. There are two options as far as I can tell:

Decompression object (see here)
np.fromstring (see here)

AAAArcus commented 5 years ago

Haven't done any digging yet but will try to do so. To clarify, I personally think that the encoding field should tell you if the data is compressed or not (and if it's gzip or whatever). Maybe this is against "the standard"?

From a user perspective, setting compression_level is how you control the size/performance tradeoff. In my case, if I want maximum performance I would set compression_level=0 (without fiddling with the header). This caused the encoding to remain as "gzip", which would have been fine by me if it wasn't for the crazy performance LOSS when loading uncompressed. I don't know if it's common to use uncompressed nrrd, but I think this should be seen as a really useful feature when data loading is a bottleneck (when using compressed data).

AAAArcus commented 5 years ago

Ok, I did some testing, and it is the decompression object (alternative 1) that is the problem. Since the data is split in 1MB chunks, the reading loop runs about 600 times for me, and the time for each loop increases. First loop it's about 2-3ms and last loop 385ms for running line 409. I assume the memory is reallocated when growing the immutable bytes object "decompressed_data".

Growing a bytearray instead solves the problem, but maybe there is a better way? np.fromstring() still wants immutable bytes as input. Also, larger chunk size makes things a lot better. Maybe consider to use something larger than 1MB, or adaptive to filesize?

This comment makes me think there is room for improvement even by some minor changes.

addisonElliott commented 5 years ago

Perfect, I was doing some testing as well and got the same results.

An interesting thing I found is that bzip2 does not support a compression level of 0, only 1-9. Maybe we restrict compression level to be 1-9 and throw an error if the user sets it to 0. Then we document appropriately.

However, I think you've stumbled across an interesting find that maybe we should address. The 1MB chunks are before my time so I'll need to check out the commit as to why they were added. I would surely think this parameter could be much larger, like 50-100MB before causing problems.

Edit: Just hacked it to make the size 100MB and it reduced the amount of time to load any compressed data by 50-100x.

This is something worthwhile to look into. We should set this to an appropriate value and/or let it be adjustable.

addisonElliott commented 5 years ago

Here's the commit: https://github.com/mhe/pynrrd/commit/c98ac3c3bc7f24be0d20d8d0b9da1ff928aca94c

Here's the issue: #21

Not sure if this issue is related since it doesn't mention anything about the CRC check: https://bugs.python.org/issue25626

As far as I know, we should be able to set this limit to 4GB and be safe. I think it's a safe assumption pynrrd is going to be running on a system with an unsigned int being at least 32-bits.

We need some way to test this adequately without having to save a 4GB file. The test should generate a > 4GB file and then read from it. If this takes too long, then best not to have this done on TravisCI.

AAAArcus commented 5 years ago

Great, seems we can improve the performance overall, and doing something about the header['encoding'] and compression_level handling is a separate problem.

Still, I believe fixing the current improper memory handling is still important, even if the performance problem is less significant (or insignificant) with much larger chunks.

Seems using io.BytesIO is the fastest, but it requires additional import. Keeping chunks of 1MB I can reduce load time from 120 sec to 4 sec, in my specific case, by using io.BytesIO as a buffer.

addisonElliott commented 5 years ago

Alright, so we have three different problems we're analyzing. Let me start with summarizing the problem and give my proposed solution for each problem so we can systematically analyze this.

Problems:

compression_level=0 with encoding=gzip results in no compression at all. Same result as encoding=raw but takes longer.
- compression_level=0 is not supported for bzip2 however.
_READ_CHUNKSIZE parameter reads the data in 1MB chunks, severely slowing down performance with compressed data.
decompressed_data (here) is repetitively dynamically reallocated causing additional processing time.

Here are the solutions I think we should implement for the problems, let me know your opinion:

Restrict compression_level to be between 1-9. This way if the user wants raw data to be saved, they have to set encoding=raw.
Set _READ_CHUNKSIZE and _WRITE_CHUNKSIZE to 4GB. This may need to be slightly less than 4GB, not sure. Will need to test this with a 4GB+ file to ensure it works correctly. Especially on Python 2.7 because that is the environment the original issue was reported (#21)
This is the issue we're still discussing now. Not sure the best solution yet.

I don't know the specifics to how appending to a bytestring versus writing to BytesIO works. That could be illuminating into why one is faster over the other. Since we don't know the exact size of the decompressed data, we can't preallocate exactly what we need. Rather, I know the std::vector class in C++ does an approach where they allocate double the previously allocated amount when the internal size is hit. First allocate 1 byte, 2 byte, 4 byte, 8 byte, 16 byte, 32 byte, 64 byte, etc... Results in significantly less memory allocations and moving of data.

May be a good idea to do some research behind these methods and see what others are doing in the Python community for reading buffered data. BytesIO may well be the preferred method.

If so, why can't we use np.fromfile rather than np.fromstring since the BytesIO object will operate similarly to a file?

addisonElliott commented 5 years ago

I've done some research. This might be an easy really fix to make...

Refer to this article for details on the what is the most efficient method to concatenate multiple byte strings together: https://stackoverflow.com/questions/12169839/which-is-the-preferred-way-to-concatenate-a-string-in-python

The second answer by jdi was helpful because he did some benchmarking. In addition, he references the Python 3 FAQ saying that for concatenating bytearrays, the best way is just simply using += which sounded odd to me. See here: https://docs.python.org/3/faq/programming.html#what-is-the-most-efficient-way-to-concatenate-many-strings-together

I made a Jupyter notebook to do some testing. Here's an image of it here with the results too: Untitled2.pdf

In summary, using += is slightly faster than BytesIO if using a bytearray not a bytes object (b''). Bytearray is mutable, bytes (b'') is immutable. (See here).

So, the proposed solution is just to change line 400 from b'' to bytearray()...

I deleted my test script. Can you make that change and test again please? I'm thinking this will greatly improve reading speeds.

AAAArcus commented 5 years ago

I basically agree.

I still think it's a bit cumbersome to have to change the header entry instead of setting compression_level, but the proposed change makes sense and is minimally invasive. Although it will break some code.
Increasing the buffer to, lets say 100MB, would be a huge boost in performance, but maybe exploring this further is better than setting a new arbitrary chunk size.
Running my use case I still get better performance from bytesIO (4.6 sec) compared to bytearray (5.7 sec). But since this is not consistently observed, and bytesIO requires import, I think the += solution (essentially keeping the current code) is the best. However, np.fromstring() on line 413 need to be changed to data = np.fromstring(bytes(decompressed_data[byte_skip:]), dtype). Maybe extra conversion is causing bytesIO to win in my tests.

Regarding the np.fromfile, that could be possible, or maybe np.frombuffer is more suitable since it is a buffer in memory (that acts like a file). I actually used np.frombuffer but I don't think there is anything to gain here in terms of performance.

addisonElliott commented 5 years ago

You're right that it will break some code and that's a good point to make. However, my thought process that using compression_level=0 is really an unintentional side-effect. One reason that makes this stand out is that compression_level=0 only works for gzip and not bzip2. It is a bit more cumbersome but I think being explicit about the encoding being raw might be worth the tradeoff. But, the fact that it will break compatibility makes me wonder what version this should be released under.
I'm not sure I follow. The ideal scenario would be to read the entire file in one chunk. This reduces the overhead in system calls to the OS for data and since the entire file will ultimately be placed in RAM anyway, there is no downside to reading it all in one shot. However, the limitation is that there was problems reading more than 4GB at once due to the CRC being 32-bit I believe. So, I'm saying the maximum (and therefore optimal) chunk size would be 4GB.
Ran a similar test case and there was a slight speed increase between the bytearray method vs. BytesIO. I tested using np.frombuffer and np.fromstring like you mentioned and can confirm that np.fromstring is slower than np.frombuffer. The np.frombuffer had ~7% increase in computation time which is minimal. Plus, since the chunk size will be increased (point 2), this will result in this mattering less and less.

Points 1 & 3 we are in agreement on. Point 2 still needs some more discussion. In terms of PRs, I think point 1 should be its own PR and then points 2 & 3 can be put into one PR named something like "Fixing performance issues in compressed NRRDs".

For point 1, we will want to make sure to update the documentation. Once that is submitted, I'll leave it up for some others to review in case it will break compatibility for them.

For points 2 and 3, we can get one started for point 3 and then add in point 2 after we finish discussing it. The 3rd point is a simple fix. However, the 2nd point will need to include some tests for larger datasets potentially...

Are you able to submit PRs? If not, I can get to them at some point.

addisonElliott commented 5 years ago

As for point 2, I just tested with a chunk size of 4GB (4 * 2 ** 30) and it worked fine on Python 3 and Python 2. Honestly, I'm not sure if buffered reading is required since the original issue was with the gzip library not the zlib library and there is a bug report that was fixed regarding reading the memory in chunks internally. However, I think the best course of action is to set it to 4GB and get some feedback. I'm thinking we can forego tests since it will be difficult to manage a > 4GB file.

addisonElliott commented 5 years ago

@AAAArcus Bump

Any more thoughts? If not, are you able to submit PRs or would you like me too?

No rush, just want to keep the conversation going...

addisonElliott commented 5 years ago

@AAAArcus PRs are made. Let me know what you think and review them please.

AAAArcus commented 5 years ago

@addisonElliott Sorry for dropping off the radar. Haven't been able to work for more than a week... Looks good to me! Thanks for all the work!

Makes sense to remove the compression level zero without actually breaking functional code (but perhaps currently rather slow). The performance fix will make this a minor issue even for those who still run with compression_level=0.

Didn't notice the difference in how the writer handled the chunks compared to the reader. Reasonable that they look more symmetric. I agree that the RAM issues shouldn't be major nowadays. But maybe the memory of the compressed data could be released piece by piece to avoid storing both the compressed and decompressed data in memory at the same time? Not sure how though...

addisonElliott commented 5 years ago

No problem. This weekend was going to be the most free time I'll have the next few weeks so I wanted to get some things done on my TODO list.

Yeah, the reason for the asymmetry is likely due to the intricacies of the GZIP compression algorithm, something I'm not too familiar with.

That's not a bad idea to release the memory piece by piece. Another option is to figure out why fh.read(1GB) is MUCH slower on small files but fh.read() doesn't take long at all. That was the main reason why I didn't keep the buffering in the read I/O.

However, my opinion is to keep everything the way it is until someone runs across an issue. I've found that if I try to optimize for every single use case, it ends up taking forever!

I think I mentioned this in the PR, but my justification is that likely the compressed data is much smaller than the decompressed data. Thus, if the user is able to have the entire set of decompressed data in memory, then the RAM required is decompressed data size + compressed data size which shouldn't be too much more than the RAM required just for decompressed data.

Whenever I get some free time this week, I'll merge the PRs and close this issue. Thanks for reporting it and helping come to a solution!

AAAArcus commented 5 years ago

I guess the use case where the memory problem could show up would be if the uncompressed data is basically taking up all the memory, so the compressed would not fit along side it. But that shouldn't be too common. Usually you need some memory to process the uncompressed data afterwards anyway, so fitting the compressed data should rarely be the main problem.

Looking forward to the new version (instead of my own hack).

mhe / pynrrd

Poor performance when opening raw data with (wrong) gzip encoding in header #88