vstinner / hachoir

Hachoir is a Python library to view and edit a binary stream field by field
http://hachoir.readthedocs.io/
GNU General Public License v2.0

Truncated jpeg memory use #67

Closed by cccs-jh 3 years ago

cccs-jh commented 3 years ago

A truncated JPEG can contain a JpegImageData field with no terminator, so the field is created without a known size. Because the size isn't known, the corrupted JpegImageData must be parsed in full to calculate its size when the field is added to its parent JpegFile during JpegFile parsing. As a result, even simple operations that don't care about the JpegImageData, such as checking whether a field with a given name is in the JpegFile, parse the corrupted JpegImageData fully. Parsing the corrupted section can blow up the parser's memory use as it tries to parse the entire rest of the file in small chunks.

An example file that causes this issue can be found here: https://github.com/CybercentreCanada/assemblyline-service-characterize/issues/12. Truncated to 500 000 bytes, this JPEG consumes approximately 1 GB of memory parsing JpegHuffmanUnits until the parser reaches the end of the file and errors. This happens when extracting metadata, or whenever checking for a field name that isn't in the JPEG.
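To make the mechanism concrete, here is a toy model of the behaviour described above. It is only a sketch: `ToyField`, `ToyFieldSet`, the 8-byte chunk size and the 16-byte header are all hypothetical and are not Hachoir's classes or values. It assumes a parent that parses its children lazily and needs each child's size before moving on, and a child with no known size that can only find its size by walking the rest of the stream in small chunks.

```python
# Toy model only: none of these names come from Hachoir.

class ToyField:
    """A field whose size may be unknown until it is parsed."""

    def __init__(self, name, size=None):
        self.name = name
        self._size = size      # None means "no terminator found, size unknown"
        self.children = []     # sub-fields created while sizing the field

    def size(self, remaining, chunk=8):
        if self._size is None:
            # No terminator: walk everything that is left, one small chunk
            # at a time, keeping one record per chunk (this is what grows).
            pos = 0
            while pos < remaining:
                length = min(chunk, remaining - pos)
                self.children.append((pos, length))
                pos += length
            self._size = remaining
        return self._size


class ToyFieldSet:
    """A parent that parses its fields lazily, in order."""

    def __init__(self, fields, total_size):
        self._pending = iter(fields)
        self._parsed = {}
        self._offset = 0
        self._total = total_size

    def __contains__(self, name):
        if name in self._parsed:
            return True
        # Keep parsing until the name shows up or the fields run out.
        for field in self._pending:
            # The parent needs the child's size to place the next field.
            self._offset += field.size(self._total - self._offset)
            self._parsed[field.name] = field
            if field.name == name:
                return True
        return False


image_data = ToyField("data")  # unsized, like a truncated image data field
jpeg = ToyFieldSet([ToyField("header", size=16), image_data], total_size=500_000)

found_header = "header" in jpeg            # stops before the unsized field
chunks_after_header = len(image_data.children)
found_missing = "missing" in jpeg          # forces the unsized field to be sized
chunks_after_missing = len(image_data.children)
```

In this model, looking up a name that exists early in the file is cheap, while looking up a name that is absent walks the whole remainder and allocates one record per chunk, which matches the pattern of memory growth reported here.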

vstinner commented 3 years ago

What do you suggest? Limit Hachoir memory? Don't parse a JPEG file if it's truncated?

vstinner commented 3 years ago

Oh, I noticed afterwards that you proposed PR #68 to fix it. I merged your fix, thanks.

vstinner commented 3 years ago

It's also good to see that Hachoir is used ;-)