vstinner / hachoir

Hachoir is a Python library to view and edit a binary stream field by field
http://hachoir.readthedocs.io/
GNU General Public License v2.0

Q: Extracting files from Win32 Cabinet Self-Extractor? #65

Closed brechtm closed 3 years ago

brechtm commented 3 years ago

I'm attempting to extract the Microsoft Core Fonts for the Web in Python. I learned about hachoir from the Stack Overflow question "How to extract a windows cabinet file in python".

hachoir parses the self-extracting exe file without complaint, and I was able to extract something (the /section_rsrc stream) that cabextract accepts. However, hachoir won't parse it:

>>> cab = createParser('rsrc.cab')
[warn] Skip parser 'CabFile': Invalid magic

Stripping all data before the CAB header using a hex editor, I can get hachoir to parse the CAB file. Does hachoir offer the means to extract this CAB file, without leading/trailing data? Or is that something that I need to look up in a Microsoft specification document?

Same question for extracting the files from the CAB file; does hachoir offer the abstraction level to do this?

Thanks!

brechtm commented 3 years ago

The size of cab['folder_data[0]'].uncompressed_data matches the sum of the sizes of the files. However, it is a string, while it should obviously be bytes for binary files (such as TTF files). I suspect I need to encode this string using a particular 8-bit encoding to be able to write out the files, but I haven't yet found out which. latin1 encoding produces a corrupt TTF file that can be opened by Font Book in macOS, but it still differs from the TTF produced by cabextract.

vstinner commented 3 years ago

Someone should enhance the validate() function of hachoir/parser/archive/cab.py to accept your CAB file. According to your error message, it seems like your CAB archive doesn't start with the 4 bytes: MSCF.
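
One way to get a stream that does start with MSCF, without a hex editor, is to scan for the signature and read the cabinet length from the header. The following is only a minimal sketch: it assumes the standard CAB header layout (a 32-bit little-endian cbCabinet size field at offset 8 from the signature) and that the first MSCF occurrence in the data is the real header. carve_cab is a hypothetical helper, not part of hachoir.

```python
import struct

CAB_MAGIC = b"MSCF"


def carve_cab(blob: bytes) -> bytes:
    """Return the first embedded cabinet in blob, without leading/trailing data."""
    start = blob.find(CAB_MAGIC)
    if start < 0:
        raise ValueError("no MSCF signature found")
    # Assumed header layout: signature (4 bytes), reserved1 (4 bytes),
    # then cbCabinet, the total cabinet size (4 bytes, little-endian).
    if start + 12 > len(blob):
        raise ValueError("truncated CAB header")
    (cab_size,) = struct.unpack_from("<I", blob, start + 8)
    end = start + cab_size
    if cab_size < 12 or end > len(blob):
        raise ValueError("implausible cabinet size")
    return blob[start:end]
```

Applied to the bytes of the dumped stream (rsrc.cab in the report above), this should return data that begins with MSCF, provided the assumptions hold. A false match on the four signature bytes elsewhere in the data is possible, so the result is worth validating with a real parser.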

brechtm commented 3 years ago

I think encoding uncompressed_data as latin1 is indeed the way to go here, since the very first version of Unicode used the code points of ISO-8859-1 as the first 256 Unicode code points [Wikipedia]. However, as stated above, the resulting TTF is corrupt.
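
The one-to-one mapping between the first 256 code points and byte values can be checked directly in Python. This is a small sketch of that property only; str_to_bytes is a hypothetical helper name, and it assumes the string holds exactly one character per original byte.

```python
def str_to_bytes(text: str) -> bytes:
    """Map each character with code point 0-255 to the byte of the same value."""
    # Raises UnicodeEncodeError if any character is above U+00FF.
    return text.encode("latin-1")


# Every byte value survives a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in as_text] == list(range(256))
assert str_to_bytes(as_text) == all_bytes
```

Because the round trip is lossless, any remaining difference in the written file would have to come from the decompressed data itself rather than from the encoding step.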

Examining this more closely, I see that the TTF differs from the known good version in only two bytes (though other extracted files differ in more bytes). I have not been able to determine the cause, but I suspect a bug in the LZX decompression code. That's not unlikely, since there aren't any tests for it and the LZX algorithm specification is known to have some errors.

I'd love to track down and fix this bug, but the use case doesn't allow for spending more time on this problem, unfortunately.

> Someone should enhance the validate() function of hachoir/parser/archive/cab.py to accept your CAB file. According to your error message, it seems like your CAB archive doesn't start with the 4 bytes: MSCF.

I think there are probably other files stored in the /section_rsrc besides the CAB, so I would need to somehow get the offset/length from the other fields.

nneonneo commented 3 years ago

In the .exe, it seems that one of the raw_res[] entries contains the file you want. For example, in arial32.exe, the contents of /section_rsrc/raw_res[1] are exactly the .cab file.

The issue with uncompressed_data being a string arose because the lzx module was missed when porting from Python 2 to 3. It should not be a string; I will devise a fix. Thanks for the report!

nneonneo commented 3 years ago

Secondly, it looks like I forgot to handle the Intel jump fixups in LZX. This has now been fixed in #66. Thanks for bringing it to my attention.

nneonneo commented 3 years ago

With #66 applied, the following code correctly extracts the files from arial32.exe:

from hachoir.parser.program import ExeFile
from hachoir.parser.archive import CabFile
from hachoir.stream import FileInputStream
from io import BytesIO

f = FileInputStream("arial32.exe")
exe = ExeFile(f)
rsrc = exe["section_rsrc"]
for content in rsrc.array("raw_res"):
    # get directory[][][] and corresponding name
    # this is a bit hacky, ideally API would provide this linkage directly
    directory = content.entry.inode.parent
    name_field = directory.name.replace("directory", "name")
    if name_field in rsrc and rsrc[name_field].value == "CABINET":
        break
else:
    raise Exception("No CABINET raw_res found")

cabdata = content.getSubIStream()
cab = CabFile(cabdata)
# request substream to force generation of uncompressed_data
cab["folder_data[0]"].getSubIStream()
folder_data = BytesIO(cab["folder_data[0]"].uncompressed_data)
for file in cab.array("file"):
    with open(file["filename"].value, "wb") as outf:
        outf.write(folder_data.read(file["filesize"].value))

brechtm commented 3 years ago

@nneonneo Many thanks for the fixes and the sample code. Highly appreciated!

Looks like you figured out what was wrong with the LZX decompression code very quickly. Sure, you worked on that code 10 years ago, but still. 😁