unixfreak0037 / officeparser

Extract embedded files and macros from office documents.
http://empieria.com/blog/?page_id=10
MIT License

Python 3 compatibility #19

Open frenzymadness opened 5 years ago

frenzymadness commented 5 years ago

Hello.

I am trying to make this tool Python 3 compatible while keeping backward compatibility with Python 2.7. I've tested my work with three scenarios and one test Word document. I am not a user of this tool, so I just compared the output for Python 2 and 3, and it seems to be okay.

Tested commands:

officeparser.py --create-manifest --extract-streams test/test.doc
officeparser.py --dump-stream-by-name=WordDocument test/test.doc
officeparser.py --print-streams test/test.doc

If you find something missing, please provide a reproducer (shell command) so I can use it to test my work and backward compatibility.

Fixes: #18

xambroz commented 5 years ago

Thank you very much for the help, it is cool. I am adding one more patch for xrange, one more for the "to_hex" output where the usage of binary strings would clash, and one small cosmetic patch to get rid of the annoying error about a missing filename when officeparser is executed without parameters. Still testing.

frenzymadness commented 5 years ago

@xambroz I can give you commit rights to my repository so you can continue there and your commits appear here. What do you think?

xambroz commented 5 years ago

Thank you - that would work. I will add what I have.

Currently I have patches which make it work on a plain Office file.

The only thing I know is not working yet is the extraction of macros, but I hope to fix that as well.

xambroz commented 5 years ago

In the meantime, this is what I have to add at this point: https://github.com/frenzymadness/officeparser/pull/1

I know that --extract-macros is not working in python3. Tested like this:

1) download a malware sample xls with macros from hybrid-analysis.com: https://www.hybrid-analysis.com/sample/8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f?environmentId=100

2) gunzip the file

gunzip 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin.gz

3) try with python2

python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 836, in _main
    buffer = StringIO()
NameError: global name 'StringIO' is not defined

4) try with python3

$ python3 $(which officeparser.py) --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin 
Traceback (most recent call last):
  File "/usr/bin/officeparser.py", line 1234, in <module>
    _main()
  File "/usr/bin/officeparser.py", line 835, in _main
    buffer = StringIO()
NameError: name 'StringIO' is not defined

Even adding "from io import StringIO" does not directly fix the situation:

5) try with python2 and io.StringIO

python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 837, in _main
    buffer.write(ofdoc.get_stream(project.index))
TypeError: unicode argument expected, got 'str'

6) try with python3 and io.StringIO

python3 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 837, in _main
    buffer.write(ofdoc.get_stream(project.index))
TypeError: string argument expected, got 'bytes'

The original (cStringIO.StringIO) gives this:

$ python2 officeparser.py.orig --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin 
INFO: Saving VBA code to ./Sem_1.cls
INFO: Saving VBA code to ./Page1_1.cls
INFO: Saving VBA code to ./Module1_1.bas
INFO: Saving VBA code to ./UserForm1_1.frm
INFO: Saving VBA code to ./Module2_1.bas
INFO: Saving VBA code to ./Module3_1.bas
INFO: Saving VBA code to ./UserForm6_1.frm
INFO: Saving VBA code to ./Page11_1.cls
INFO: Saving VBA code to ./Module6_1.bas
INFO: Saving VBA code to ./Module5_1.bas
INFO: Saving VBA code to ./Module4_1.bas
INFO: Saving VBA code to ./Class1_1.cls
INFO: Saving VBA code to ./Sheet1_1.cls
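
The two TypeErrors above suggest that get_stream() returns a byte string (str on Python 2, bytes on Python 3), while io.StringIO only accepts text. A minimal sketch of one possible way around this, assuming the buffer only needs to collect raw stream bytes: io.BytesIO exists on both Python 2.7 and Python 3 and accepts byte strings on both. The helper name collect_stream_bytes and the cp1252 code page are assumptions made for illustration, not part of officeparser.

```python
from io import BytesIO


def collect_stream_bytes(chunks):
    """Concatenate raw stream chunks into one byte string.

    Hypothetical helper for illustration; not an officeparser function.
    """
    buffer = BytesIO()
    for chunk in chunks:
        # BytesIO.write() takes byte strings on Python 2.7 and Python 3.
        buffer.write(chunk)
    return buffer.getvalue()


# Stand-in data; in the real tool the chunks would come from the document.
data = collect_stream_bytes([b"Attribute VB_Name = ", b'"Module1"\r\n'])

# Decode only at the point where text is needed. cp1252 is an assumed
# code page, not something confirmed in this thread.
text = data.decode("cp1252")
```

Any code that later reads lines from such a buffer would get byte strings back, so it would need an explicit decode step like the last line.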

frenzymadness commented 5 years ago

I am investigating the macros extraction.

The very first question I need answered is whether the PROJECT stream in a document should be handled as bytes or as Unicode. Right now it's mixed, and that's the reason why it does not work in Python 3. Do I understand correctly that it contains some VB script code, so it should be handled as Unicode?

xambroz commented 5 years ago

Hello, yes, the PROJECT stream in Office documents seems to hold metadata about the macros in plaintext, in INI format.

$ python2 officeparser.py.orig --dump-stream-by-name PROJECT word_form.doc 
ID="{F71D9A8C-3763-458D-A309-7E5E41C49A1A}"
Document=ThisDocument/&H00000000
Module=NewMacros
Name="Project"
HelpContextID="0"
VersionCompatible32="393222000"
CMG="C1C327AD2BAD2BAD2BAD2B"
DPB="828064A724A824A824"
GC="4341A5E667E767E798"

[Host Extender Info]
&H00000001={3832D640-CF90-11CF-8E43-00A0C911005A};VBE;&H00000000
&H00000002={000209F2-0000-0000-C000-000000000046};Word8.0;&H00000000

[Workspace]
ThisDocument=46, 46, 678, 454, 
NewMacros=69, 69, 678, 506, Z
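
Since the dump above is line-oriented text, one way to handle it on both Python versions is to keep the stream as bytes until a single decode step and then parse text only. The sketch below is an illustration under assumptions, not the officeparser implementation: parse_project_stream is a hypothetical name, and cp1252 is an assumed default (the code page actually in use may differ per document). A hand-written parser is used instead of configparser because the leading properties have no section header and keys such as Module can repeat.

```python
def parse_project_stream(raw, codepage="cp1252"):
    """Parse PROJECT stream bytes into {section: [(key, value), ...]}.

    Hypothetical helper. Properties that appear before the first
    [section] header are stored under the empty-string key.
    """
    text = raw.decode(codepage, errors="replace")  # bytes -> text, once
    sections = {"": []}
    current = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Keep (key, value) pairs in a list so repeated keys survive.
            sections[current].append((key, value.strip('"')))
    return sections


sample = (b'ID="{X}"\r\nModule=NewMacros\r\nName="Project"\r\n'
          b'\r\n[Workspace]\r\nNewMacros=69, 69, 678, 506, Z\r\n')
parsed = parse_project_stream(sample)
module_names = [v for k, v in parsed[""] if k == "Module"]
```

With the sample bytes, module_names holds the single entry NewMacros, matching the Module= line in the dump.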

frenzymadness commented 4 years ago

Hello.

Unfortunately, I don't have the capacity to work on this anymore. Could we please merge this PR to make officeparser at least partially Python 3 compatible, so others can continue without repeating the same work?