Open frenzymadness opened 5 years ago
Thank you very much for the help, it is cool. I am adding one more patch for xrange, one more for the "to_hex" output where the use of binary strings would clash, and one small cosmetic patch to get rid of the annoying error about a missing filename when officeparser is executed without parameters. Still testing.
@xambroz I can give you commit rights to my repository so you can continue there and your commits appear here. What do you think?
Thank you - that would work. I will add what I have.
Currently I have patches which make it work on a plain Office file.
The only thing I know is not working yet is the extraction of macros, but I hope to fix that as well.
In the meantime, this is what I have to add at this point: https://github.com/frenzymadness/officeparser/pull/1
I know that --extract-macros is not working in Python 3. Tested like this:
1) download a malware sample xls with macros from hybrid-analysis.com https://www.hybrid-analysis.com/sample/8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f?environmentId=100
2) gunzip the file
gunzip 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin.gz
3) try with python2
python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
File "officeparser.py", line 1235, in <module>
_main()
File "officeparser.py", line 836, in _main
buffer = StringIO()
NameError: global name 'StringIO' is not defined
4) try with python3
$ python3 $(which officeparser.py) --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
File "/usr/bin/officeparser.py", line 1234, in <module>
_main()
File "/usr/bin/officeparser.py", line 835, in _main
buffer = StringIO()
NameError: name 'StringIO' is not defined
Even adding "from io import StringIO" does not fix the problem by itself:
5) try with python2 and io.StringIO
python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
File "officeparser.py", line 1235, in <module>
_main()
File "officeparser.py", line 837, in _main
buffer.write(ofdoc.get_stream(project.index))
TypeError: unicode argument expected, got 'str'
6) try with python3 and io.StringIO
python3 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
File "officeparser.py", line 1235, in <module>
_main()
File "officeparser.py", line 837, in _main
buffer.write(ofdoc.get_stream(project.index))
TypeError: string argument expected, got 'bytes'
The original code (using cStringIO.StringIO) gives this under Python 2:
$ python2 officeparser.py.orig --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
INFO: Saving VBA code to ./Sem_1.cls
INFO: Saving VBA code to ./Page1_1.cls
INFO: Saving VBA code to ./Module1_1.bas
INFO: Saving VBA code to ./UserForm1_1.frm
INFO: Saving VBA code to ./Module2_1.bas
INFO: Saving VBA code to ./Module3_1.bas
INFO: Saving VBA code to ./UserForm6_1.frm
INFO: Saving VBA code to ./Page11_1.cls
INFO: Saving VBA code to ./Module6_1.bas
INFO: Saving VBA code to ./Module5_1.bas
INFO: Saving VBA code to ./Module4_1.bas
INFO: Saving VBA code to ./Class1_1.cls
INFO: Saving VBA code to ./Sheet1_1.cls
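The two TypeErrors above suggest that the stream data is a byte string, which io.StringIO rejects on both Python 2 and Python 3. A minimal sketch of one possible approach, assuming that ofdoc.get_stream() returns bytes: collect the data in io.BytesIO, which accepts bytes on both versions, and decode only when text is needed. The helper name collect_stream, the sample chunks, and the cp1252 encoding are illustrative assumptions, not code from officeparser.

```python
import io


def collect_stream(chunks):
    """Concatenate byte chunks in a buffer that works on Python 2.7 and 3."""
    buffer = io.BytesIO()  # accepts bytes (str on Python 2), unlike io.StringIO
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()


# Hypothetical stand-in for data returned by ofdoc.get_stream(...)
data = collect_stream([b"Attribute VB_Name = ", b'"Module1"\r\n'])

# Decode only at the point where text is needed; cp1252 is an assumed encoding.
text = data.decode("cp1252")
print(text)
```

Keeping the buffer in bytes and decoding at the edge avoids mixing the two types, which is what both io.StringIO attempts ran into.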
I am investigating the macro extraction.
The first question I need answered is whether the PROJECT stream in a document should be handled as bytes or as Unicode. Right now the two are mixed, which is why it does not work in Python 3. Do I understand correctly that it contains VB script code, so it should be handled as Unicode?
Hello, yes, the PROJECT stream in Office documents seems to hold metadata about the macros in plain text, in INI format.
$ python2 officeparser.py.orig --dump-stream-by-name PROJECT word_form.doc
ID="{F71D9A8C-3763-458D-A309-7E5E41C49A1A}"
Document=ThisDocument/&H00000000
Module=NewMacros
Name="Project"
HelpContextID="0"
VersionCompatible32="393222000"
CMG="C1C327AD2BAD2BAD2BAD2B"
DPB="828064A724A824A824"
GC="4341A5E667E767E798"
[Host Extender Info]
&H00000001={3832D640-CF90-11CF-8E43-00A0C911005A};VBE;&H00000000
&H00000002={000209F2-0000-0000-C000-000000000046};Word8.0;&H00000000
[Workspace]
ThisDocument=46, 46, 678, 454,
NewMacros=69, 69, 678, 506, Z
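Given the INI-like layout shown above, the PROJECT stream could be read as bytes, decoded once, and then parsed as text. The following is a rough sketch under stated assumptions: the function name parse_project_stream is hypothetical, cp1252 is an assumed encoding (the real code page may differ per document), and the sample is a shortened copy of the dump above. Keys are kept as lists of pairs because entries such as Module= may appear more than once.

```python
def parse_project_stream(raw, encoding="cp1252"):
    """Parse PROJECT stream bytes into {section: [(key, value), ...]}.

    Lines before the first [Section] header are stored under the "" key.
    """
    sections = {"": []}
    current = ""
    for line in raw.decode(encoding).splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
        elif "=" in line:
            # Split on the first "=" only; values may contain "=" or ";"
            key, _, value = line.partition("=")
            sections[current].append((key, value))
    return sections


# Shortened sample based on the dump above
sample = b"\r\n".join([
    b'ID="{F71D9A8C-3763-458D-A309-7E5E41C49A1A}"',
    b"Document=ThisDocument/&H00000000",
    b"Module=NewMacros",
    b"[Host Extender Info]",
    b"&H00000001={3832D640-CF90-11CF-8E43-00A0C911005A};VBE;&H00000000",
    b"[Workspace]",
    b"ThisDocument=46, 46, 678, 454, ",
    b"NewMacros=69, 69, 678, 506, Z",
])

sections = parse_project_stream(sample)
for name, pairs in sections.items():
    print(repr(name), pairs)
```

The sketch runs on both Python 2.7 and Python 3 because it decodes the bytes explicitly before any string handling.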
Hello.
Unfortunately, I don't have the capacity to work on this anymore. Could we please merge this PR to make officeparser at least partially Python 3 compatible, so others can continue without repeating the same work?
Hello.
I am trying to make this tool Python 3 compatible while keeping backward compatibility with Python 2.7. I've tested my work with three scenarios and one test Word document. I am not a user of this tool, so I just compared the output under Python 2 and 3, and it seems to be okay.
Tested commands:
If you find something missing, please provide a reproducer (shell command) so I can use it to test my work and backward compatibility.
Fixes: #18