moyix / pdbparse

Python code to parse Microsoft PDB files

Fix DBIStream: true number of NameRef is in the sum of cRefCnt #61

Open psrok1 opened 2 months ago

psrok1 commented 2 months ago

Hi and thanks for the great library!

I found that when I try to parse the PDB for combase.dll with GUID 6c146f310d333559974d1d5d3fa2e4da1, it fails to decode some strings contained in the DBI stream structures.

  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 554, in parse
    return PDB7(f, fast_load)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 521, in __init__
    self.read_root(self.root_stream)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 460, in read_root
    pdb_cls(
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 154, in __init__
    self.load()
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 276, in load
    debug = dbi.parse_stream(self.stream_file)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/dbi.py", line 160, in parse_stream
    Name = ("Name" / CString(encoding = "utf8")).parse(Names[NameRef[j]:])
  ...
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/construct/core.py", line 1490, in _decode
    return obj.decode(self.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 0: invalid start byte

The reason is that cRefCnt gives an incorrect number of names when the true number exceeds 64K (the field is only 16 bits wide, so it cannot hold a larger count). This behavior is documented here: https://llvm.org/docs/PDB/DbiStream.html#file-info-substream

NumSourceFiles: In theory this is supposed to contain the number of source files for which this substream contains information. But that would present a problem in that the width of this field being 16-bits would prevent one from having more than 64K source files in a program. In early versions of the file format, this seems to have been the case. In order to support more than this, this field of the is simply ignored, and computed dynamically by summing up the values of the ModFileCounts array (discussed below). In short, this value should be ignored.
FileNameOffsets - An array of NumSourceFiles integers (where NumSourceFiles here refers to the 32-bit value obtained from summing ModFileCountArray), where each integer is an offset into NamesBuffer pointing to a null terminated string.
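To illustrate the rule quoted above, here is a minimal sketch in plain Python that reads a File Info substream and takes the name count from the sum of ModFileCounts instead of the 16-bit header field. It assumes the layout described in the LLVM documentation (NumModules, NumSourceFiles, ModIndices, ModFileCounts, FileNameOffsets, NamesBuffer). The names `parse_file_info` and `build_example` are hypothetical and are not part of pdbparse; this is not the actual patch.

```python
import struct


def parse_file_info(data: bytes):
    """Parse a File Info substream (layout assumed from the LLVM docs).

    Hypothetical helper for illustration; not pdbparse's API.
    """
    # Header: NumModules, then the 16-bit NumSourceFiles, which is ignored.
    num_modules, _num_source_files = struct.unpack_from("<HH", data, 0)
    offset = 4

    # ModIndices: one uint16 per module (read only to advance the offset).
    _mod_indices = struct.unpack_from(f"<{num_modules}H", data, offset)
    offset += 2 * num_modules

    # ModFileCounts: one uint16 per module.
    mod_file_counts = struct.unpack_from(f"<{num_modules}H", data, offset)
    offset += 2 * num_modules

    # The true number of file name offsets is the sum of the per-module counts.
    total_files = sum(mod_file_counts)
    name_offsets = struct.unpack_from(f"<{total_files}I", data, offset)
    offset += 4 * total_files

    # Everything after the offsets is the buffer of null-terminated names.
    names_buffer = data[offset:]
    names = []
    for name_offset in name_offsets:
        end = names_buffer.index(b"\x00", name_offset)
        names.append(names_buffer[name_offset:end].decode("utf-8", errors="replace"))
    return mod_file_counts, names


def build_example() -> bytes:
    """Build a tiny synthetic substream whose header count is deliberately wrong."""
    names = b"a.c\x00b.c\x00c.h\x00"
    header = struct.pack("<HH", 2, 1)           # NumSourceFiles says 1, truth is 3
    mod_indices = struct.pack("<2H", 0, 2)
    mod_file_counts = struct.pack("<2H", 2, 1)  # 2 + 1 = 3 source files
    offsets = struct.pack("<3I", 0, 4, 8)
    return header + mod_indices + mod_file_counts + offsets + names


if __name__ == "__main__":
    counts, file_names = parse_file_info(build_example())
    print(counts, file_names)
```

In the synthetic example the header claims one source file, yet all three names are recovered because the count comes from the per-module array. Trusting the header value instead would leave the offsets array too short, so the parser would treat the remaining offsets as the start of the names buffer.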

After the fix, combase.pdb is parsed correctly.

psrok1 commented 2 months ago

By the way, I temporarily merged your library code into https://github.com/CERT-Polska/drakpdb, as you haven't made any releases for a long time and I can't pin to a Git commit if I want to publish a dependent package on PyPI.

I have to say that I really like the simplicity of your library and the fact that it doesn't give up when it reaches a new, unknown structure or leaf type. I have tested a few libraries on current Windows PDBs, and pdbparse is the only one so far that is able to deliver basic information about exports and simple structures. I have tried the other solutions like:

So I hope you're still interested in maintaining this library and I think I will be coming back with patches from time to time. Cheers!