moyix / pdbparse

Python code to parse Microsoft PDB files

Runs slow. Anyone interested in improving performance? #43

Open mrolle45 opened 6 years ago

mrolle45 commented 6 years ago

I don't want to take the time right now to submit performance enhancements, but perhaps @moyix or some other person reading this note would like to do the work. I find that a tremendous amount of time is spent on file reads, string concatenations, and substring operations. I have seen two ways to speed things up, both of which would be simple to implement:

  1. In StreamFile class, cache the stream pages, so you only have to read them once from the file. Or better, if the platform supports mmap, just mmap the entire PDB file, create a buffer for it, and take a slice of the buffer for a stream page whenever you need it. In the non-mmap case, you could add a method to clear the cache, to be called, for instance, after parsing the entire stream.
  2. In StreamFile._read, see how many pages are spanned by the request. Use the above cache / mmap to get slices of individual pages. Return the slice, or a concatenation of two slices, or use cStringIO (io.BytesIO on Python 3) to assemble more than two slices. Using _read_pages is inefficient because you then have to take a slice of the result.
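
The two suggestions above could be sketched roughly as follows. This is a minimal illustration, not pdbparse's actual code: `MappedStream`, its constructor arguments, and its `read` method are hypothetical stand-ins for a StreamFile-like class, and the sketch assumes the stream is described by a list of page numbers plus a fixed page size.

```python
import mmap


class MappedStream:
    """Hypothetical stand-in for a StreamFile-like class (not pdbparse's API)."""

    def __init__(self, fileobj, pages, page_size, size=None):
        # Map the whole file read-only, so each page is read from disk at most
        # once and the OS takes care of caching.
        self._map = mmap.mmap(fileobj.fileno(), 0, access=mmap.ACCESS_READ)
        self.pages = pages          # file page numbers, in stream order
        self.page_size = page_size
        self.size = len(pages) * page_size if size is None else size

    def _page_slice(self, index, start, end):
        # Slice bytes [start:end) out of the index-th page of the stream.
        base = self.pages[index] * self.page_size
        return self._map[base + start:base + end]

    def read(self, offset, length):
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        first, first_off = divmod(offset, self.page_size)
        last, last_off = divmod(end - 1, self.page_size)
        last_off += 1  # make the end offset exclusive
        if first == last:
            # The request fits in one page: a single slice, no concatenation.
            return self._page_slice(first, first_off, last_off)
        # The request spans several pages: collect only the needed part of
        # each page and join once at the end.
        parts = [self._page_slice(first, first_off, self.page_size)]
        for i in range(first + 1, last):
            parts.append(self._page_slice(i, 0, self.page_size))
        parts.append(self._page_slice(last, 0, last_off))
        return b"".join(parts)

    def close(self):
        self._map.close()
```

A single `b"".join` over per-page slices avoids both repeated string concatenation and slicing a larger intermediate buffer. Whether this helps in practice depends on where the time actually goes, so it is worth measuring before and after.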

I think this would eliminate most of the time spent in parsing a PDB as a whole. You could try profiling pdbparse with a large file, such as ntoskrnl.pdb.
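
For the profiling step, a small helper built on the standard library's cProfile and pstats modules is usually enough. The sketch below is one possible way to do it; the helper name `profile_call` and its `top` parameter are made up for this example.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call paths come first,
    # and print only the first `top` entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()
```

Assuming `pdbparse.parse` is the entry point you use, something like `pdb, report = profile_call(pdbparse.parse, "ntoskrnl.pdb")` followed by `print(report)` should show whether file reads, slicing, or something else dominates.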

ZhangShurong commented 6 years ago

@mrolle45 I tried mmap, but it is still very slow. Do you have any suggestions?

moyix commented 6 years ago

You should try profiling, but my guess is that some of the slowness is due to the use of Construct. One workaround is to only parse the streams you need for a particular task; you can see an example of this here:

https://github.com/moyix/pdbparse/blob/ea5f2aa3165770343d8444e03de1334c8ddcfc56/examples/pdb_tpi_vtypes.py#L160-L162
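
The general idea behind that workaround is to defer the expensive parse until a stream is actually requested. The sketch below shows the pattern in isolation; `LazyStream` and its attributes are hypothetical names chosen for illustration and are not taken from pdbparse.

```python
class LazyStream:
    """Hypothetical wrapper: holds raw bytes and parses them on first use."""

    def __init__(self, name, raw, parser):
        self.name = name
        self._raw = raw
        self._parser = parser   # any callable that turns bytes into a result
        self._parsed = None
        self.loaded = False

    def load(self):
        # Parse only once; later calls return the cached result.
        if not self.loaded:
            self._parsed = self._parser(self._raw)
            self.loaded = True
        return self._parsed
```

With this shape, opening a file only builds cheap wrappers, and the cost of parsing is paid only for the streams on which `load()` is called.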