rowingdude / analyzeMFT


Dealing with large MFTs #6

Closed Pballen closed 10 years ago

Pballen commented 11 years ago

The new version (v2.0) starts running into memory problems when dealing with large (>400,000-entry) MFTs. I've run it on a couple of MFTs the old version could easily handle, and I get MemoryErrors.

I suspect this is because the new version tries to store every record in a single giant dictionary (it's possible there's a memory leak somewhere that I'm not seeing, but I don't think that's the case). I'm not sure there's an easy fix. Use NumPy? Don't store the data at all (like the previous version)?

dkovar commented 11 years ago

Greetings,

Yeah, I was afraid of that. I load everything into memory so I can do the full-path analysis after everything is loaded. I'm also carrying around a lot of OOP baggage; it may be better not to do this in an OOP manner.

-David

dkovar commented 11 years ago

Greetings,

Could you pull the new version and try again? I made one fix that might help, but am not sure.

The previous version did store the data, as it had to calculate the full path after all the records were pulled in. If people didn't want the full path, I could make it run in about 25% less time and with a lot less memory.

-David

Pballen commented 11 years ago

Tried it. No luck. I run out of memory during the while loop that reads and builds the MFT; the buildfilepath code never gets a chance to run. Maybe using pickle or shelve could be a good idea, albeit slow.

dkovar commented 11 years ago

Greetings,

Ah, that is helpful, thank you for the feedback. Either I'm leaking memory or I'm just not using it efficiently. I'll dig into it again.

-David

dkovar commented 11 years ago

Greetings,

I de-OOP'd the main record parsing section. This might resolve the memory bloat problem. Please try again, and thanks for helping!

-David

Pballen commented 11 years ago

Sadly, the new version did not solve the problem (although I do like the modularized design, very sleek). However, I was able to modify the new version fairly easily using cPickle to make it work (essentially, instead of saving MFT records, store pickled MFT records and unpickle them when needed).

Change log:

- line 24: import cPickle
- line 136: replace record with cPickle.dumps(record)
- line 148: insert rcd = cPickle.loads(self.mft[i])
- lines 149-154: replace all instances of self.mft[i] with rcd
- line 161: insert rcd = cPickle.loads(self.mft[seqnum])
- lines 162-183: replace all instances of self.mft[seqnum] with rcd
- lines 162-183: before every return, insert self.mft[seqnum] = cPickle.dumps(rcd)
- line 195: insert rcd = cPickle.loads(self.mft[i])
- lines 196-203: replace all instances of self.mft[i] with rcd
- line 204: insert self.mft[i] = cPickle.dumps(rcd)
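In effect, every write to self.mft stores a pickled byte string and every read unpickles it. A minimal sketch of the idea (class and method names here are hypothetical, not the actual diff):

```python
import cPickle  # Python 2; on Python 3, use pickle instead

class PickledMftTable(object):
    """Hold parsed MFT records as pickled byte strings rather than as
    live objects, trading CPU time for a smaller per-record footprint."""

    def __init__(self):
        self.mft = {}

    def put(self, seqnum, record):
        # Serialize immediately instead of keeping the parsed object.
        self.mft[seqnum] = cPickle.dumps(record, cPickle.HIGHEST_PROTOCOL)

    def get(self, seqnum):
        # Deserialize on demand; callers that modify a record must put()
        # it back, mirroring the dumps-before-every-return edits above.
        return cPickle.loads(self.mft[seqnum])
```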

williballenthin commented 11 years ago

I've run into similar issues when processing MFT files from volumes on file servers. It's not uncommon for me to see MFTs with file sizes in excess of 2.0 GB (at 1,024 bytes per record, that's over two million records). In these cases, it's a choice between loading every record into memory and resolving paths at the end, or streaming records out as they are parsed.

The first approach is ideal; however, in the real world, I've found it doesn't scale too well (hence, this issue). @dkovar, you may consider printing out entries as you parse them, rather than building a large data structure that maps to the FS tree in memory. Do you think this is doable with AnalyzeMFT?

dkovar commented 11 years ago

Greetings,

I need to keep some form of the record around to build the full path name for each record at the end of the run. I cannot do this when I first see a record, as I need its parent's record, and the parent may not have been parsed yet.

I currently parse each record, turn it into an intermediate form, and loop until I'm out of records. Once I've processed all the records, I run back through the entire set using recursion to build each record's full path. I then run through it all one more time to print out the results.
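In rough outline (a sketch with hypothetical helper names, not the actual analyzeMFT code):

```python
RECORD_SIZE = 1024  # standard size of an MFT record

def process_mft(mft_file):
    records = {}

    # Pass 1: parse every record into an intermediate form, keyed by
    # its position in the MFT.
    seqnum = 0
    raw = mft_file.read(RECORD_SIZE)
    while len(raw) == RECORD_SIZE:
        records[seqnum] = parse_record(raw)   # parse_record: hypothetical
        seqnum += 1
        raw = mft_file.read(RECORD_SIZE)

    # Pass 2: run back through the whole set, recursively resolving each
    # record's full path through its parent references.
    for seqnum in records:
        build_full_path(seqnum, records)      # build_full_path: hypothetical

    # Pass 3: one final pass to print the results.
    for seqnum in sorted(records):
        print_record(records[seqnum])         # print_record: hypothetical
```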

This is all somewhat painful and inefficient and I'd love suggestions for improving the process.

If I didn't need to calculate full path names, I could do this in a single, efficient pass.

Am I missing something?

I could write it all to an intermediate database ....

-David

williballenthin commented 11 years ago

David,

Yes, you're right that you need to navigate to the parent record in order to completely resolve the full path for a file. This is definitely easier if you've loaded all the records (as you do now), but it requires all that memory. An alternative approach is to load the parent records temporarily, on demand, when you need to build the path. Surely this is a slower approach, because there is a lot of duplicated parsing of records, but it's a tradeoff against memory usage. Perhaps you could fall back to this strategy if the MFT is larger than 500 MB?

As an example of this approach, see the function mft_record_build_path (https://github.com/williballenthin/INDXParse/blob/master/MFT.py#L950). It recursively looks up the parent record, parses it for the path component, and finally returns a complete path.
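The general shape of that on-demand lookup (a sketch, not the INDXParse code; get_name_and_parent is a hypothetical helper that parses a raw record just far enough to pull out the filename and parent reference):

```python
RECORD_SIZE = 1024   # standard MFT record size
ROOT_RECORD = 5      # MFT record 5 is the volume's root directory

def read_raw_record(mft_file, recnum):
    # Seek to the record on demand rather than keeping it in memory.
    mft_file.seek(recnum * RECORD_SIZE)
    return mft_file.read(RECORD_SIZE)

def build_path(mft_file, recnum, cache):
    """Recursively resolve a record's full path via its parent chain.
    A small cache keeps the duplicated parsing from getting out of hand."""
    if recnum == ROOT_RECORD:
        return ""
    if recnum in cache:
        return cache[recnum]
    name, parent = get_name_and_parent(read_raw_record(mft_file, recnum))
    path = build_path(mft_file, parent, cache) + "\\" + name
    cache[recnum] = path
    return path
```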

I haven't studied the analyzeMFT code very recently (post-OOP refactor), so I'm not sure how easy it would be to integrate this approach. But it's something to consider for the future. Let me know if you have any questions and I'll be happy to share my experiences, too.

Pballen commented 11 years ago

An alternate approach that seems good and looks reasonably easy to implement (sketched below):

1. Read through the MFT, saving only the bare minimum needed to construct the filepaths in a minimft.
2. Build the filepaths using the minimft.
3. Read through the MFT a second time. Now that we have the filepaths, it is possible to either store the entire thing into a large MFT (if desired) or simply output the entire thing to the csv.

Reading and building the filepaths for a 750k-record file system took about 3 minutes and needed about 110 MB, leading me to think this approach would work for any reasonably sized MFT.
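A minimal sketch of that two-pass structure (all helper names are hypothetical; the actual change is in the fork linked in the next comment):

```python
import csv
import sys

RECORD_SIZE = 1024  # standard MFT record size

def iter_records(mft_filename):
    # Yield (seqnum, raw_record) pairs without holding the file in memory.
    with open(mft_filename, "rb") as f:
        seqnum = 0
        raw = f.read(RECORD_SIZE)
        while len(raw) == RECORD_SIZE:
            yield seqnum, raw
            seqnum += 1
            raw = f.read(RECORD_SIZE)

def two_pass(mft_filename):
    # Pass 1: keep only the bare minimum needed for path building --
    # each record's filename and parent reference (the "minimft").
    minimft = {}
    for seqnum, raw in iter_records(mft_filename):
        minimft[seqnum] = get_name_and_parent(raw)             # hypothetical

    # Build every full path from the minimal table.
    paths = dict((s, resolve_path(s, minimft)) for s in minimft)  # hypothetical

    # Pass 2: re-read the MFT and emit one CSV row per record, attaching
    # the precomputed path instead of holding whole records in memory.
    writer = csv.writer(sys.stdout)
    for seqnum, raw in iter_records(mft_filename):
        writer.writerow(record_to_row(raw, paths[seqnum]))     # hypothetical
```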

Pballen commented 11 years ago

I just uploaded a version with the changes described above implemented. It seems to work for large MFTs. From start to finish, my 750k-record system took 4.5 minutes, which is pretty good. I think it would also work for Willi's super-large MFTs, although testing would be good.

By default, the MFT is not stored in memory. This can be toggled with the -s flag (before running, the code will also perform a quick check to see if there's enough room in memory, and quit if there's clearly not enough).
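The pre-flight check could be as simple as comparing file size against available memory (a sketch; using psutil and a fixed inflation factor are my assumptions, not necessarily what the fork does):

```python
import os
import sys

def check_memory(mft_filename, inflation_factor=10):
    """Rough pre-flight check before storing the MFT in memory (-s):
    parsed records cost several times their raw 1,024 bytes, so bail
    out early if that clearly won't fit. The factor of 10 is a guess."""
    needed = os.path.getsize(mft_filename) * inflation_factor
    try:
        import psutil  # third-party; an assumption, not a stated dependency
    except ImportError:
        return  # can't measure available memory; proceed optimistically
    if needed > psutil.virtual_memory().available:
        sys.exit("MFT too large to hold in memory; rerun without -s")
```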

Take a look when you get the chance: https://github.com/Pballen/analyzeMFT/tree/master/analyzemft

dkovar commented 10 years ago

Thank you for the contributions, much appreciated!

-David