rowingdude / analyzeMFT

MIT License

Filename bug #17

Open Pballen opened 10 years ago

Pballen commented 10 years ago

The mft has problems dealing w/ filenames with periods in them. For example, "adobe/reader 9.0" is reported as "adobe/reader~1.0" and "9.3.0" becomes "932E79D~1.0". I tried looking at the unicode hack, but couldn't come up w/ any obvious solutions.

dkovar commented 10 years ago

I updated the filename processing last week or the week before. Are you running the latest code?

Pballen commented 10 years ago

I got the newer version and ran it. I'm getting some strange results. I think the periods are screwing things up?

"Documents and Settings\Admin\Application Data\Sun\Java\jre1.6.0_05" becomes "path\JRE16~1.0_0" "WINDOWS\assembly\NativeImages_v2.0.50727_32\System.ServiceModel" becomes "path\System~4.SER" "above path\System.ServiceModel#" becomes "path\Sy1587~1.SER"

Pballen commented 10 years ago

Curiouser and curiouser

"Documents and Settings\All Users\Documents\My Music" becomes "path\MYMUSI~1" (which happens to everything in the Music folder). However, "Documents and Settings\All Users\Documents\My Pictures" comes out fine.

dkovar commented 10 years ago

That looks like 8.3 naming:

http://support.microsoft.com/kb/142982

Have you looked at the raw MFT record?
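A rough way to check whether the odd names in a report are 8.3 short names rather than corrupted long names is to test them against the short-name shape. The sketch below is only a heuristic, and the helper name looks_like_short_name is made up for illustration; it is not part of analyzeMFT. It assumes the usual pattern of a base of at most 8 characters containing "~" plus digits, with an optional extension of at most 3 characters.

```python
import re

# Heuristic shape of a generated 8.3 short name: BASE~N or BASE~N.EXT.
# This pattern is an assumption for illustration, not a full 8.3 validator.
SHORT_NAME_RE = re.compile(r"^[^.\\/ ]+~\d+(\.[^.\\/ ]{1,3})?$")


def looks_like_short_name(name):
    """Return True if the last path component looks like an 8.3 alias."""
    if not SHORT_NAME_RE.match(name):
        return False
    base = name.split(".", 1)[0]
    return len(base) <= 8  # base (including the ~N tail) is at most 8 chars


if __name__ == "__main__":
    for candidate in ["MYMUSI~1", "JRE16~1.0_0", "System~4.SER", "My Pictures"]:
        print(candidate, looks_like_short_name(candidate))
```

If most of the "bad" names pass this check, the tool is probably printing the DOS name from the record instead of the long one.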

williballenthin commented 10 years ago

Something to consider: an MFT record may have multiple $FN attributes, and it looks like analyzeMFT always picks the first encountered as the filename (https://github.com/dkovar/analyzeMFT/blob/master/analyzemft/mft.py#L331). I've found that the ordering of filename attributes is not consistent, and probably shouldn't be relied upon.

The namespace field at offset 0x41 describes which type of filename data a $FN attribute contains. http://lxr.linux.no/linux+v3.8.6/fs/ntfs/layout.h#L1012 lists the possible values, and personally I prioritize 0x1 (FILE_NAME_WIN32) and 0x3 (FILE_NAME_WIN32_AND_DOS) since they're "full" filenames. Perhaps considering these fields, and ordering the attributes in the record structure will ensure the most appropriate filename gets printed.
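To make the suggestion concrete, here is a minimal sketch of picking a name by namespace instead of by position. It is not the analyzeMFT code: parse_fn_name and pick_best_name are hypothetical helpers, the input is assumed to be the $FN attribute content with the attribute header already stripped, and the layout assumed is name length (in characters) at 0x40, namespace at 0x41, and the UTF-16LE name starting at 0x42. Ranking POSIX names above DOS names is my own assumption; only the preference for 0x1 and 0x3 comes from the comment above.

```python
# Namespace values as listed in the Linux NTFS layout.h.
FILE_NAME_POSIX = 0x0
FILE_NAME_WIN32 = 0x1
FILE_NAME_DOS = 0x2
FILE_NAME_WIN32_AND_DOS = 0x3

# Lower rank wins. WIN32 and WIN32_AND_DOS are "full" names; the relative
# order of POSIX and DOS is an assumption made for this sketch.
_RANK = {
    FILE_NAME_WIN32: 0,
    FILE_NAME_WIN32_AND_DOS: 0,
    FILE_NAME_POSIX: 1,
    FILE_NAME_DOS: 2,
}


def parse_fn_name(content):
    """Return (namespace, name) from raw $FN attribute content (bytes)."""
    name_len = content[0x40]   # length in UTF-16 code units
    namespace = content[0x41]
    raw = content[0x42:0x42 + 2 * name_len]
    return namespace, raw.decode("utf-16-le", errors="replace")


def pick_best_name(fn_contents):
    """Choose the preferred filename among all $FN attributes of a record."""
    parsed = [parse_fn_name(c) for c in fn_contents]
    if not parsed:
        return None
    # min() keeps the first entry among equals, so ties fall back to
    # on-disk order.
    return min(parsed, key=lambda p: _RANK.get(p[0], 3))[1]


def make_fn(name, namespace):
    """Build fake $FN content for demonstration (fields before 0x40 zeroed)."""
    return bytes(0x40) + bytes([len(name), namespace]) + name.encode("utf-16-le")


if __name__ == "__main__":
    record = [make_fn("MYMUSI~1", FILE_NAME_DOS), make_fn("My Music", FILE_NAME_WIN32)]
    print(pick_best_name(record))
```

With this ordering, a record that lists its DOS name first still reports the long name.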

dkovar commented 10 years ago

Willi,

Superb information, thanks. That should be an easy fix. Shall get this done this week, hopefully.

-David

dkovar commented 10 years ago

Fix added. Please test it and let me know. (And thank you for finding and reporting these bugs!)

Pballen commented 10 years ago

We've gone from lots of bad filenames to only a handful, all limited to a single misread? character.

"Documents and Settings/User/My Documents/My Pictures/Guatemala/IMG_0715.jpeg" becomes "path/IMP_0175.JE-G" (the E has an accent). Similarly, "path/IMG_0196.jpeg" becomes "IMP_0196.JIuG" (the I has an accent, and the u is a microsign). Strangely, all of the other jpegs in the folder seem to come out fine.

"System Vol Info/restore{...}/RP256/40029775.old" becomes "path/40029775.03d".

dkovar commented 10 years ago

If you look at the raw record, is one of the other FN records more accurate? I just grabbed the first "full" name. It may be that I need to prioritize one over the other.

Pballen commented 10 years ago

The raw record only has a single FN record, but it's reading the name as "IMG_0175.J\xc8\x96G". Let me try getting the raw bits. My MFT might be bad?

dkovar commented 10 years ago

The $MFT might be good and the actual filename is the culprit.

Pballen commented 10 years ago

I opened the MFT in a hex viewer. The relevant bits (which do convert to IMG_0175.jpg) are: 49 00 4D 00 47 00 5F 00 30 00 31 00 37 00 36 00 2E 00 4A 00 50 00 47 00

I then added the line s = "".join(["%02X|"%ord(x) for x in bytes]) in the relevant place in decodeFNAttribute. The relevant bits are: 49 00 4D 00 47 00 5F 00 30 00 31 00 37 00 35 00 2E 00 4A 00 16 02 47 00
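For anyone following along, the two dumps can be decoded and compared in a few lines of Python 3. This is only a sketch for checking the bytes pasted above; the variable names and the diff_bytes helper are mine and are not part of analyzeMFT. Filenames in $FN attributes are UTF-16LE, so each character is two bytes.

```python
# Bytes as pasted from the hex viewer.
hex_viewer = bytes.fromhex(
    "49 00 4D 00 47 00 5F 00 30 00 31 00 37 00 36 00 2E 00 4A 00 50 00 47 00"
)
# Bytes as printed from inside decodeFNAttribute.
in_decoder = bytes.fromhex(
    "49 00 4D 00 47 00 5F 00 30 00 31 00 37 00 35 00 2E 00 4A 00 16 02 47 00"
)


def diff_bytes(a, b):
    """Return (offset, byte_a, byte_b) for every position where a and b differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


viewer_name = hex_viewer.decode("utf-16-le")
decoder_name = in_decoder.decode("utf-16-le")

if __name__ == "__main__":
    print(repr(viewer_name))
    print(repr(decoder_name))
    print(decoder_name.encode("utf-8"))
    for offset, x, y in diff_bytes(hex_viewer, in_decoder):
        print("offset %d: %02X vs %02X" % (offset, x, y))
```

The pair 16 02 is the single code point U+0216, whose UTF-8 form is \xc8\x96, which matches the name reported earlier. Note that the first dump, exactly as pasted, has 36 where the second has 35, so it decodes with a 6 in that position; that may just be a transcription slip in the paste.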

I think something strange is happening w/ reading in the raw_record. Not sure what.