sepinf-inc / IPED

IPED Digital Forensic Tool. It is open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

Add a RawStringsParser for non Latin1 languages #441

Open lfcnassif opened 3 years ago

lfcnassif commented 3 years ago

The current RawStringsParser, used to extract strings from unallocated, unknown, corrupted or unsupported files, extracts Latin1 script text encoded in windows-1252, UTF-8 or UTF-16, even when these encodings are mixed in the same file. It is a custom implementation and a very fast strings extractor.

We should add a more generic strings extractor in which the encodings or scripts to extract can be configured by the user, even if it is much slower than the default one.
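To make the proposal concrete, here is a minimal sketch of what a charset-configurable extractor could look like. It is not IPED code: the class `ConfigurableStrings`, the method `extract` and the `minLen` parameter are hypothetical names used only for illustration, and the "printable" test is a deliberately simple assumption.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of IPED.
class ConfigurableStrings {

    // Decodes the raw bytes with a user-chosen charset and returns every run
    // of at least minLen "textual" code points.
    static List<String> extract(byte[] data, Charset charset, int minLen) {
        // new String(byte[], Charset) replaces malformed input with the
        // charset's replacement character (U+FFFD for the Unicode charsets).
        String text = new String(data, charset);
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLen = 0; // run length counted in code points
        for (int i = 0; i < text.length();) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTextual(cp)) {
                run.appendCodePoint(cp);
                runLen++;
            } else {
                if (runLen >= minLen) out.add(run.toString());
                run.setLength(0);
                runLen = 0;
            }
        }
        if (runLen >= minLen) out.add(run.toString());
        return out;
    }

    // Assumption: letters and digits of any script, spaces and ASCII
    // punctuation count as text; the replacement character never does.
    static boolean isTextual(int cp) {
        if (cp == 0xFFFD) return false;
        return Character.isLetterOrDigit(cp) || cp == ' ' || (cp > 0x20 && cp < 0x7F);
    }
}
```

A caller would pass the configured charset, for example `ConfigurableStrings.extract(bytes, StandardCharsets.UTF_16LE, 4)`. Decoding the whole buffer with a single charset is the simplest possible design and would not handle encodings mixed in the same file.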

lfcnassif commented 2 years ago

I just came up with an idea for this: we could use a heuristic similar to the charset detection implemented months ago for PST/OST emails with unknown charset, running the detection on small blocks with some overlap between them. I am not sure about the block and overlap sizes; this would need testing. It will probably be slower and disabled by default, but should be generic enough to handle different charsets and scripts.
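The block-based idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `BlockCharsetSketch`, `blockStarts`, `score` and `bestCharset` are hypothetical names, and the scoring function is a naive placeholder (fraction of decoded characters that are letters, digits or spaces), not the heuristic used for PST/OST emails.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of IPED.
class BlockCharsetSketch {

    // Start offsets of blocks of blockSize bytes, where each block shares
    // `overlap` bytes with the previous one.
    static List<Integer> blockStarts(int length, int blockSize, int overlap) {
        if (blockSize <= 0 || overlap < 0 || overlap >= blockSize)
            throw new IllegalArgumentException("need 0 <= overlap < blockSize");
        List<Integer> starts = new ArrayList<>();
        int step = blockSize - overlap;
        for (int start = 0; start < length; start += step) {
            starts.add(start);
            if (start + blockSize >= length) break; // this block reaches the end
        }
        return starts;
    }

    // Placeholder score (an assumption): fraction of decoded chars that are
    // letters, digits or spaces and not the replacement character.
    static double score(byte[] data, int off, int len, Charset cs) {
        String s = new String(data, off, len, cs);
        if (s.isEmpty()) return 0;
        int good = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\uFFFD' && (Character.isLetterOrDigit(c) || c == ' ')) good++;
        }
        return (double) good / s.length();
    }

    // Picks the candidate charset with the highest score for one block;
    // on ties the earlier candidate wins.
    static Charset bestCharset(byte[] data, int off, int len, List<Charset> candidates) {
        Charset best = null;
        double bestScore = -1;
        for (Charset cs : candidates) {
            double sc = score(data, off, len, cs);
            if (sc > bestScore) {
                bestScore = sc;
                best = cs;
            }
        }
        return best;
    }
}
```

A caller would loop over `blockStarts(data.length, blockSize, overlap)`, call `bestCharset` on each block with `len = Math.min(blockSize, data.length - start)`, and then extract strings from that block with the chosen charset. The overlap is there so that a string crossing a block boundary is still seen whole in at least one block; suitable sizes are exactly what would need testing.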