onekey-sec / unblob

Extract files from any kind of container formats
https://unblob.org

Rework chunk creation and processing workflow #369

Open vlaci opened 2 years ago

vlaci commented 2 years ago

This issue is about a refactoring idea that came up while reviewing #357. The steps below are described in decreasing level of detail, because the outcome of the later steps can change significantly as new ideas come up while implementing the earlier ones.

search_chunks should return all the chunks

Currently chunks are identified in two distinct places while processing a file: in finder.search_chunks, which returns a list of only the known chunks, and then in processing._FileTask.process, where we fill in the gaps between chunks and at the beginning and end of the input file.

The basic idea is that search_chunks could do this preprocessing itself and, in addition, return an unknown chunk covering the whole file when no known chunks are found. Essentially, the following test change needs to pass:

diff --git a/tests/test_processing.py b/tests/test_processing.py
index 82ff857..71fcf1d 100644
--- a/tests/test_processing.py
+++ b/tests/test_processing.py
@@ -90,7 +90,7 @@ def test_remove_inner_chunks(
     "chunks, file_size, expected",
     [
         ([], 0, []),
-        ([], 10, []),
+        ([], 10, [UnknownChunk(0, 10)]),
         ([ValidChunk(0x0, 0x5)], 5, []),
         ([ValidChunk(0x0, 0x5), ValidChunk(0x5, 0xA)], 10, []),
         ([ValidChunk(0x0, 0x5), ValidChunk(0x5, 0xA)], 12, [UnknownChunk(0xA, 0xC)]),

This requires a separate change so that carve_unknown_chunk is guarded the same way as carve_valid_chunk is: https://github.com/IoT-Inspector/unblob/blob/dbc104ffd3cbebd584af60e8f4fea548824b2a64/unblob/processing.py#L249-L256

Given that the chunks are also ordered, this will make the chunks in the metadata ordered as an added bonus.
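
To make the expected behaviour concrete, here is a minimal sketch of the gap filling that the test cases above describe. The ValidChunk and UnknownChunk classes and the fill_gaps function are hypothetical stand-ins written for this illustration, not unblob's actual models or API, and overlapping chunks are handled only in the simplest possible way.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-ins for illustration; not unblob's real chunk models.
@dataclass(frozen=True)
class ValidChunk:
    start_offset: int
    end_offset: int


@dataclass(frozen=True)
class UnknownChunk:
    start_offset: int
    end_offset: int


def fill_gaps(chunks: List[ValidChunk], file_size: int) -> List[UnknownChunk]:
    """Return the unknown chunks covering every byte not claimed by a valid chunk."""
    unknown = []
    position = 0  # first offset not yet covered by any chunk
    for chunk in sorted(chunks, key=lambda c: c.start_offset):
        if chunk.start_offset > position:
            # Gap before this chunk (or at the beginning of the file).
            unknown.append(UnknownChunk(position, chunk.start_offset))
        position = max(position, chunk.end_offset)
    if position < file_size:
        # Trailing gap; with no valid chunks this covers the whole file.
        unknown.append(UnknownChunk(position, file_size))
    return unknown
```

Because the input is sorted before the gaps are computed, the resulting unknown chunks come out ordered by offset as well.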

OO wrapping of chunks with operations to do on them

After the above changes, there are different ways to move forward; I'll just outline one possible way here.

Adjust metadata creation

The primary goal of these changes is that chunk metadata handling can be encapsulated entirely within the processing module. E.g. chunk-related information can be added to the new wrapping object created in the above steps. The new chunk abstractions also have access to the file path and any extra information we may add to them, so we could also generate predictable IDs from the input path and offset-length pairs.
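
One possible way to get such predictable IDs is to hash the path together with the offset-length pair. This is only a sketch: the chunk_id function, the key format and the 16-character truncation are assumptions made for illustration, not something unblob defines.

```python
import hashlib
from pathlib import Path


def chunk_id(path: Path, start_offset: int, end_offset: int) -> str:
    """Derive a stable ID from the input path and the chunk's offset-length pair.

    Hypothetical helper: the key layout and the truncated SHA-256 digest are
    illustrative choices only.
    """
    length = end_offset - start_offset
    key = f"{path.as_posix()}:{start_offset}:{length}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

The same path and offsets always yield the same ID, so reports from repeated runs over the same input could be compared directly.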

qkaiser commented 2 years ago

On the subject of OO wrapping of chunks, I think this could facilitate the work on #274. An UnknownChunk object could have a dedicated function to do some "introspection" on its content (i.e. the file content covered by that unknown chunk) and report on it by mutating its class into one of Chunk's subclasses.

Example: if we have an UnknownChunk containing only null padding, the introspection function runs and the object mutates into a NullPadChunk.
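
A rough sketch of what that could look like is below. The class layout and the introspect method are assumptions for illustration, not existing unblob code. It also returns a new, more specific object instead of mutating the class in place, which is a swapped-in simplification of the idea above.

```python
from dataclasses import dataclass


# Hypothetical classes for illustration; not unblob's real chunk models.
@dataclass(frozen=True)
class UnknownChunk:
    start_offset: int
    end_offset: int

    def introspect(self, data: bytes) -> "UnknownChunk":
        """Return a more specific chunk type if the content is recognisable."""
        content = data[self.start_offset:self.end_offset]
        # A non-empty range made up only of zero bytes is treated as padding.
        if content and content.count(0) == len(content):
            return NullPadChunk(self.start_offset, self.end_offset)
        return self


@dataclass(frozen=True)
class NullPadChunk(UnknownChunk):
    """An unknown chunk whose content is only null bytes."""
```

For instance, with data = b"\x00" * 4 + b"ABCD", calling UnknownChunk(0, 4).introspect(data) gives a NullPadChunk, while UnknownChunk(4, 8).introspect(data) returns the original object unchanged.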

Just some ideas, please don't hit me if I broke every design pattern in the book :)