Some directories are not reported

kukovecz commented 1 year ago

When extracting chunks, there is logic, here, that handles chunks covering the whole file differently. As a result, in some cases some directories are not reported.

Reproduce this with this test file: test.zip. This is actually from the integration test suite, but I had to zip it for GitHub to let me attach it.

If I run this file with unblob and check the report, I get the following item:

Part of the generated JSON report:

```json
{
  "task": {
    "path": "/tmp/fruits.lvl1.lzh",
    "depth": 0,
    "chunk_id": "",
    "__typename__": "Task"
  },
  "reports": [
    {
      "path": "/tmp/fruits.lvl1.lzh",
      "size": 146,
      "is_dir": false,
      "is_file": true,
      "is_link": false,
      "link_target": null,
      "__typename__": "StatReport"
    },
    {
      "magic": " LHarc 1.x/ARX archive data [lh0], 0x0 OS, with \"apple.txt\"\\012- data",
      "mime_type": "application/x-lzh-compressed",
      "__typename__": "FileMagicReport"
    },
    {
      "md5": "cf71709694cd2f3e98fcf87524194beb",
      "sha1": "701248bfd7dd7a7360ce237754a82425d1d13346",
      "sha256": "e016f42094b088058e7fa5d9c3f98bafaeac87899205192d95b8001f72058a0f",
      "__typename__": "HashReport"
    },
    {
      "chunk_id": "47941:3",
      "handler_name": "lzh",
      "start_offset": 96,
      "end_offset": 146,
      "size": 50,
      "is_encrypted": false,
      "extraction_reports": [],
      "__typename__": "ChunkReport"
    },
    {
      "chunk_id": "47941:2",
      "handler_name": "lzh",
      "start_offset": 47,
      "end_offset": 96,
      "size": 49,
      "is_encrypted": false,
      "extraction_reports": [],
      "__typename__": "ChunkReport"
    },
    {
      "chunk_id": "47941:1",
      "handler_name": "lzh",
      "start_offset": 0,
      "end_offset": 47,
      "size": 47,
      "is_encrypted": false,
      "extraction_reports": [],
      "__typename__": "ChunkReport"
    }
  ],
  "subtasks": [
    {
      "path": "/tmp/unblob/fruits.lvl1.lzh_extract/96-146.lzh_extract",
      "depth": 1,
      "chunk_id": "47941:3",
      "__typename__": "Task"
    },
    {
      "path": "/tmp/unblob/fruits.lvl1.lzh_extract/47-96.lzh_extract",
      "depth": 1,
      "chunk_id": "47941:2",
      "__typename__": "Task"
    },
    {
      "path": "/tmp/unblob/fruits.lvl1.lzh_extract/0-47.lzh_extract",
      "depth": 1,
      "chunk_id": "47941:1",
      "__typename__": "Task"
    }
  ],
  "__typename__": "TaskResult"
}
```

This means that when unblob handles /tmp/fruits.lvl1.lzh, it creates 3 subtasks:

/tmp/unblob/fruits.lvl1.lzh_extract/96-146.lzh_extract
/tmp/unblob/fruits.lvl1.lzh_extract/47-96.lzh_extract
/tmp/unblob/fruits.lvl1.lzh_extract/0-47.lzh_extract

It then continues to run for those (sub)tasks. However, a task for the /tmp/unblob/fruits.lvl1.lzh_extract directory is never created, so that directory exists in the file system without appearing in the generated report.
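To make the gap concrete, here is a minimal sketch that scans report entries for parent directories of subtasks that never show up as a task path. It assumes the report can be loaded as a list of `TaskResult` dictionaries shaped like the excerpt above; the function name `unreported_parent_dirs` is hypothetical and not part of unblob.

```python
from pathlib import PurePosixPath


def unreported_parent_dirs(task_results):
    """Return parent directories of subtasks that never appear as a task path.

    Assumes `task_results` is a list of dicts shaped like the TaskResult
    excerpt above (an assumption about the report layout, not unblob's API).
    """
    reported = {result["task"]["path"] for result in task_results}
    missing = set()
    for result in task_results:
        for subtask in result.get("subtasks", []):
            parent = str(PurePosixPath(subtask["path"]).parent)
            if parent not in reported:
                missing.add(parent)
    return sorted(missing)


# Trimmed copy of the excerpt above: only the fields the function reads.
example = [
    {
        "task": {"path": "/tmp/fruits.lvl1.lzh"},
        "subtasks": [
            {"path": "/tmp/unblob/fruits.lvl1.lzh_extract/96-146.lzh_extract"},
            {"path": "/tmp/unblob/fruits.lvl1.lzh_extract/47-96.lzh_extract"},
            {"path": "/tmp/unblob/fruits.lvl1.lzh_extract/0-47.lzh_extract"},
        ],
    }
]

print(unreported_parent_dirs(example))
```

Note that on a full report this simple check would also flag the extraction root itself (here /tmp/unblob), since it is the parent of top-level extraction directories, so a real consumer would probably want to exclude it.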

e3krisztian commented 1 year ago

The directory that is not reported/processed as a Task is an auxiliary directory, used only to carve chunks into. We have not assigned any report to it yet, because it has not been necessary so far.

If it is really needed, a new report type on chunks (CarveReport?) could resolve this.

e3krisztian commented 1 year ago

Related: #326.

I am not sure we need to do anything with it, though.

martonilles commented 1 year ago

An option could be to move the carved files out of the extraction tree structure and store them separately. In most cases we delete the carves anyway, and carves are easily reproducible.

This way we can use the following extraction tree structure:

/tmp/unblob/fruits.lvl1.lzh_96-146_extract/
/tmp/unblob/fruits.lvl1.lzh_47-96_extract/
/tmp/unblob/fruits.lvl1.lzh_0-47_extract/
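As a rough illustration of that naming scheme, the sketch below builds the flattened directory name from the blob name and the chunk offsets. The helper `flat_extract_dir` and its parameters are hypothetical; this is not how unblob currently names directories.

```python
from pathlib import PurePosixPath


def flat_extract_dir(extract_root, blob_name, start_offset, end_offset):
    """Build '<root>/<blob>_<start>-<end>_extract' (hypothetical naming helper)."""
    name = f"{blob_name}_{start_offset}-{end_offset}_extract"
    return str(PurePosixPath(extract_root) / name)


# Offsets taken from the ChunkReport entries in the report excerpt.
for start, end in [(96, 146), (47, 96), (0, 47)]:
    print(flat_extract_dir("/tmp/unblob", "fruits.lvl1.lzh", start, end))
```

With this layout every extraction directory sits directly under the extraction root, so no intermediate, unreported directory is needed.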

qkaiser commented 2 months ago

This issue is causing problems with people wanting to do nice things with the unblob API from Python. See https://github.com/onekey-sec/unblob/issues/878

AndrewFasano commented 2 months ago

This was blocking my ability to map between extraction directories and the blobs they were derived from with the API, so I took a stab at it in #891. I didn't figure out how to add a new task/subtask for carving; instead, I just added a new report type that logs the source and destination of each carve.

With the example fruits.lvl1 file, the following new outputs are produced in the log, which allow a consumer of the log to map between the fruits.lvl1.lzh file and the 3 carved files: fruits.lvl1.lzh_extract/96-146.lzh, fruits.lvl1.lzh_extract/47-96.lzh, and fruits.lvl1.lzh_extract/0-47.lzh.

      {
        "carved_from": "/tmp/unblob/fruits.lvl1.lzh",
        "carved_to": "/tmp/unblob/fruits.lvl1.lzh_extract/96-146.lzh",
        "start_offset": 96,
        "end_offset": 146,
        "handler_name": "lzh",
        "__typename__": "CarveReport"
      },
      {
        "carved_from": "/tmp/unblob/fruits.lvl1.lzh",
        "carved_to": "/tmp/unblob/fruits.lvl1.lzh_extract/47-96.lzh",
        "start_offset": 47,
        "end_offset": 96,
        "handler_name": "lzh",
        "__typename__": "CarveReport"
      },
      {
        "carved_from": "/tmp/unblob/fruits.lvl1.lzh",
        "carved_to": "/tmp/unblob/fruits.lvl1.lzh_extract/0-47.lzh",
        "start_offset": 0,
        "end_offset": 47,
        "handler_name": "lzh",
        "__typename__": "CarveReport"
      },
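A consumer of the log could use these entries roughly as follows. This is a sketch only: it assumes the `CarveReport` entries are available as a flat list of dictionaries, and that an extraction directory is named after its carved file plus an `_extract` suffix, as in the paths shown in this thread. The names `carve_map` and `source_of_extract_dir` are hypothetical.

```python
def carve_map(reports):
    """Map each carved file path to the blob it was carved from."""
    return {
        report["carved_to"]: report["carved_from"]
        for report in reports
        if report.get("__typename__") == "CarveReport"
    }


def source_of_extract_dir(extract_dir, carves):
    """Return the source blob for an extraction directory, or None if unknown.

    Assumes the '<carved file>_extract' naming seen in the example paths.
    """
    suffix = "_extract"
    if not extract_dir.endswith(suffix):
        return None
    return carves.get(extract_dir[: -len(suffix)])


# Two of the CarveReport entries from the log above, trimmed to the used fields.
reports = [
    {
        "carved_from": "/tmp/unblob/fruits.lvl1.lzh",
        "carved_to": "/tmp/unblob/fruits.lvl1.lzh_extract/96-146.lzh",
        "__typename__": "CarveReport",
    },
    {
        "carved_from": "/tmp/unblob/fruits.lvl1.lzh",
        "carved_to": "/tmp/unblob/fruits.lvl1.lzh_extract/0-47.lzh",
        "__typename__": "CarveReport",
    },
]

carves = carve_map(reports)
print(source_of_extract_dir("/tmp/unblob/fruits.lvl1.lzh_extract/96-146.lzh_extract", carves))
```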

onekey-sec / unblob

Some directories are not reported #554