onekey-sec / unblob

Extract files from any kind of container formats
https://unblob.org

fix(fs): duplicate entries handling in FileSystem API. #756

Closed qkaiser closed 3 months ago

qkaiser commented 5 months ago

Early draft to take care of archives and filesystems holding duplicate entries.

We need to decide on a strategy when that happens:

Do we want to compare hashes before creating a renamed entry or not?

What happens if we have triplicate or quadruplicate entries?
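Both questions can be answered by one small lookup loop. A minimal sketch, not unblob's actual API: `deduplicated_path` and its parameters are hypothetical names, and the hash comparison plus `.N` suffixing is just the strategy under discussion here:

```python
import hashlib
from pathlib import Path
from typing import Optional

def deduplicated_path(outdir: Path, name: str, data: bytes) -> Optional[Path]:
    """Pick an output path for `data`, or return None if an identical copy exists.

    Each name clash moves to the next free numeric suffix (name, name.1,
    name.2, ...), so triplicates and quadruplicates fall out of the same loop.
    """
    candidate = outdir / name
    suffix = 0
    while candidate.exists():
        if hashlib.sha256(candidate.read_bytes()).digest() == hashlib.sha256(data).digest():
            return None  # identical content: skip instead of creating a renamed copy
        suffix += 1
        candidate = outdir / f"{name}.{suffix}"
    return candidate
```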

Resolve #754

AndrewFasano commented 5 months ago

Thanks for drafting a fix for this issue. If the files are the same, skipping the duplicate and keeping everything else the same seems like a good solution.

When the files are different, it's probably useful for most users to be able to see all versions. I don't love the .1 syntax though, just because it makes the output directory less like how the files would look in a standard system that was using the filesystem.

I don't have a deep understanding of these file formats. When two files exist with the same name what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden? If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good and then adding the second (or third, etc) into a secondary output location?

AndrewFasano commented 5 months ago

I tried this out and found there's still an issue with some firmware images: if fs_path.absolute_path.exists() and fs_path.absolute_path is a directory, the unlink call fails:

```
2024-02-12 02:15.14 [error    ] Unknown error happened while extracting chunk pid=2445276
Traceback (most recent call last):
  File "/unblob/unblob/processing.py", line 607, in _extract_chunk
    if result := chunk.extract(inpath, extract_dir):
  File "/unblob/unblob/models.py", line 118, in extract
    return self.handler.extract(inpath, outdir)
  File "/unblob/unblob/models.py", line 455, in extract
    return self.EXTRACTOR.extract(inpath, outdir)
  File "/unblob/unblob/handlers/archive/cpio.py", line 384, in extract
    parser.dump_entries(fs)
  File "/unblob/unblob/handlers/archive/cpio.py", line 217, in dump_entries
    fs.mkdir(
  File "/unblob/unblob/file_utils.py", line 524, in mkdir
    safe_path = self._get_extraction_path(path, "mkdir")
  File "/unblob/unblob/file_utils.py", line 485, in _get_extraction_path
    fs_path.absolute_path.unlink()
  File "/usr/lib/python3.10/pathlib.py", line 1206, in unlink
    self._accessor.unlink(self)
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmpg86lywif/DIR-655_REVB_FIRMWARE_2.00.ZIP_extract/dir655_revB_FW_200NA/DIR655B1_FW200NAb33.bin_extract/64-6212776.gzip_extract/gzip.uncompressed_extract/2906880-7696567.gzip_extract/gzip.uncompressed_extract/dev'
```
qkaiser commented 5 months ago

I don't have a deep understanding of these file formats. When two files exist with the same name what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden?

Yes, that's usually what happens. Silently.

If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good and then adding the second (or third, etc) into a secondary output location?

Let's start with an implementation that simply overwrites and see what comes up from the team during review.

AndrewFasano commented 5 months ago

Sounds good to me.

In terms of this patch and the error I found above, I initially tried fixing it by making it support deleting the directory when it already existed. That turned out to be a bad idea as it would end up deleting a directory full of lots of extracted files only to replace it with an empty directory. Things seem better if it overwrites duplicate files and ignores existing directories:

```diff
diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index c96da27..68727d7 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -477,13 +477,22 @@ class FileSystem:
         fs_path = self._fs_path(path)

         if fs_path.absolute_path.exists():
-            report = ExtractionProblem(
-                path=str(fs_path.relative_path),
-                problem=f"Attempting to create a file that already exists through {path_use_description}",
-                resolution="Overwrite.",
-            )
-            fs_path.absolute_path.unlink()
-            self.record_problem(report)
+            if fs_path.absolute_path.is_file():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a file that already exists through {path_use_description}",
+                    resolution="Overwrite.",
+                )
+                self.record_problem(report)
+                fs_path.absolute_path.unlink()
+
+            elif fs_path.absolute_path.is_dir():
+                report = ExtractionProblem(
+                    path=str(fs_path.relative_path),
+                    problem=f"Attempting to create a directory that already exists through {path_use_description}",
+                    resolution="Ignore",
+                )
+                self.record_problem(report)

         if not fs_path.is_safe:
             report = PathTraversalProblem(
```
AndrewFasano commented 5 months ago

And there's another issue in here too: in _get_checked_link, the src is the link target, while the dst is the file that's being created (see the linked discussion). These names are very confusing. So the "file already exists" error should only be raised if the dst already exists.

```diff
diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 65d7ad6..4ee7f97 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -565,7 +565,7 @@ class FileSystem:
     def _get_checked_link(self, src: Path, dst: Path) -> Optional[_FSLink]:
         link = _FSLink(root=self.root, src=src, dst=dst)

-        if link.src.absolute_path.exists():
+        if link.dst.absolute_path.exists():
             self.record_problem(link.format_report("File already exists."))
             return None
         if not link.is_safe:
```
nyuware commented 4 months ago

The code I pushed is the work of @qkaiser and @AndrewFasano.

  1. Overwrites files that already exist.
  2. Ignores directories that already exist.

The issue is that we are now tripping two tests:

  1. test_open_no_path_traversal fails because of assert sandbox.problems == [], which is no longer true: we now report "Overwriting already existing file" on file overwrites.
  2. test_open fails because of with sandbox.open(path, "rb+") as f: assert f.read() == bytes(100) + b"text", since the file already exists.

I think 1 is a non-issue. For 2, do we accept that a file CAN be overwritten, thus dropping that part of the test, or do we not overwrite?

qkaiser commented 4 months ago

@nyuware for test_open_no_path_traversal, you can filter on sandbox.problems. The test should verify that sandbox.problems does not contain a PathTraversalProblem. Since the problem reported by your change is an ExtractionProblem, it will be ok.
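A sketch of that filtering, with stand-in problem classes (the real ones live in unblob's report types; these placeholders just make the assertion shape runnable):

```python
# Placeholder problem types standing in for unblob's real report classes.
class ExtractionProblem:
    pass

class PathTraversalProblem:
    pass

def assert_no_path_traversal(problems):
    # Replaces the too-strict `assert sandbox.problems == []`:
    # duplicate-file overwrites report an ExtractionProblem, which is fine,
    # while any PathTraversalProblem should still fail the test.
    assert not any(isinstance(p, PathTraversalProblem) for p in problems)
```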

The test_open failure is indicative of something that must change in the API. We should only report a problem when trying to create a file that already exists. In test_open, we're creating a file and then simply reading from it. Opening a file in read mode should not lead to an overwrite.

_get_extraction_path could take a mode parameter that defaults to wb; if the mode is not a write mode, then we should neither report an ExtractionProblem nor overwrite.
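A hypothetical sketch of that mode check (is_write_mode is an invented helper, not unblob's API): only modes that create or truncate a file should trigger the report-and-overwrite path, so test_open's "rb+" read-back stays untouched:

```python
def is_write_mode(mode: str) -> bool:
    """True only for modes that create or truncate a file.

    "rb+" allows writing but opens an existing file in place, so it should
    neither report an ExtractionProblem nor trigger an overwrite.
    """
    return any(flag in mode for flag in ("w", "a", "x"))
```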

e3krisztian commented 4 months ago

The FileSystem class (as its doc-string says) is intended to limit the output to a well known directory, and the class' responsibility should end there.

The original problem is a CPIO extractor problem. I think the resolution of conflicting output files should be handled mostly in the CPIO extractor.

The CPIO extractor uses the FileSystem.carve() method, which will die if the output already exists. Fortunately there is already a FileSystem.unlink(path) method, which will delete path but will not complain if path does not exist. So I would simply put

fs.unlink(entry.path)

before

fs.carve(entry.path, self.file, entry.start_offset, entry.size)

in the cpio extractor.
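Toy stand-ins for the two calls show why that ordering works (these are not unblob's real implementations, just the contract described above): unlink() tolerates a missing path, carve() refuses to clobber an existing one, so unlink-then-carve makes the last duplicate entry win:

```python
from pathlib import Path

def unlink(path: Path) -> None:
    # Mirrors FileSystem.unlink: deleting a missing path is not an error.
    path.unlink(missing_ok=True)

def carve(path: Path, data: bytes) -> None:
    # Mirrors FileSystem.carve dying when the output already exists.
    if path.exists():
        raise FileExistsError(path)
    path.write_bytes(data)

def dump_entry(path: Path, data: bytes) -> None:
    unlink(path)        # no-op on first sight, clears any earlier duplicate
    carve(path, data)   # so carve never sees an existing output
```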

I would also add a test file with duplicate files to cover the new overwriting behavior.


An alternative implementation could be to smarten up the carve() function and method with an overwrite_existing=False default parameter, and change the carve call in the CPIO extractor to pass overwrite_existing=True.
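That alternative would look roughly like this (a sketch only; the real carve() also takes a source file and offsets, omitted here):

```python
from pathlib import Path

def carve(path: Path, data: bytes, overwrite_existing: bool = False) -> None:
    # The default keeps the current strict behavior; callers that expect
    # duplicate entries (e.g. the cpio extractor) opt in explicitly.
    if path.exists() and not overwrite_existing:
        raise FileExistsError(path)
    path.write_bytes(data)
```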

qkaiser commented 4 months ago

@e3krisztian yes that's the solution we agreed on with @nyuware , tyvm for your input ! :)

nyuware commented 3 months ago

Following @e3krisztian's comment, we decided to overwrite files if they already exist, by calling fs.unlink() on the file before it gets carved again (which would otherwise fail).

The 3 pushes were me failing to rebase the branch and forgetting the integration tests; apologies for that.