Closed qkaiser closed 3 months ago
Thanks for drafting a fix for this issue. If the files are the same, skipping the syntax error and keeping everything else the same seems like a good solution.
When the files are different, it's probably useful for most users to be able to see all versions. I don't love the .1
syntax though, just because that makes the output directory less like how the files would look like in a standard system that was using the filesystem.
I don't have a deep understanding of these file formats. When two files exist with the same name what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden? If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good and then adding the second (or third, etc) into a secondary output location?
I tried this out and found there's still an issue with some firmware images: if fs_path.absolute_path.exists()
and fs_path.absolute_path
is a directory, the unlink call fails:
2024-02-12 02:15.14 [error ] Unknown error happened while extracting chunk pid=2445276
Traceback (most recent call last):
File "/unblob/unblob/processing.py", line 607, in _extract_chunk
if result := chunk.extract(inpath, extract_dir):
File "/unblob/unblob/models.py", line 118, in extract
return self.handler.extract(inpath, outdir)
File "/unblob/unblob/models.py", line 455, in extract
return self.EXTRACTOR.extract(inpath, outdir)
File "/unblob/unblob/handlers/archive/cpio.py", line 384, in extract
parser.dump_entries(fs)
File "/unblob/unblob/handlers/archive/cpio.py", line 217, in dump_entries
fs.mkdir(
File "/unblob/unblob/file_utils.py", line 524, in mkdir
safe_path = self._get_extraction_path(path, "mkdir")
File "/unblob/unblob/file_utils.py", line 485, in _get_extraction_path
fs_path.absolute_path.unlink()
File "/usr/lib/python3.10/pathlib.py", line 1206, in unlink
self._accessor.unlink(self)
IsADirectoryError: [Errno 21] Is a directory: '/tmp/tmpg86lywif/DIR-655_REVB_FIRMWARE_2.00.ZIP_extract/dir655_revB_FW_200NA/DIR655B1_FW200NAb33.bin_extract/64-6212776.gzip_extract/gzip.uncompressed_extract/2906880-7696567.gzip_extract/gzip.uncompressed_extract/dev'
I don't have a deep understanding of these file formats. When two files exist with the same name what would actually happen in a typical system that used the format? Would only one version of the file show up and one be hidden?
Yes, that's usually what happens. Silently.
If that's the case (and there's a way to tell which one would show up), perhaps selecting the normal file for the standard unblob output location would be good and then adding the second (or third, etc) into a secondary output location?
Let's start with an implementation that simply overwrites and see what comes up from the team during review.
Sounds good to me.
In terms of this patch and the error I found above, I initially tried fixing it by making it support deleting the directory when it already existed. That turned out to be a bad idea as it would end up deleting a directory full of lots of extracted files only to replace it with an empty directory. Things seem better if it overwrites duplicate files and ignores existing directories:
diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index c96da27..68727d7 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -477,13 +477,22 @@ class FileSystem:
fs_path = self._fs_path(path)
if fs_path.absolute_path.exists():
- report = ExtractionProblem(
- path=str(fs_path.relative_path),
- problem=f"Attempting to create a file that already exists through {path_use_description}",
- resolution="Overwrite.",
- )
- fs_path.absolute_path.unlink()
- self.record_problem(report)
+ if fs_path.absolute_path.is_file():
+ report = ExtractionProblem(
+ path=str(fs_path.relative_path),
+ problem=f"Attempting to create a file that already exists through {path_use_description}",
+ resolution="Overwrite.",
+ )
+ self.record_problem(report)
+ fs_path.absolute_path.unlink()
+
+ elif fs_path.absolute_path.is_dir():
+ report = ExtractionProblem(
+ path=str(fs_path.relative_path),
+ problem=f"Attempting to create a directory that already exists through {path_use_description}",
+ resolution="Ignore",
+ )
+ self.record_problem(report)
if not fs_path.is_safe:
report = PathTraversalProblem(
And there's another issue in here too - in _get_checked_link
the src
is the link target, while the dst
is the file that's being created (discussion here). These names are very confusing. So the file already exists error should only be raised if the dst
already exists.
diff --git a/unblob/file_utils.py b/unblob/file_utils.py
index 65d7ad6..4ee7f97 100644
--- a/unblob/file_utils.py
+++ b/unblob/file_utils.py
@@ -565,7 +565,7 @@ class FileSystem:
def _get_checked_link(self, src: Path, dst: Path) -> Optional[_FSLink]:
link = _FSLink(root=self.root, src=src, dst=dst)
- if link.src.absolute_path.exists():
+ if link.dst.absolute_path.exists():
self.record_problem(link.format_report("File already exists."))
return None
if not link.is_safe:
The code I pushed is the works of @qkaiser and @AndrewFasano.
The issue is that we are tripping two tests:
test_open_no_path_traversal
, fails because of assert sandbox.problems == []
, which is no longer true because we raise Overwriting already existing file
on file overwritestest_open
fails because of
with sandbox.open(path, "rb+") as f: assert f.read() == bytes(100) + b"text"
because the file already exists.1.
I think is a non issue
2.
do we accept that a file CAN BE overwritten, thus dropping that part of the test, or do we not overwrite ?
@nyuware for test_open_no_path_traversal
, you can filter on sandbox.problems
. The test should verify that sandbox.problems
does not contain PathTraversalProblem
. Since the problem reported by your change is a ExtractionProblem
it will be ok.
The test_open
failure is indicative of something that must change in the API. We should only report a problem when trying to create a file that already exists. In the test_open
, we're creating a file and then simply reading from it. Opening a file in read mode should not lead to an overwrite.
_get_extraction_path
could take a mode
parameter that defaults to wb
, if the mode is not write, then we should not report an ExtractionProblem
and overwrite.
The FileSystem
class (as its doc-string says) is intended to limit the output to a well known directory, and the class' responsibility should end there.
The original problem is a CPIO extractor problem. I think the resolution of conflicting output files should be handled mostly in the CPIO extractor.
The cpio
extractor is using the FileSystem.carve()
method, which will die, if the output already exists.
Fortunately there is already a FileSystem.unlink(path)
method, which will delete path
, but will not complain if path
does not exist.
So I would simply put
fs.unlink(entry.path)
before
fs.carve(entry.path, self.file, entry.start_offset, entry.size)
in the cpio
extractor.
I would also add a test file with duplicate files to cover the new overwriting behavior.
An alternative implementation could be to smart up the carve()
function and method with an overwrite_existing=False
default parameter, and change the carve
call in cpio
to have the extra overwrite_existing=True
.
@e3krisztian yes that's the solution we agreed on with @nyuware , tyvm for your input ! :)
According to @e3krisztian's comment, we decided to overwrite
the files if they already exists by doing fs.unlink(_file_)
before it gets carved
again and fails to do so.
The 3 pushes were me failing to rebase the branch and forgetting the integration tests, apologies about that.
Early draft to take care of archives and filesystems holding duplicate entries.
We need to decide on a strategy when that happens:
.1
à la wget) ? (rewrite)Do we want to compare hashes before creating a renamed entry or not ?
What happens if we have triplicate or quadruplicate entries ?
Resolve #754