Open hlein opened 2 years ago
I was going to have a look into this but realized I probably don't have enough time to untangle this right now since it's tied to multiple things, so instead I'll try to leave some notes that might be helpful for anyone else looking into it.
Right now the Handler interface has FromFile(context.Context, io.Reader) chan ([]byte). For the archive handler, we might instead want the return type to be (path string, []byte). Then we could update some field on the chunk.SourceMetadata to represent any sub-archive paths.
The problems that I see with it:
Source types have unique fields for setting paths. For example, in filesystem it would be .Filesystem.File, S3 would be .S3.File, and GitHub's is .Github.File. Even File itself is not guaranteed, as in the case of Circleci, which might be .Circleci.Link (not sure).
A suggestion might be to add something like ArchivePath to SourceMetadata directly, where you can set full paths, like some_container/b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20/layer.tar:/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl. More generally it could look like PATH_TO_FILE_IN_ARCHIVE[:PATH_TO_FILE_IN_SUB_ARCHIVE]...
Description
It would be nice if trufflehog could smartly scan nested .tar files, as seen in e.g. docker containers.
Problem to be Addressed
When scanning a docker image tarball (such as one saved with docker save ...), trufflehog currently just prints the top-level .tar filename for every hit. This doesn't give a lot of transparency to what component inside the image, or what resulting file path inside a container launched using the image, contains the hit.
Description of the Preferred Solution
Best-case, trufflehog would understand and record-keep when looking inside tar archives, and support doing so in a nested fashion, because docker images are typically nested .tar files of multiple layers, and then print out that context on a hit, maybe something like:
Maybe this would be something generalized, that makes trufflehog filesystem smarter. Or, it might have to be a dedicated mode, trufflehog archive or something. Uncompressed .tar is one thing; I expect compressed archives would be more painful.
Additional Context
There is a fuse filesystem for mounting archives which supports recursive/nested archives as well, https://github.com/mxmlnkn/ratarmount, which transparently turns archive files into subdirectories.
So for example:
Or for a large collection of them:
If adding native nested-archive support does not seem worth it/desirable, then perhaps just polish/improve this example and document it somewhere.