trufflesecurity / trufflehog

Find, verify, and analyze leaked credentials
https://trufflesecurity.com
GNU Affero General Public License v3.0

Valid `.tar` archive inside `.gzip` archive not being extracted #2928

Closed rgmz closed 2 months ago

rgmz commented 2 months ago

TruffleHog Version

https://github.com/trufflesecurity/trufflehog/commit/b0fd70c0ffb7dd9e4dc2fb3e844f7fdadf332e9e

Trace Output

$ trufflehog filesystem licenses.db
πŸ·πŸ”‘πŸ·  TruffleHog. Unearth your secrets. πŸ·πŸ”‘πŸ·

2024-06-05T22:37:51-04:00       info-0  trufflehog      running source  {"source_manager_worker_id": "H6Lcl", "with_units": true}
2024-06-05T22:37:51-04:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "H6Lcl", "unit": "licenses.db", "unit_kind": "unit", "timeout": 30, "error": "error extracting archive with format: .tar: handling file: X11.txt: error creating custom reader: error identifying archive: matching 7z: zlib: invalid header"}
2024-06-05T22:37:51-04:00       info-0  trufflehog      finished scanning       {"chunks": 1229, "bytes": 13177071, "verified_secrets": 0, "unverified_secrets": 0, "scan_duration": "198.796129ms", "trufflehog_version": "dev"}

Actual trace with custom logging: https://gist.github.com/rgmz/ec4cf437744a1f55132ed1830f0c4f4b

Expected Behavior

The nested tar should extract.

Actual Behavior

A valid tar inside a gzip is not extracted:

2024-06-05T22:37:51-04:00 error trufflehog error unarchiving chunk. {"source_manager_worker_id": "H6Lcl", "unit": "licenses.db", "unit_kind": "unit", "timeout": 30, "error": "error extracting archive with format: .tar: handling file: X11.txt: error creating custom reader: error identifying archive: matching 7z: zlib: invalid header"}

Works manually

$ file -i licenses.db
licenses.db: application/gzip; charset=binary
$ gunzip licenses.db
$ ls
licenses
$ file -i licenses
licenses: application/x-tar; charset=binary
$ tar xf licenses
0BSD.hash               BCL.hash                       CC-BY-NC-SA-2.5.hash ...

Steps to Reproduce

  1. Download https://github.com/kubernetes/git-sync/blob/b161f3f0c78b56f27188b4e4aabf672ba0b03706/vendor/github.com/google/licenseclassifier/licenses/licenses.db
  2. Run trufflehog and observe the reported error

Environment

N/A

Additional Context

N/A

References

N/A

rgmz commented 2 months ago

It seems like the file X11.txt is being detected as an archive, for some reason. It's just a text file.

2024-06-07T14:41:23-04:00       info-0  trufflehog      Handling extracted file.        {"source_manager_worker_id": "I7eOu", "unit": "/tmp/licenses.db", "unit_kind": "unit", "timeout": 30, "filename": "WTFPL.txt", "size": 415}
2024-06-07T14:41:23-04:00       info-0  trufflehog      openArchive     {"source_manager_worker_id": "I7eOu", "unit": "/tmp/licenses.db", "unit_kind": "unit", "timeout": 30, "filename": "WTFPL.txt", "size": 415, "depth": 2}
2024-06-07T14:41:23-04:00       info-0  trufflehog      Handling extracted file.        {"source_manager_worker_id": "I7eOu", "unit": "/tmp/licenses.db", "unit_kind": "unit", "timeout": 30, "filename": "WTFPL.hash", "size": 2568}
2024-06-07T14:41:23-04:00       info-0  trufflehog      openArchive     {"source_manager_worker_id": "I7eOu", "unit": "/tmp/licenses.db", "unit_kind": "unit", "timeout": 30, "filename": "WTFPL.hash", "size": 2568, "depth": 2}
2024-06-07T14:41:23-04:00       info-0  trufflehog      Handling extracted file.        {"source_manager_worker_id": "I7eOu", "unit": "/tmp/licenses.db", "unit_kind": "unit", "timeout": 30, "filename": "X11.txt", "size": 1292}
2024-06-07T14:41:23-04:00       info-0  trufflehog      archive.extractorHandler: error creating custom reader  {"source_manager_worker_id": "I7eOu", "unit": "/tmp/licenses.db", "unit_kind": "unit", "timeout": 30, "filename": "X11.txt", "size": 1292, "error": "error identifying archive: matching rar: zlib: invalid header"}
error extracting archive with format: .tar: handling file: X11.txt: error creating custom reader: error identifying archive: matching rar: zlib: invalid header
panic: handling file: X11.txt: error creating custom reader: error identifying archive: matching rar: zlib: invalid header

The `zlib: invalid header` error (reported as `matching 7z` or `matching rar`, depending on the run) is coming from `newFileReader` in handlers.go.

https://github.com/trufflesecurity/trufflehog/blob/f122b295bf4d80edf9218bef2a454a60c039be62/pkg/handlers/handlers.go#L67-L83

rgmz commented 2 months ago

This appears to be an issue with mholt/archiver, assuming that going archive.Decompressor > archive.Extractor is the intended use. Running the code below against the already-decompressed .tar archive works without issue, so the problem seems to lie in handling the .gz layer and then the .tar.

Code

package main

import (
    "context"
    "fmt"
    "io"
    "os"

    "github.com/mholt/archiver/v4"
    // Assumed import path for TruffleHog's readers package.
    "github.com/trufflesecurity/trufflehog/v3/pkg/readers"
)

func main() {
    filename := "licenses.db"
    input, ferr := os.Open("/tmp/" + filename)
    if ferr != nil {
        fmt.Printf("failed to open file: ")
        panic(ferr)
    }
    defer input.Close()

    //rdr := input
    rdr, rErr := readers.NewBufferedFileReader(input)
    if rErr != nil {
        fmt.Printf("failed to create file reader")
        panic(rErr)
    }
    run2(filename, rdr)
}

// really polished code
// run2 identifies the format of rdr, then either decompresses it and
// recurses, or extracts it and prints each entry name.
func run2(filename string, rdr io.Reader) io.Reader {
    // Note: the reader returned by Identify (which replays any bytes
    // consumed while sniffing the format) is discarded here.
    format, _, ferr := archiver.Identify(filename, rdr)
    if ferr != nil {
        fmt.Printf("failed to identify file")
        panic(ferr)
    }
    fmt.Printf("Archive format is: %v\n", format.Name())

    switch archive := format.(type) {
    case archiver.Decompressor:
        fmt.Printf("Archive is: Decompressor\n")

        compReader, err := archive.OpenReader(rdr)
        if err != nil {
            panic(fmt.Errorf("error opening decompressor with format: %s %w", format.Name(), err))
        }

        // Recurse on the decompressed stream to handle the inner archive.
        run2(filename, compReader)
    case archiver.Extractor:
        fmt.Printf("Archive is: Extractor\n")

        handler := func(ctx context.Context, f archiver.File) error {
            fmt.Printf("Handling file: %s\n", f.Name())
            return nil
        }

        eErr := archive.Extract(context.Background(), rdr, nil, handler)
        fmt.Printf("archive.Extract error: %v\n", eErr)
        return nil
    default:
        fmt.Printf("Archive is unknown: %v\n", archive.Name())
        return nil
    }

    return nil
}

Logs

Archive format is: .tar.gz
Archive is: Decompressor
Archive format is: .tar
Archive is: Extractor
archive.Extract error: archive/tar: invalid tar header
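
A general pitfall with sniffing a format from a non-seekable stream may be relevant to the `invalid tar header` result above, although nothing in this thread confirms it as the cause. Identifying a format consumes the leading bytes, so whatever reads the stream next must get those bytes replayed. The stdlib-only sketch below illustrates the idea; `sniff` and `readAfterSniff` are hypothetical names, not archiver functions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// sniff reads up to n bytes from r to inspect its header, and returns a
// reader that replays those bytes before continuing with the rest of r.
func sniff(r io.Reader, n int) ([]byte, io.Reader, error) {
	head := make([]byte, n)
	read, err := io.ReadFull(r, head)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, nil, err
	}
	head = head[:read]
	return head, io.MultiReader(bytes.NewReader(head), r), nil
}

// readAfterSniff sniffs 6 bytes and then reads the remainder, either from
// the replaying reader or from the original (already advanced) stream.
func readAfterSniff(data string, useReplay bool) (string, error) {
	var src io.Reader = strings.NewReader(data)
	_, replay, err := sniff(src, 6)
	if err != nil {
		return "", err
	}
	if useReplay {
		src = replay
	}
	out, err := io.ReadAll(src)
	return string(out), err
}

func main() {
	withReplay, _ := readAfterSniff("HEADERpayload", true)
	without, _ := readAfterSniff("HEADERpayload", false)
	fmt.Println(withReplay) // HEADERpayload
	fmt.Println(without)    // payload
}
```

Reading from the original stream after sniffing loses the header bytes, which is the kind of input a tar parser would reject.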

rgmz commented 2 months ago

> assuming that going archive.Decompressor > archive.Extractor is the intended use.

The .tar.gz file is of type archiver.CompressedArchive. It seems that CompressedArchive implements both archiver.Decompressor and archiver.Extractor.

This means that it matches both cases in the switch, and the type switch picks Decompressor because that case is listed first.

Works

switch archive := format.(type) {
case archiver.Extractor:
  ...
case archiver.Decompressor:
  ...
default:
  ...
}

Does not work

switch archive := format.(type) {
case archiver.Decompressor:
  ...
case archiver.Extractor:
  ...
default:
  ...
}
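
The ordering effect comes from Go itself and can be reproduced without archiver: a type switch takes the first case whose type the value satisfies. In the sketch below, `decompressor`, `extractor`, and `compressedArchive` are simplified stand-ins with made-up method sets, not the real archiver interfaces.

```go
package main

import "fmt"

// Stand-ins for archiver.Decompressor and archiver.Extractor.
// The method sets are hypothetical and reduced to one method each.
type decompressor interface{ OpenReader() string }
type extractor interface{ Extract() string }

// compressedArchive satisfies both interfaces, like a .tar.gz format.
type compressedArchive struct{}

func (compressedArchive) OpenReader() string { return "decompressed stream" }
func (compressedArchive) Extract() string    { return "extracted files" }

// decompressorFirst lists the decompressor case first.
func decompressorFirst(format interface{}) string {
	switch format.(type) {
	case decompressor:
		return "Decompressor"
	case extractor:
		return "Extractor"
	default:
		return "unknown"
	}
}

// extractorFirst lists the extractor case first.
func extractorFirst(format interface{}) string {
	switch format.(type) {
	case extractor:
		return "Extractor"
	case decompressor:
		return "Decompressor"
	default:
		return "unknown"
	}
}

func main() {
	f := compressedArchive{}
	fmt.Println(decompressorFirst(f)) // Decompressor
	fmt.Println(extractorFirst(f))    // Extractor
}
```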

HOWEVER, all this means is that the issue is likely with TruffleHog, not mholt/archiver. Changing the order in archive.go does not prevent the reported error.

The cause could be related to https://github.com/trufflesecurity/trufflehog/issues/2927#issuecomment-2155320167.

rgmz commented 2 months ago

Testing based on #2943, this error seems to occur when:

  1. archive.HandleFile is called on licenses.db https://github.com/trufflesecurity/trufflehog/blob/440398815128e1f066ba5b49e65b1c0c3ecae200/pkg/handlers/archive.go#L64-L66
  2. archive.openArchive calls extractorHandler https://github.com/trufflesecurity/trufflehog/blob/440398815128e1f066ba5b49e65b1c0c3ecae200/pkg/handlers/archive.go#L113-L114
  3. extractorHandler calls newFileReader for the file X11.txt https://github.com/trufflesecurity/trufflehog/blob/440398815128e1f066ba5b49e65b1c0c3ecae200/pkg/handlers/archive.go#L186-L187
  4. newFileReader calls archiver.Identify https://github.com/trufflesecurity/trufflehog/blob/440398815128e1f066ba5b49e65b1c0c3ecae200/pkg/handlers/handlers.go#L67-L68
  5. archiver.Identify returns an error other than archiver.ErrNoMatch, for some reason, which triggers the default case https://github.com/trufflesecurity/trufflehog/blob/440398815128e1f066ba5b49e65b1c0c3ecae200/pkg/handlers/handlers.go#L81-L82

It appears that there's an error with the upstream archiver library. (I originally suspected TruffleHog's logic here, but had not confirmed that.)

Here's my reproducer test code so far, if anyone else wants to look at this.

test.go

```go
import (
    "context"
    "fmt"
    "io"
    "os"
    "testing"

    "github.com/go-errors/errors"
    "github.com/mholt/archiver/v4"
    "github.com/stretchr/testify/require"

    "github.com/trufflesecurity/trufflehog/v3/pkg/handlers"
)

func TestTarGz(t *testing.T) {
    r := map[string]func() io.Reader{
        "os.Open": func() io.Reader {
            f, err := os.Open("/tmp/licenses.db")
            require.NoError(t, err)
            return f
        },
        "BufferedFileReader": func() io.Reader {
            f, err := os.Open("/tmp/licenses.db")
            require.NoError(t, err)
            // I copied `handlers.newFileReader` as `NewFileReader`
            f2, err := handlers.NewFileReader("/tmp/licenses.db", f)
            require.NoError(t, err)
            return f2
        },
    }
    for name, reader := range r {
        t.Run(name, func(t *testing.T) {
            files := make([]string, 0)
            handleArchive(t, "", reader(), &files)
            require.Equal(t, 356, len(files))
        })
    }
}

// This is meant to follow the logic of `archive.openArchive`.
func handleArchive(t *testing.T, filename string, rdr io.Reader, files *[]string) io.Reader {
    format, rdr2, err := archiver.Identify(filename, rdr)
    require.NoError(t, err)

    switch archive := format.(type) {
    case archiver.Extractor:
        handler := func(ctx context.Context, file archiver.File) error {
            *files = append(*files, file.Name())

            f, err := file.Open()
            require.NoError(t, err)
            defer f.Close()

            format, _, err := archiver.Identify(file.Name(), f)
            if err == nil {
                fmt.Printf("File '%s' is format '%s'\n", file.Name(), format.Name())
            } else if errors.Is(err, archiver.ErrNoMatch) {
                //fmt.Printf("File '%s' is not an archive\n", file.Name())
            } else {
                t.Errorf("Error identifying '%s' format: %v\n", file.Name(), err)
            }
            return nil
        }
        err := archive.Extract(context.Background(), rdr2, nil, handler)
        require.NoError(t, err)
        return nil
    case archiver.Decompressor:
        compReader, err := archive.OpenReader(rdr2)
        require.NoError(t, err)
        return handleArchive(t, filename, compReader, files)
    default:
        t.Errorf("Archive is unknown: %v\n", archive.Name())
        return nil
    }
}
```

rgmz commented 2 months ago

Fixed by #2959.