trufflesecurity / trufflehog

Find, verify, and analyze leaked credentials
https://trufflesecurity.com
GNU Affero General Public License v3.0
16.83k stars 1.7k forks source link

Match optimization bug if keyword isn't a prefix #2949

Open rgmz opened 4 months ago

rgmz commented 4 months ago

Please review the Community Note before submitting

TruffleHog Version

https://github.com/trufflesecurity/trufflehog/commit/4e21590cbe895b0796acec8d3204f9f9013d9d5e

Description

The optimizations introduced in #2812 don't work as expected when multiple secrets are in the same chunk.

Given the following file you'd expect to receive two matches: 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 and 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. However, only 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 is detected.

#!/usr/bin/env python3

from dnslib import *
import argparse 
import socket 
from GeoInfo import GeoInfo
from geopy.distance import great_circle
import geoip2.webservice
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser(description='Run a DNS server')
parser.add_argument('-p', '--port', type=int, required=True,help='port number to listen on')
parser.add_argument('-n', '--name', required=True, help='name of the DNS server')
args = parser.parse_args()
MM_ACCOUNT_ID = 851210
MM_ACCOUNT_ID2 = 852363
MM_API_KEY = "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"
MM_API_KEY2 = "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"
MM_HOST = "geolite.info"
ACCOUNT_1 = (MM_ACCOUNT_ID2, MM_API_KEY2)
ACCOUNT_2 = (MM_ACCOUNT_ID, MM_API_KEY)
DEFAULT_ACCOUNT = ACCOUNT_1 # Prepare two set of credentials for the API service

Steps to Reproduce

This includes the following change to demonstrate what the detector is receiving.

# pkg/detectors/maxmindlicense/v2/maxmindlicense_v2.go
 func (s Scanner) FromData(ctx context.Context, verify bool, data []byte) (results []detectors.Result, err error) {
    dataStr := string(data)
+     fmt.Printf("Data is: %s\n", dataStr)

Scan using the new matching optimization

$ ./trufflehog filesystem example.txt
🐷🔑🐷  TruffleHog. Unearth your secrets. 🐷🔑🐷

2024-06-09T11:40:39-04:00       info-0  trufflehog      running source  {"source_manager_worker_id": "NpZk4", "with_units": true}
Data is: _mmk"
MM_API_KEY2 = "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"
MM_HOST = "geolite.info"
ACCOUNT_1 = (MM_ACCOUNT_ID2, MM_API_KEY2)
ACCOUNT_2 = (MM_ACCOUNT_ID, MM_API_KEY)
DEFAULT_ACCOUNT = ACCOUNT_1 # Prepare two set of credentials for the API service

Data is: _mmk"
MM_API_KEY2 = "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"
MM_HOST = "geolite.info"
ACCOUNT_1 = (MM_ACCOUNT_ID2, MM_API_KEY2)
ACCOUNT_2 = (MM_ACCOUNT_ID, MM_API_KEY)
DEFAULT_ACCOUNT = ACCOUNT_1 # Prepare two set of credentials for the API service

Found unverified result 🐷🔑❓
Detector Type: MaxMindLicense
Decoder Type: PLAIN
Raw result: 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
Rotation_guide: https://howtorotate.com/docs/tutorials/maxmind/
Version: 2
File: example.txt
Line: 17

Scan the entire chunk

$ ./trufflehog filesystem example.txt --scan-entire-chunk
🐷🔑🐷  TruffleHog. Unearth your secrets. 🐷🔑🐷

2024-06-09T11:44:27-04:00       info-0  trufflehog      running source  {"source_manager_worker_id": "tK7mw", "with_units": true}
Data is: #!/usr/bin/env python3

from dnslib import *
import argparse
import socket
from GeoInfo import GeoInfo
from geopy.distance import great_circle
import geoip2.webservice
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser(description='Run a DNS server')
parser.add_argument('-p', '--port', type=int, required=True,help='port number to listen on')
parser.add_argument('-n', '--name', required=True, help='name of the DNS server')
args = parser.parse_args()
MM_ACCOUNT_ID = 851210
MM_ACCOUNT_ID2 = 852363
MM_API_KEY = "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"
MM_API_KEY2 = "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"
MM_HOST = "geolite.info"
ACCOUNT_1 = (MM_ACCOUNT_ID2, MM_API_KEY2)
ACCOUNT_2 = (MM_ACCOUNT_ID, MM_API_KEY)
DEFAULT_ACCOUNT = ACCOUNT_1 # Prepare two set of credentials for the API service

Data is: #!/usr/bin/env python3

from dnslib import *
import argparse
import socket
from GeoInfo import GeoInfo
from geopy.distance import great_circle
import geoip2.webservice
from concurrent.futures import ThreadPoolExecutor

parser = argparse.ArgumentParser(description='Run a DNS server')
parser.add_argument('-p', '--port', type=int, required=True,help='port number to listen on')
parser.add_argument('-n', '--name', required=True, help='name of the DNS server')
args = parser.parse_args()
MM_ACCOUNT_ID = 851210
MM_ACCOUNT_ID2 = 852363
MM_API_KEY = "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"
MM_API_KEY2 = "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"
MM_HOST = "geolite.info"
ACCOUNT_1 = (MM_ACCOUNT_ID2, MM_API_KEY2)
ACCOUNT_2 = (MM_ACCOUNT_ID, MM_API_KEY)
DEFAULT_ACCOUNT = ACCOUNT_1 # Prepare two set of credentials for the API service

Found unverified result 🐷🔑❓
Detector Type: MaxMindLicense
Decoder Type: PLAIN
Raw result: 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
Rotation_guide: https://howtorotate.com/docs/tutorials/maxmind/
Version: 2
File: example.txt
Line: 16

Found unverified result 🐷🔑❓
Detector Type: MaxMindLicense
Decoder Type: PLAIN
Raw result: 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
Rotation_guide: https://howtorotate.com/docs/tutorials/maxmind/
Version: 2
File: example.txt
Line: 17

Environment

N/A

Additional Context

N/A

References

N/A

rgmz commented 4 months ago

FYI @dustin-decker @ahrav

ahrav commented 4 months ago

Let me take a closer look. My initial test finds both secrets 🤔 Screenshot 2024-06-09 at 9 14 58 AM

rgmz commented 4 months ago

Try testing with #2948. That changes the keywords to _mmk, which appears at the end of the key.

rgmz commented 4 months ago

I suspect this will occur for any secret where the keywords aren't at the beginning (e.g., OpenAI.)

I'm in New York at the moment, I'll do some more troubleshooting when I get home.

ahrav commented 4 months ago

I suspect this will occur for any secret where the keywords aren't at the beginning (e.g., OpenAI.)

I'm in New York at the moment, I'll do some more troubleshooting when I get home.

Yes, that's exactly why. I have a couple of solutions to work around this limitation:

Proposal 1: Extend the span to include data both before and after the keyword, creating a radius around it instead of only looking forward. This assumes we have many detectors where the keyword follows the credential.

Proposal 2: Create a new interface that detectors can optionally implement, instructing the spanCalculator to look backwards instead of forwards.

I'm waiting for input from others before deciding. There might even be a better approach that I'm missing.

rgmz commented 4 months ago

Proposal 1: Extend the span to include data both before and after the keyword, creating a radius around it instead of only looking forward. This assumes we have many detectors where the keyword follows the credential.

This is probably the best solution in the interim. There are a number of detectors that rely on keywords that aren't at the exact beginning of the secret.

Examples GCP relies on provider_x509 which is found near the end of the JSON. Preceding attributes like private_key_id, private_key, client_email, and client_id can all be of arbitrary length. Right now, the detector only receives the final few lines of the JSON.

provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/seafood%40seafood-204620.iam.gserviceaccount.com"
}

2677 relies on auths to match a JSON structure. It's typically only a couple characters away from the starting {, but even those few missing characters will cause the detector to break.

Proposal 2: Create a new interface that detectors can optionally implement, instructing the spanCalculator to look backwards instead of forwards.

This is a good solution in the long-term. It could dovetail nicely into a solution for matching secrets against a known prefix or suffix.

Right now the prefixes are hard-coded in the regex, meaning that this gets detected:

curl -X GET https://api.snyk.io/ -u "api:7ec9d869-25ae-4347-a21d-c24556f06d67" 

but this does not:

curl -u "api:7ec9d869-25ae-4347-a21d-c24556f06d67" https://api.snyk.io/
dustin-decker commented 4 months ago

Proposal 2 seems ideal because it seems pretty rare to need to look backward

rgmz commented 4 months ago

Proposal 2 seems ideal because it seems pretty rare to need to look backward

It's quite common; I've encountered some variation of the example posted above countless times.

Secrets that have a distinct prefix are the only case where the possibility is entirely eliminated.

ahrav commented 4 months ago

I think https://github.com/trufflesecurity/trufflehog/pull/2957 should hopefully catch most of these cases.