New Source: HuggingFace

joeleonjr commented 2 weeks ago

Description:

This PR adds HuggingFace as a new source. Users will be able to scan an HF model, dataset, or space. An HF token is required for all scans except basic git scans of a public model, dataset, or space. That means all org/user enumeration, discussion/PR enumeration, private scanning, etc. require a token. Tokens are free, and rate limiting doesn't seem to come into play when using the API.

I added a couple of test files: one to cover the client functionality (since there is no Go package for HF) and one to cover the scanning logic. Coverage is pretty high on both, but not perfect.

Checklist:

[x] Tests passing (make test-community)?
[x] Lint passing (make lint; this requires golangci-lint)?

rgmz commented 1 week ago

Some public resources require sharing your information before you can view them. Idk if this is something that can be handled programmatically, or at least logged/documented.

2024-06-26T19:08:35-04:00       error   trufflehog      failed to fetch model   {"organization": "ibm", "model": "https://huggingface.co/ibm/testing-patchtst_etth1_forecast.git", "error": "access is restricted."}
2024-06-26T19:08:40-04:00       error   trufflehog      failed to fetch model   {"model": "https://huggingface.co/ibm/testing-patchtst_etth1_forecast.git", "error": "access is restricted."}

Addendum: Git LFS support (in the future) would be a tremendous benefit for Huggingface. The nature of the platform means there's lots of large files hosted external to Git.

e.g., https://huggingface.co/dmis-lab/biobert-v1.1/blob/main/flax_model.msgpack

joeleonjr commented 1 week ago

> Some public resources require sharing your information before you can view them. Idk if this is something that can be handled programmatically, or at least logged/documented.
>
> 2024-06-26T19:08:35-04:00       error   trufflehog      failed to fetch model   {"organization": "ibm", "model": "https://huggingface.co/ibm/testing-patchtst_etth1_forecast.git", "error": "access is restricted."}
> 2024-06-26T19:08:40-04:00       error   trufflehog      failed to fetch model   {"model": "https://huggingface.co/ibm/testing-patchtst_etth1_forecast.git", "error": "access is restricted."}
>
> Addendum: Git LFS support (in the future) would be a tremendous benefit for Huggingface. The nature of the platform means there's lots of large files hosted external to Git.
>
> e.g., https://huggingface.co/dmis-lab/biobert-v1.1/blob/main/flax_model.msgpack

I attempted to handle that with the “access is restricted” message. What did you envision?

Also, completely agree on LFS. But that’s a much bigger endeavor.

rgmz commented 1 week ago

> I attempted to handle that with the “access is restricted” message. What did you envision?

To me, "access is restricted" implies that you can't access those models. You can, you just need to click an "I agree" button. A call to action would make this clearer; the language in HuggingFace's prompt is perfect, actually.

Does the API have a specific error code or message for "you must share your contact information"?

joeleonjr commented 1 week ago

https://huggingface.co/ibm/testing-patchtst_etth1_forecast

> > I attempted to handle that with the “access is restricted” message. What did you envision?
>
> To me, "access is restricted" implies that you can't access those models. You can, you just need to click an "I agree" button. A call to action would make this clearer; the language in HuggingFace's prompt is perfect, actually.
>
> Does the API have a specific error code or message for "you must share your contact information"?

I hear what you're saying. The challenge is that there are situations where you have to wait for an org to approve your request. The error message is the same for both types: {"error":"Access to model ibm/testing-patchtst_etth1_forecast is restricted. You must be authenticated to access it."} So afaik there's no easy/reliable way to differentiate between models that require a simple click and models that require waiting for admin approval. Also, I'm not sure we'd want to click the agree button on behalf of users' accounts.

rgmz commented 1 week ago

> Does the API have a specific error code or message for "you must share your contact information"?

It turns out that the API contains a property for gated models ("gated": "auto" | true | false). However, you can't see that until you have access. 😩

> The error message is the same for both types: {"error":"Access to model ibm/testing-patchtst_etth1_forecast is restricted. You must be authenticated to access it."}

That's only for unauthenticated requests. There seem to be three different types of errors.

1. It's private and you don't have access.
```sh
HTTP/2 404

{"error":"Repository not found"}
```

2. It's gated and your request isn't authenticated or the auth is invalid.
```sh
# For some reason, excluding the 
$ curl -i "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B" -H "Authorization: Bearer hf_fake"
HTTP/2 401

{"error":"Access to model meta-llama/Meta-Llama-3-8B is restricted. You must be authenticated to access it."}
```

3. It's gated and your request is authenticated.
```sh
$ curl -i "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B" -H "Authorization: Bearer $TOKEN"
HTTP/2 403

{"error":"Access to model meta-llama/Meta-Llama-3-8B is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B to ask for access."}
```

joeleonjr commented 1 week ago

@rgmz I just pushed a change so that 403 error messages provide clearer information. 401s will result in an "API Key is Invalid" type error message. Lmk if you think that language is sufficient.

zricethezav commented 1 week ago

@joeleonjr PR looking good. One thing that did stand out: if you run the ./trufflehog huggingface command with no arguments, the program will continue and give users a false sense of scanning something. We need to check that at least one of model, space, dataset, org, or user is set.

joeleonjr commented 1 week ago

> @joeleonjr PR looking good. One thing that did stand out: if you run the ./trufflehog huggingface command with no arguments, the program will continue and give users a false sense of scanning something. We need to check that at least one of model, space, dataset, org, or user is set.

Done. I followed the same logic used for GitHub.

trufflesecurity / trufflehog

New Source: HuggingFace #3000
