thoughtworks / talisman

Using a pre-commit hook, Talisman validates the outgoing changeset for things that look suspicious — such as tokens, passwords, and private keys.
https://thoughtworks.github.io/talisman/
MIT License

Filesize zero error while scanning the git repo with talisman #374

Open venkatn087 opened 2 years ago

venkatn087 commented 2 years ago

Hi Team,

I am getting an error while scanning my GitHub repo, as shown below. Could you please let me know why I am getting this error?

ERRO[0011] error reading filesize: EOF
(the same line repeated ten times in total)

thanks venkat

jmatias commented 2 years ago

Can you share the steps to reproduce the error?

svishwanath-tw commented 2 years ago

@venkatn087: What version of talisman are you using? What OS and ARCH?

Are there size-0 files in the repository?

Can you set TALISMAN_DEBUG or execute the talisman binary with the --debug flag and show us the output? You can paste it in a public/private gist and give us (@jmatias or @svishwanath-tw) access to it. I suggest that you sanitize (clean/remove) any sensitive information in the output first.

nasreenwahab commented 2 years ago

@venkatn087 @svishwanath-tw I am also getting an out-of-memory error when running the scan.

svishwanath-tw commented 2 years ago

@nasreenwahab : Thanks for the update. I'm not convinced you and @venkatn087 are facing the same underlying issue.

From what little I've worked on the codebase, I remember that files once read remain in memory. IMO, to fix this issue, a re-architecting of the way file (git object) data is read and streamed through various detectors is warranted. This might have OS specific repercussions and I worry the fix won't be quick.

I have some follow-up questions to plan the next course of action: How many files/objects are in your code base? How many commits? How many files per commit? What is the average/median file size? What is the size of the largest object in the repo?

The scan feature was built in haste, and its repercussions are showing now.

svishwanath-tw commented 2 years ago

Additional complications to factor in: if the decision is to go with blocked reads of files, how do we deal with secrets at the edges of a block? This shouldn't (usually) be a problem with source code, unless talisman is being used to scan individual files with more than a hundred thousand lines of code (100K is an arbitrary number).

How many files should we maintain in memory? One early request from promoters of talisman within Thoughtworks was to display the actual secret in the output (which was later limited to up to 50 characters with a "..." overflow indicator). I believe this is the reason blobs are held in memory after the detector chain execution completes.

A (comparatively) recent refactoring I made prevented git sub-process proliferation by using a single sub-process to read contents via the batch mode of git cat-file. This replaced the earlier just-in-time file IO (spawning a git cat-file at will for each detector/detector type).

The number of variables to consider is huge, and I believe going forward without data is not going to yield good results.

@nasreenwahab, @venkatn087, @jmatias, @harinee

svishwanath-tw commented 2 years ago

The refactoring was an attempt to quash some annoying race condition bugs reported and analysed differently by various users of talisman both internal and external to thoughtworks.

Some related issues and PRs: https://github.com/thoughtworks/talisman/issues/181 https://github.com/thoughtworks/talisman/issues/203 https://github.com/thoughtworks/talisman/pull/270

svishwanath-tw commented 2 years ago

@jmatias : FYI