thoughtworks / talisman

Using a pre-commit hook, Talisman validates the outgoing changeset for things that look suspicious — such as tokens, passwords, and private keys.
https://thoughtworks.github.io/talisman/
MIT License

Filesize zero error while scanning the git repo with talisman #374

Open venkatn087 opened 2 years ago

venkatn087 commented 2 years ago

Hi Team,

I am getting an error while scanning my GitHub repo, as shown below. Could you please let me know why I am getting this error?

ERRO[0011] error reading filesize: EOF
(the same line repeated ten times in total)

thanks venkat

jmatias commented 2 years ago

Can you share the steps to reproduce the error?

svishwanath-tw commented 2 years ago

@venkatn087: What version of talisman are you using? What OS and ARCH?

Are there size-0 files in the repository?

Can you set TALISMAN_DEBUG or execute the talisman binary with the --debug flag and show us the output? You can paste it in a public/private gist and give us (@jmatias or @svishwanath-tw) access to it. I suggest that you sanitize (clean/remove) any sensitive information in the output first.

nasreenwahab commented 2 years ago

@venkatn087 @svishwanath-tw I am also getting an out-of-memory error when running the scan.

svishwanath-tw commented 2 years ago

@nasreenwahab : Thanks for the update. I'm not convinced you and @venkatn087 are facing the same underlying issue.

From what little I've worked on the codebase, I remember that files once read remain in memory. IMO, to fix this issue, a re-architecting of the way file (git object) data is read and streamed through various detectors is warranted. This might have OS specific repercussions and I worry the fix won't be quick.

I have some follow-up questions to plan the next course of action: How many files/objects are in your code base? How many commits? How many files per commit? What is the average/median file size? What is the size of the largest object in the repo?

The scan feature was built in haste, and its repercussions are showing now.

svishwanath-tw commented 2 years ago

Additional complications to factor in: if the decision is to go with blocked reads of files, how do we deal with secrets at the edges of a block? This shouldn't (usually) be a problem with source code, unless talisman is being used to scan individual files with more than a hundred thousand lines of code (100K is an arbitrary number).

How many files should we maintain in memory? One early request from promoters of talisman within Thoughtworks was to display the actual secret in the output (which was later limited to up to 50 characters with a "..." overflow indicator). I believe this is the reason blobs are held in memory after the detector chain execution completes.

A (comparatively) recent refactoring I made prevented git sub-process proliferation by using a single sub-process to read contents via the batch mode of git cat-file. This replaced the earlier just-in-time file IO (spawning a git cat-file at will for each detector/detector type).

The number of variables to consider is huge, and I believe going forward without data is not going to yield good results.

@nasreenwahab, @venkatn087, @jmatias, @harinee

svishwanath-tw commented 2 years ago

The refactoring was an attempt to quash some annoying race condition bugs reported and analysed differently by various users of talisman both internal and external to thoughtworks.

Some related issues and PRs: https://github.com/thoughtworks/talisman/issues/181 https://github.com/thoughtworks/talisman/issues/203 https://github.com/thoughtworks/talisman/pull/270

svishwanath-tw commented 2 years ago

@jmatias : FYI