richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Allow siegfried to follow symlinks #245

Closed: max-moser closed this issue 4 months ago

max-moser commented 9 months ago

Right now, siegfried refuses to scan symbolic links and reports the error "file is of type symlink; only regular files can be scanned":

mmoser@mx ~ $ ll stderr.log
Permissions Size User   Group  Date Modified Name
lrwxrwxrwx     - mmoser mmoser 27 Feb 13:54  stderr.log -> sf.log

mmoser@mx ~ $ ~/go/bin/sf stderr.log
[FILE] /home/mmoser/stderr.log
[ERROR] file is of type symlink; only regular files can be scanned
---
siegfried   : 1.10.1
scandate    : 2024-02-27T13:55:41+01:00
signature   : default.sig
created     : 2023-12-17T15:38:39+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V116.xml; container-signature-20231127.xml'
---
filename : 'stderr.log'
filesize : 6
modified : 2024-02-27T13:54:56+01:00
errors   : 'file is of type symlink; only regular files can be scanned'
matches  :

It would be nice to have an option (-s/-follow-symlinks?) that makes siegfried follow the symlink to the file to analyze.

max-moser commented 9 months ago

For us, the use case is that we have files on disk where the original filename has been "lost" from the FS perspective – they've been renamed to simply data, and the original filename is only available from another source. One way to go about this would be to create an appropriately named symlink to the actual file to analyze.

This can make the difference between correctly identifying a Python file vs. misidentifying it as Plain Text.

richardlehane commented 8 months ago

Thanks for this, Max. I'm not sure why this symlink block was put in; perhaps it was to stop recursion into symlinked directories, which might cause cycles?

Should be possible to add it.

A couple of questions:

max-moser commented 8 months ago

I can see the threat of endless cycles with symlinked directories, that could quickly (or rather slowly) spoil somebody's day :sweat_smile:

Regarding your questions:

max-moser commented 8 months ago

A feasible alternative in our case would actually be to honor the -name option when reading a file from disk rather than from stdin – effectively telling siegfried that "the name on disk is wrong, here's the proper one"

ross-spencer commented 8 months ago

More on the original rationale and potential to add symlink behavior back in: https://github.com/richardlehane/siegfried/issues/107#issuecomment-334128810

richardlehane commented 5 months ago

Hi Max, I've added a new -sym flag that you can use to follow symbolic links to files. To add it to your default configuration, run sf -setconf -sym. This will be available in the next release, out in about a week. If you'd like to test it, there's a release candidate here: https://github.com/richardlehane/siegfried/releases/tag/v1.11.1-rc4

max-moser commented 4 months ago

Awesome, thanks a lot!