richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0
214 stars 30 forks source link

FATAL bad glob pattern: syntax error in pattern #227

Closed gleporeNARA closed 1 year ago

gleporeNARA commented 1 year ago

I see a recent commit related to this error message. I am now getting this on my KDE Neon computer for files with brackets in the names. I am also seeing behavior where these files are being silently skipped. Setting continue on errors does not seem to work. See below:

upstairs:~/data/books/f/temp$ ls -l

-rw-rw-r-- 1 lepore lepore 3919317 Apr 10 11:50 'example - [1.epub' -rw-rw-r-- 1 lepore lepore 3919317 Apr 10 11:50 'example - 1.epub' -rw-rw-r-- 1 lepore lepore 3919317 Apr 10 11:50 'example - 1].epub'

upstairs:~/data/books/f/temp$ sf * 2023/04/10 11:51:26 [FATAL] bad glob pattern: syntax error in pattern

upstairs:~/data/books/f/temp$ sf example\ -\ 1.epub filename,filesize,modified,errors,namespace,id,format,version,mime,class,basis,warning example - 1.epub,3919317,2023-04-10T11:50:42-04:00,,pronom,fmt/483,ePub format,,application/epub+zip,Text (Structured),"extension match epub; container name mimetype with byte match at 0, 20",

upstairs:~/data/books/f/temp$ sf -coe * 2023/04/10 11:57:06 [FATAL] bad glob pattern: syntax error in pattern

gleporeNARA commented 1 year ago

This is running the siegfried_1-10-0_linux64_NO_CGO.zip version, by the way.

ross-spencer commented 1 year ago

I can confirm this behavior @gleporeNARA. See below for a brief discussion and a workaround.

Linux - doesn't work unless quoted

spencer@rvarchivespencer2:~/git/richardlehane/siegfried/cmd/sf$ ./sf -coe greg/*
2023/04/11 11:52:13 [FATAL] bad glob pattern: syntax error in pattern

But, this pattern should be valid, e.g. with $ file:

spencer@rvarchivespencer2:~/git/richardlehane/siegfried/cmd/sf$ file greg/*
greg/123[123].file:     empty
greg/[1.epub:           empty
greg/example:           empty
greg/example - [1.epub: empty
greg/example - 1.epub:  empty
greg/example - 1].epub: empty

Workaround

A workaround seems to be to quote your command: e.g. ./sf -coe "greg/*"

---
siegfried   : 1.10.0
scandate    : 2023-04-11T11:46:12+02:00
signature   : default.sig
created     : 2023-03-23T15:09:43Z
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; container-signature-20230307.xml'
---
filename : 'greg/example - [1.epub'
filesize : 0
modified : 2023-04-11T11:29:50+02:00
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  :
    version :
    mime    :
    class   :
    basis   :
    warning : 'no match; possibilities based on extension are fmt/483'
---
...
...
...

But this isn't really good enough as it would have just worked previously on `nix systems.

Windows - just works

Which brings me to Windows where it just works...

C:\temp\siegfried\cmd\sf>sf greg/*
---
siegfried   : 1.10.0
scandate    : 2023-04-11T11:49:37+02:00
signature   : default.sig
created     : 2023-03-31T11:53:24+02:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; no container signatures; built without reports'
---
filename : 'greg\example - [1.epub'
filesize : 5
modified : 2023-04-11T11:41:21+02:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/111'
    format  :
    version :
    mime    :
    class   :
    basis   : 'text match ASCII'
    warning : 'match on text only'
---
...
...
...

A regression on Linux, and improvement on Windows?

Seems that way unfortunately 😞

Option 1: Conditionally compile?

Is this another piece of code that can benefit from conditionally compiling for Windows or Linux? (precedent is the long_path fix in the same region of the code).

Option 2: Something else in code?

Is there some pre-processing of the glob patterns being missed? Some standard pattern for enabling glob that works cross-platform? I don't know what that would look like. I haven't been able to find anything in searches elaborating.

Option 3: Revert the code?

Simplest approach, but takes Windows back a step.

My vote is 1 @richardlehane and @prettybits, but what do you think?


NB, additional notes are that the square brackets are valid glob characters so that seems to be where this is causing the confusion: https://teaching.idallen.com/cst8207/15w/notes/190_glob_patterns.html#shell-glob-metacharacters

gleporeNARA commented 1 year ago

Quoting seems to be a workaround for asterisk patterns, but not for single files with brackets and spaces. See below:

$ sf "example - [1.epub" 2023/04/11 07:30:08 [FATAL] bad glob pattern: syntax error in pattern

But does work for files with brackets and no spaces:

sf "example-[1.epub" filename,filesize,modified,errors,namespace,id,format,version,mime,class,basis,warning example-[1.epub,3919317,2023-04-10T11:50:42-04:00,,pronom,fmt/483,ePub format,,application/epub+zip,Text (Structured),"extension match epub; container name mimetype with byte match at 0, 20",

ross-spencer commented 1 year ago

Quoting seems to be a workaround for asterisk patterns, but not for single files with brackets and spaces.

I can confirm that too:

spencer@rvarchivespencer2:~/git/richardlehane/siegfried/cmd/sf/greg$ ../sf "example - [1.epub"
2023/04/11 13:43:55 [FATAL] bad glob pattern: syntax error in pattern

With some wrangling you can get Ubuntu, for example, to describe the exact escape sequences required:

spencer@rvarchivespencer2:~/git/richardlehane/siegfried/cmd/sf/greg$ ../sf "example\ -\ \[1.epub"
[FILE] /home/spencer/git/richardlehane/siegfried/cmd/sf/greg/example - [1.epub
[ERROR] empty source
---
siegfried   : 1.10.0
scandate    : 2023-04-11T13:43:13+02:00
signature   : default.sig
created     : 2023-03-23T15:09:43Z
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; container-signature-20230307.xml'
---
filename : 'example - [1.epub'
filesize : 0
modified : 2023-04-11T11:29:50+02:00
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  :
    version :
    mime    :
    class   :
    basis   :
    warning : 'no match; possibilities based on extension are fmt/483'

Again, it's not really acceptable, and a different pattern from other Linux input. Hopefully we can deliver another fix quickly.

prettybits commented 1 year ago

After having a closer look, I see a few issues here:

  1. the error message when a bad pattern is encountered should mention the failing pattern, e.g. in this case bad glob pattern: 'example - [1.epub'
  2. with the current behavior when glob expansion is handled by the shell (e.g. an unquoted asterisk on Unix shells) and some of the expanded files contain an opening square bracket execution aborts on the first such file instead of at least attempting the remaining files
  3. given that [ is a valid filename character necessitates IMO that one of two things be done:
    • check if an argument that is a bad pattern is actually an existing file/directory and treat that as the single match
    • put the glob expansion capability behind a flag like -glob, which is more explicit about the intent for how to treat arguments (e.g. fd chose that option, even if that might not be the best comparison given the different usage pattern)

I would slightly prefer the second option as that doesn't need an additional stat call for every bad pattern and as I generally prefer explicitness. Both would fall under "Option 2" per Ross' comment above, so that would be my vote. ^^

With some wrangling you can get Ubuntu, for example, to describe the exact escape sequences required: [...] sf "example\ -\ \[1.epub"

Given that you have the escape sequences inside quotes that is handled by Go's pattern syntax. But as you say, this is an unexpected and unusual requirement and shouldn't be needed.

I think a fix would be a natural extension of my PR #225. If my observations sound sensible I could build on that in separate PR once we have agreement on a way forward. What do you think?

prettybits commented 1 year ago
  1. given that [ is a valid filename character necessitates IMO that one of two things be done:
    • check if an argument that is a bad pattern is actually an existing file/directory and treat that as the single match

Thinking some more about it I'd say my first idea wouldn't really solve this fully either. Since on Linux filesystems the allowed character set is more expansive and includes ? and * I could have files called file1.txt and file?.txt. Calling sf file?.txt would expand in the shell and call siegfried for both files, but quoting it (sf "file?.txt") will still not let me identify just that file without additional escaping as that too will be expanded, this time internally as it's not a bad pattern.

So I'd say putting this behind a -glob flag would be the best option at least on Unix environments. Leaving the globbing as default behaviour on Windows without a flag (i.e. "Option 1") could be done, but that would then need the extra check I proposed for files with [ in them I think, and it would be inconsistent across platforms which I'm not sure is worth it?

EDIT: I just realized that even on Windows you can have unexpected cases with both square brackets allowed, given files file[abc].txt and filea.txt, sf file[abc].txt wouldn't be a bad pattern but match filea.txt. Since before the glob support change siegfried always matched names exactly I'd count that as another point towards putting it behind a CLI flag for all platforms.

richardlehane commented 1 year ago

if the glob feature is redundant on posix systems (because they are already getting glob expansion from their shell), and the change in 1.10 is really just a quality of life improvement for windows users, I'd vote for the conditional build approach + double-check for literal matches on bad glob pattern errors...

i.e. this option:

Leaving the globbing as default behaviour on Windows without a flag (i.e. "Option 1") could be done, but that would then need the extra check I proposed for files with [ in them I think, and it would be inconsistent across platforms which I'm not sure is worth it?

@prettybits you make a solid argument for a -glob flag but it would be a flag just for windows users (would posix users get any benefit if their shell is already doing glob expansion?); may complicate documenting the feature (i.e. if you are in a posix shell you can do sf .jpg but if you are in a windows terminal you need to do sf -glob .jpg); and generally speaking I think the fewer flags the better (lower cognitive load)

In terms of the risk of a windows user wanting a single literal match for sf file[abc].txt but getting unwanted glob expansion, could they work around by doing sf "file[abc].txt"?

sidenote: longpath.go and longpath_windows.go should be renamed to identify.go and identify_windows.go as the current filenames are pretty opaque as to the file contents

prettybits commented 1 year ago

For POSIX shells I agree that the feature is redundant for interactive usage there. I can imagine an explicit -glob flag still being useful when creating a siegfried process from other programs where the globbing could be delegated this way, but that's a minor advantage I think.

[...] I'd vote for the conditional build approach + double-check for literal matches on bad glob pattern errors...

Checking for a literal match when the pattern was fine but nothing matched would also be good in that case, which is also what Bourne shells like bash do by default, but:

In terms of the risk of a windows user wanting a single literal match for sf file[abc].txt but getting unwanted glob expansion, could they work around by doing sf "file[abc].txt"?

This would make no difference as the argument string will be the same and the expansion would still happen within siegfried with potentially unwanted results, so that's a behavior change still. Of course, should I be missing some special behavior on Windows here feel free to correct me, that environment is always good for surprises. :)

I definitely understand the concern about more options meaning higher cognitive load and documentation being tricky if the implementation differs by platform, although by now I'm having a hard time seeing a clean alternative?

richardlehane commented 1 year ago

what about if we reverse the behaviour and assume a literal path first, then, only on a bad path error, try the glob expansion?

This would seem to fix all the bugs (?), could be cross platform, has no perf impact for the happy path (which was the pre 1.10 state) & wd only slightly slow down windows users using this new facility of glob paths (as they'd get that initial failed attempt at a file open)

prettybits commented 1 year ago

That would fix the regression, but then the globbing wouldn't work in all cases anymore, i.e. given files file[ab]c.txt, fileac.txt, filebc.txt, and filecc.txt you couldn't match just fileac.txt and filebc.txt on Windows anymore.

While I suppose that would be a limitation that is rarely hit in practice it would still be surprising when it happens.

ross-spencer commented 1 year ago

@prettybits my perspective on good practice here is, 1) one has to start with fixing the regression. This resets the state for everyone including Windows. Then 2) we have to find a solution for Windows that doesn't impact users previous expectations on other platforms. Ideally, it looks and feels like it did for Linux users previously (see below), if it doesn't then other solutions may be explored, like a Windows flag.

Were I to start out this feature again, I would start with a user story that asked for the Windows globbing experience to work like that on Linux. How would I go about addressing that? If I were to add a flag for Windows then it wouldn't pass my own test in the user story but it might be a necessary evil (I don't think it is though), and if it broke 'nix systems too, then it also wouldn't pass the expectations of other users, and good practice.

it would still be surprising when it happens.

Definitely, but so was the Siegfried experience 1.9 and before!

prettybits commented 1 year ago

@ross-spencer Of course, I entirely agree that fixing the regression is the first priority, it wasn't my intention to block that by searching for a robust overall solution, sorry if it came across that way. I only wanted to point out that Richard's proposal also wouldn't solve all the bugs yet, and I wasn't sure how you'd feel about in turn regressing your original aim.

But since it sounds like you are alright with that as an interim solution I'd say

reverse the behaviour and assume a literal path first, then, only on a bad path error, try the glob expansion?

sounds like a workable approach (EDIT: with the expansion fallback limited to Windows)? After that, we could then continue discussing the remaining issues, how does that sound?

ross-spencer commented 1 year ago

I'm looking forward to helping try and iron out anything left over, but this sounds pretty good to me @prettybits!

richardlehane commented 1 year ago

resolved with v1.10.1 release, please re-open this if any further issues

mjgiarlo commented 1 year ago

@richardlehane Upgraded to 1.10.1 yesterday and it works like a charm. Thanks much!

gleporeNARA commented 1 year ago

Can also confirm this has fixed the problem. Thanks!