richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0
225 stars 30 forks source link

Partially revert glob support to Windows-only fallback #229

Closed prettybits closed 1 year ago

prettybits commented 1 year ago

This implements the approach discussed in #227 to remedy the reported regression by reverting to treating arguments as literal paths first and only on Windows attempting to match glob patterns when no direct match has been found.

I decided to implement the Windows-only behavior by checking against runtime.GOOS which is statically compiled into the executable as well and made more sense to me in this context than splitting this up across files. If you prefer I can certainly do that too. Also, to reduce the cases where a matching is attempted a bit I added a check for the relevant characters that could mark a valid pattern in the name.

I retained the changed matches/match variable names from #225 (which this PR subsumes) as you indicated you were fine with it, hope I got that right.

As already discussed, this won't solve all of the potentially surprising cases but will help for the most urgent and reported ones, we can keep further discussion for those in the original issue or open a separate one as you prefer. Seeing https://github.com/sul-dlss/technical-metadata-service/issues/437#issuecomment-1512236617 I'm slightly worried that there could already be more users that adjusted their integrations to shell-escape the arguments which they will have to revert should this PR get merged and eventually released.

Finally, I was wondering about whether the *coe "continue on errors" flag should have an effect on error handling across arguments and matches in addition to just the walk-function? I didn't add anything in this direction here, as this will be better discussed separately, just leaving this here for now while it's fresh in my head.

richardlehane commented 1 year ago

thanks Bernhard!

richardlehane commented 1 year ago

can you move the os.Lstat call within the "if runtime.GOOS = "windows"" block so that linux users avoid the extra syscall?

just realised os.Lstat may be insufficient to test existence of files on windows due to the long path name issue, see e.g. this old issue https://github.com/richardlehane/siegfried/issues/58. I.e. if the given path is too long then a normal lstat will fail even if the path exists

suggest adding a new exists(path) bool function in the longpath and longpath_windows files. This would do a lstat on unix. On windows it would do lstat then the retrystat func defined in the same file.

You could then do if runtime.GOOS = "windows" && string.Contains("v", ....) && !exists(v) {...} else {...}

prettybits commented 1 year ago

can you move the os.Lstat call within the "if runtime.GOOS = "windows"" block so that linux users avoid the extra syscall?

Good call, totally overlooked that, done.

just realised os.Lstat may be insufficient to test existence of files on windows due to the long path name issue, see e.g. this old issue https://github.com/richardlehane/siegfried/issues/58. I.e. if the given path is too long then a normal lstat will fail even if the path exists

Right, I took the opportunity to follow the trail a bit and found that Go gradually added longpath handling since 1.8 and the current state even builds the native support available since Windows 10, Version 1607 into compiled Windows executables. At the same time, the handling implemented still leaves UNC paths untouched and doesn't handle relative paths, which makes understanding what the right thing to do really is quite confusing to me.

suggest adding a new exists(path) bool function in the longpath and longpath_windows files. This would do a lstat on unix. On windows it would do lstat then the retrystat func defined in the same file.

Given that the native longpath handling is incomplete I think this is probably still a good thing to do, thanks for the suggestion! I decided to implement this as tryStat(path string) error so that I have access to the *PathError and can display the correct error message. I thought it a good idea to wrap that error in a walkError, so that the messages are consistent.

Looking at the error display there I notice that the messages are somewhat redundant:

[FATAL] file access error for example- [1.epub: lstat example- [1.epub: no such file or directory

Do you think it might a good idea to do a errors.Unwrap(we.err) to not show the name twice?

mjgiarlo commented 1 year ago

@prettybits 💬

Seeing https://github.com/sul-dlss/technical-metadata-service/issues/437#issuecomment-1512236617 I'm slightly worried that there could already be more users that adjusted their integrations to shell-escape the arguments which they will have to revert should this PR get merged and eventually released.

👋🏻 Hi! In fact, I'm already looking at reverting ☝🏻 change as it broke some other stuff (unrelated to sf). We are eagerly interested in an sf patch as soon as that's available.

richardlehane commented 1 year ago

@prettybits 💬

Seeing sul-dlss/technical-metadata-service#437 (comment) I'm slightly worried that there could already be more users that adjusted their integrations to shell-escape the arguments which they will have to revert should this PR get merged and eventually released.

👋🏻 Hi! In fact, I'm already looking at reverting ☝🏻 change as it broke some other stuff (unrelated to sf). We are eagerly interested in an sf patch as soon as that's available.

I'll try to get a point release out over the weekend incorporating this fix + fixes for a few other things 1.10 broke (e.g. the deb packages)

mjgiarlo commented 1 year ago

@richardlehane Much obliged!