richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Filenames containing ? give warning : 'extension mismatch' #129

Closed: workflowsguy closed this issue 5 years ago

workflowsguy commented 5 years ago

When files are processed with sf, those whose names contain a question mark at the end of the base name (just before the extension) are identified with the correct type, but an "extension mismatch" warning is still output, viz.

sf "/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3"
---
siegfried   : 1.7.12
scandate    : 2019-06-24T16:27:08+02:00
signature   : default.sig
created     : 2019-06-15T12:22:38+02:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V95.xml; container-signature-20180917.xml'
---
filename : '/Volumes/Public/bearbeiten/Dateien/ermitteln Dateityp/Salzburger Nachtstudio.2019-06-19 - Kulturkampf im Klassenzimmer?.mp3'
filesize : 74564436
modified : 2019-06-21T17:03:54+02:00
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/134'
    format  : 'MPEG 1/2 Audio Layer 3'
    version : 
    mime    : 'audio/mpeg'
    basis   : 'byte match at [[0 3] [74560365 1151] [74562035 1151] [74563705 3]] (signature 1/8)'
    warning : 'extension mismatch'

I am running on macOS, where ? is an allowed character for filenames.

Thanks!

richardlehane commented 5 years ago

Thanks for this report, workflowsguy, an interesting bug! I'll look into it.

richardlehane commented 5 years ago

I've found the offending code: https://github.com/richardlehane/siegfried/blob/master/internal/namematcher/namematcher.go#L149

The issue is that some filenames are embedded in URLs (because of WARC scanning). Where sf thinks the name is a URL, it strips everything following a "?", because in a URL that marks the start of the query string. For example, it is trying to extract the name from a string like "http://www.mysite.com/file.pdf?user=richard".

But in your case where the ? is legitimately part of a regular file name, this is breaking extension matching.
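
To illustrate the failure mode described above, here is a minimal sketch. It is not the actual namematcher.go code: the helper names extStripQuery and extURLAware are hypothetical, and using "://" as the test for "looks like a URL" is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extStripQuery mimics the behaviour described in this thread: everything
// from the first "?" onwards is treated as a URL query string and dropped
// before the extension is read. Hypothetical helper, not siegfried's code.
func extStripQuery(name string) string {
	if i := strings.Index(name, "?"); i >= 0 {
		name = name[:i]
	}
	return strings.ToLower(strings.TrimPrefix(filepath.Ext(name), "."))
}

// extURLAware drops the query string only when the name looks like a URL.
// The "://" check is an assumed heuristic chosen for illustration.
func extURLAware(name string) string {
	if strings.Contains(name, "://") {
		if i := strings.Index(name, "?"); i >= 0 {
			name = name[:i]
		}
	}
	return strings.ToLower(strings.TrimPrefix(filepath.Ext(name), "."))
}

func main() {
	local := "Kulturkampf im Klassenzimmer?.mp3"
	url := "http://www.mysite.com/file.pdf?user=richard"

	// Local file: unconditional stripping loses the extension.
	fmt.Printf("%q %q\n", extStripQuery(local), extURLAware(local))
	// URL: both variants recover "pdf".
	fmt.Printf("%q %q\n", extStripQuery(url), extURLAware(url))
}
```

With unconditional stripping the local name yields an empty extension, which would not match the "mp3" expected for fmt/134, so a mismatch warning would follow. The URL-aware variant returns "mp3" for the local file and "pdf" for the URL.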

I'll have a think about how to re-jig this bit of the code to fix it.