richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

unescaped Yaml field with quote #100

Closed ross-spencer closed 7 years ago

ross-spencer commented 7 years ago

Related to #30?

I've this block:

  filename : 'P:\Working Copies\Non-records\090826_1019 (D)\BBC-0030\ocr\0005437^FullText.TXT'
  filesize : 853
  modified : 2009-08-26T08:45:23+12:00
  errors   : 
  matches  :
    - ns      : pronom
      id      : x-fmt/111
      format  : 'Plain Text File'
      version : 
      mime    : 'text/plain'
      basis   : 'extension match txt; text match ISO-8859'
      warning : 
    - ns      : tika
      id      : 'text/plain'
      format  : 
      mime    : 'text/plain'
      basis   : 'extension match txt; text match ISO-8859; text match ISO-8859'
      warning : 'match on filename and text only; byte/xml signatures for this format did not match'
    - ns      : freedesktop.org
      id      : 'audio/x-ape'
      format  : 'Monkey's audio'
      mime    : 'audio/x-ape'
      basis   : 'byte match at 0, 4'
      warning : 'filename mismatch'

I think the single quote needs escaping in "Monkey's audio"... but I'm not sure. I do get some errors trying to parse the YAML online, e.g. with http://yaml-online-parser.appspot.com/, and the original issue was spotted while parsing the structure into SQLite. I will likely add an escape to my code as well, but this may catch others out.
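For reference, YAML single-quoted scalars escape an embedded single quote by doubling it, so the format line would need to read `'Monkey''s audio'`. A minimal Python sketch of that rule follows; the helper name `quote_yaml_single` is hypothetical and is not taken from siegfried or from the analysis engine mentioned in this thread.

```python
def quote_yaml_single(value: str) -> str:
    """Return value as a YAML single-quoted scalar.

    Inside single quotes, YAML treats two consecutive single quotes
    as one literal quote, so each embedded quote is doubled.
    (Hypothetical helper for illustration only.)
    """
    return "'" + value.replace("'", "''") + "'"


# The field value from the report above.
print("format  : " + quote_yaml_single("Monkey's audio"))
```

Values without an embedded quote pass through unchanged apart from the surrounding quotes, so the same helper could be applied to every quoted field.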

PS. I can't help but laugh at the mimetype!

richardlehane commented 7 years ago

thanks @ross-spencer - yep, I'll need to add escaping for the format field (I think it just quotes the field at present). Looks like a second little bug in the basis field for the tika ID too... the text match info appears twice for some reason.

Was it actually a Monkey's audio in the end, or was that a false positive?

ross-spencer commented 7 years ago

I escaped my code too: https://github.com/exponential-decay/droid-siegfried-sqlite-analysis-engine/issues/39 though there's a better way to do it in Python that I need to investigate.

Unfortunately it was only a false positive; the sig in Freedesktop's file is:

  <magic priority="50">
  <match value="MAC " type="string" offset="0"/>
  </magic>
  <glob pattern="*.ape" weight="50"/>

The txt file happens to be an OCR of a PDF that starts with the word MASTERTON, which has been recognized by the OCR engine as MAC. (A whole stack of things to untangle here!)
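To illustrate why such a short signature misfires, here is a rough Python sketch of a string match at a fixed offset. It assumes the rule simply compares the four bytes `MAC ` at offset 0, as the magic entry above suggests; the function name `matches_magic` and the sample byte strings are made up for illustration and are not the real file contents.

```python
def matches_magic(data: bytes, value: bytes, offset: int = 0) -> bool:
    """Return True if data contains value exactly at offset.

    A simplified stand-in for a fixed-offset string match;
    hypothetical helper, not siegfried's matcher.
    """
    return data[offset:offset + len(value)] == value


MAGIC = b"MAC "  # value from the Freedesktop magic entry above

# Made-up sample text: OCR output that begins with "MAC ".
ocr_output = b"MAC District Council"
# Made-up sample text: what a correct OCR reading might begin with.
correct_text = b"MASTERTON District Council"

print(matches_magic(ocr_output, MAGIC))
print(matches_magic(correct_text, MAGIC))
```

Any plain text file that happens to start with those four characters would satisfy a rule like this, which is consistent with the 'byte match at 0, 4' basis and the 'filename mismatch' warning reported above.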

ross-spencer commented 7 years ago

Oh! I just noticed I got issue #100 :D (a good day!)

richardlehane commented 7 years ago

fixed with 1.7.3 release