openpreserve / fido

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.
http://openpreservation.org/technology/products/fido/

convert PRONOM formats to FIDO signature fails #203

Closed jmcgranahan closed 2 years ago

jmcgranahan commented 2 years ago

This is on a CentOS 7.9.2009 server, running FIDO v1.3.4 (formats-v102.xml, container-signature-20160121.xml, format_extensions.xml)

I've been trying to update our fido-signatures (just by launching "fido-update-signatures"), and at the very end of the script, I am receiving the following error:

Traceback (most recent call last):
  File "/bin/fido-update-signatures", line 9, in <module>
    load_entry_point('opf-fido==1.3.4', 'console_scripts', 'fido-update-signatures')()
  File "/usr/lib/python2.7/site-packages/fido/update_signatures.py", line 155, in main
    prepare_pronom_to_fido()
  File "/usr/lib/python2.7/site-packages/fido/prepare.py", line 621, in main
    info.load_pronom_xml(args.puid)
  File "/usr/lib/python2.7/site-packages/fido/prepare.py", line 135, in load_pronom_xml
    format_ = self.parse_pronom_xml(stream, puid_filter)
  File "/usr/lib/python2.7/site-packages/fido/prepare.py", line 221, in parse_pronom_xml
    regex = convert_to_regex(bytes, 'Little', pos, offset, max_offset)
  File "/usr/lib/python2.7/site-packages/fido/prepare.py", line 578, in convert_to_regex
    raise Exception(_convert_err_msg('Illegal character in curly', chars[i], i, chars))
Exception: Conversion: Illegal character in curly: char=' ', at pos 197 in
  786D6C6E733A7064666169643D(22|27)687474703A2F2F7777772E6169696D2E6F72672F706466612F6E732F6964*7064666169643A636F6E666F726D616E6365(3E|3D22|3D27)41(22|27|3C2F7064666169643A636F6E666F726D616E63653E){ 0-120}7064666169643A70617274(3D22|3D27|3E)31(22|27|3C2F7064666169643A706172743E)
                                                                                                                                                                                                       ^
Buffer = (?s)xmlns:pdfaid=(?:"|')http://www\.aiim\.org/pdfa/ns/id.*pdfaid:conformance(?:\x3e|="|=')A(?:"|'|\x3c/pdfaid:conformance\x3e).{

I've not been able to find anything via Google on fixing this issue and I'm hesitant to update the prepare.py file that it mentions. Ideas?

mistydemeo commented 2 years ago

Looks like the issue is in PUID fmt/95; the original signature genuinely does have that extra space as of PRONOM 102. I wonder if it's an authoring mistake? I should let them know. https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=770&strPageToDisplay=signatures

Edit: I've reported it to them. I'll update here with what I hear.
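
For anyone who wants to check a signature for this kind of stray whitespace locally, here is a minimal sketch. It assumes the PRONOM byte sequence is available as a plain string; the helper names `find_spaced_curlies` and `strip_curly_whitespace` are hypothetical and not part of FIDO. It is a way to spot or normalise the problem on your own copy, not the upstream fix.

```python
import re

# Hypothetical helpers, not part of FIDO or PRONOM tooling.
# Matches a curly-brace group such as {0-120} or { 0-120}.
CURLY = re.compile(r"\{([^{}]*)\}")


def find_spaced_curlies(sequence):
    """Return (position, text) for each {...} group that contains whitespace."""
    return [
        (m.start(), m.group(0))
        for m in CURLY.finditer(sequence)
        if re.search(r"\s", m.group(1))
    ]


def strip_curly_whitespace(sequence):
    """Return the sequence with whitespace removed inside every {...} group."""
    return CURLY.sub(
        lambda m: "{" + re.sub(r"\s+", "", m.group(1)) + "}",
        sequence,
    )


# Shortened excerpt in the style of the failing signature from the traceback.
sequence = "41(22|27){ 0-120}7064"
print(find_spaced_curlies(sequence))
print(strip_curly_whitespace(sequence))
```

Running `find_spaced_curlies` over each byte sequence in a signature file would flag entries like the one above before the conversion step reaches them.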

Dclipsham commented 2 years ago

Thank you both. Yes, it effectively boils down to a copy/paste error, but as DROID copes with it we hadn't spotted it in our own testing. We'll aim to produce a fix in the coming days. Can we credit you/your institutions with reporting the issue?

David

jmcgranahan commented 2 years ago

Thanks David. I'm glad I wasn't going insane. :-) But yes, you can credit Vanderbilt University for reporting this issue initially. Thanks!

mistydemeo commented 2 years ago

> Can we credit you/your institutions with reporting the issue?

Thank you! I'm currently not affiliated with a digital preservation institution, so you can just credit me by name - Misty De Méo.

Dclipsham commented 2 years ago

This issue should now be resolved with the v104 update, which is now live and available to download. Could you please confirm either way whether the issue is now resolved for you?

mistydemeo commented 2 years ago

I've opened a PR to upgrade to PRONOM 104. I can confirm fido-update-signatures passes now. Thanks!

Dclipsham commented 2 years ago

Awesome, thanks @mistydemeo