openpreserve / fido

Format Identification for Digital Objects (FIDO) is a Python command-line tool to identify the file formats of digital objects. It is designed for simple integration into automated work-flows.
http://openpreservation.org/technology/products/fido/

Fix fetching example URLs when updating signature #84

Closed Hwesta closed 7 years ago

Hwesta commented 7 years ago

When updating signatures, if a format has a ReferenceFileIdentifier of type URL, we include a reference to it, which involves fetching the URL and calculating a checksum of the response. However, ReferenceFileIdentifier is not consistent in its meaning or format.

For example, in PRONOM 88 the fmt/11 identifiers start with www (no scheme), and each URL points directly to a PNG file:

<ReferenceFileIdentifier>
  <Identifier>www.w3.org/Graphics/PNG/nurbcup2si.png</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>
...
<ReferenceFileIdentifier>
  <Identifier>www.w3.org/Graphics/PNG/666.png</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>

Compare this with fmt/569, where the identifier starts with http:// and points to an HTML page that links to examples rather than to an example file:

<ReferenceFileIdentifier>
  <Identifier>http://www.matroska.org/downloads/test_w1.html</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>

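When parsing the identifier, we unconditionally prepend http:// and fetch the result. This breaks for identifiers that already include a scheme: http://www.matroska.org/downloads/test_w1.html becomes http://http://www.matroska.org/downloads/test_w1.html.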

url = "http://" + get_text_tna(id, 'Identifier')
...
sock = urlopen(url)
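
One way to avoid the double prefix would be to add the scheme only when it is missing. The following is a minimal sketch in Python 3, not fido's actual code: normalize_reference_url is a hypothetical helper name, and the list of accepted schemes is an assumption.

```python
from urllib.parse import urlparse


def normalize_reference_url(identifier):
    """Return the identifier with http:// added only if it has no scheme.

    Hypothetical helper for illustration; not part of fido.
    """
    identifier = identifier.strip()
    # Assumption: only these schemes count as "already a full URL".
    if urlparse(identifier).scheme in ("http", "https", "ftp"):
        return identifier
    return "http://" + identifier
```

With this, both the fmt/11 and the fmt/569 identifiers would produce a fetchable URL, although the fmt/569 one would still resolve to an HTML page rather than an example file.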

Options include removing the examples and checksums from formats-v##.xml, or adding error handling around that section.
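
The error-handling option might look roughly like the sketch below. It is written for Python 3 and is not fido's actual code: fetch_checksum is a hypothetical helper, the opener parameter exists only so the function can be exercised without network access, and SHA-256 is an assumed digest (the algorithm used in the signature file may differ).

```python
import hashlib
from http.client import HTTPException
from urllib.request import urlopen


def fetch_checksum(url, opener=urlopen, timeout=10):
    """Return the SHA-256 hex digest of the content at url, or None on failure.

    Hypothetical helper for illustration; not part of fido.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            data = response.read()
        finally:
            response.close()
    except (OSError, ValueError, HTTPException):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs; HTTPException covers
        # low-level protocol errors such as an invalid host or port.
        return None
    return hashlib.sha256(data).hexdigest()
```

A caller could then skip the example reference whenever the result is None instead of aborting the whole signature update.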

jhsimpson commented 7 years ago

This issue appears to be fixed by #101