openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

JHOVE misidentifies text file as HTML due to angle bracket #345

Open daveneiman opened 6 years ago

daveneiman commented 6 years ago

Dev Effort

1D

Description

I'm not sure this will be fixed, but it seems like a good prompt to document JHOVE's criteria for distinguishing HTML from plain text and XML.

The application misidentifies a plain-text file as MIME type 'text/html' when an opening angle bracket '<' at the end of one line is followed by the word 'Title' at the beginning of the next line. When the '<' is moved anywhere else in the file, the reported MIME type is 'text/plain'.

Here is the content from the file:

United States- Central Intelligence Agency*

The Mediterranean basin — Scale ll 6f500000 ; Lambert conformal conic

proj. (W 21°—E 60O/N 49°--N 20°). —

[Washington : Central Intelligence Agency* 1986]

1 map : col. ; 39 X 108 cm. Countries area—tinted Includes notes '•300342 (A05054) 6-86*"

1* Mediterranean iiegioa—Maps< Title

10 NOV 95 CSSH HWTlsl 87-691121

400263499.txt

daveneiman commented 6 years ago

The '<' followed by 'Title' on the next line of the sample file appears to satisfy the heuristic for an HTML file (even though the file is plain text) implemented in edu.harvard.hul.ois.jhove.module.HtmlModule#checkSignatures(File file, InputStream stream, RepInfo info).
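
To make the suspected behaviour concrete, here is a minimal sketch of a signature check that would produce this result. It is not the JHOVE code: the class name, the method name, the signature list, and above all the assumption that whitespace is ignored and case is folded while matching are all hypothetical, chosen only because they are consistent with what is reported above.

```java
import java.util.Locale;

/**
 * Illustrative sketch only. This is NOT the JHOVE implementation of
 * HtmlModule#checkSignatures; every name and rule here is an assumption.
 */
class HtmlSignatureSketch {

    // Hypothetical signatures, stored lower-cased and without whitespace.
    private static final String[] SIGNATURES = {
        "<!doctypehtml", "<html", "<title"
    };

    /**
     * Returns true if any signature occurs in the text once all whitespace
     * (including line breaks) is removed and case is folded.
     * Assumption: the matcher is whitespace- and case-insensitive.
     */
    static boolean looksLikeHtml(String text) {
        String normalized = text.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
        for (String signature : SIGNATURES) {
            if (normalized.contains(signature)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // '<' at the end of one line, 'Title' at the start of the next.
        String reported = "1* Mediterranean iiegioa Maps<\nTitle\n";
        // Same text with the '<' moved elsewhere on the line.
        String moved = "1* Mediterranean iiegioa <Maps\nTitle\n";

        System.out.println(looksLikeHtml(reported)); // prints "true"
        System.out.println(looksLikeHtml(moved));    // prints "false"
    }
}
```

Under these assumptions the line break between '<' and 'Title' disappears during matching, so the text is indistinguishable from a '<title' tag, while moving the '<' breaks the match. Whether the real method behaves this way needs to be confirmed against the HtmlModule source.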

This seems to be a rare edge case that came up in the processing of our files.

MartinSpeller commented 4 years ago

JHOVE misidentifies text file as HTML due to angle bracket #345 - Assigned to TBA