Closed by GoogleCodeExporter 9 years ago
I've looked at this extensively. While there are irregular mimetypes, they
comprise a very small number of samples in the crawl. (You requested count(*)
in your query but didn't show that information. It should be very small.)
The issue is garbage in, garbage out. If the website owner provides a
nonstandard mimetype, then we have to do our best. I disagree that "unknown"
would be better than "image/2011/12/debbie ruston face" for two reasons. One:
the latter is a better approximation of the actual content. Two: we want to
capture what website owners actually did, not what they should have done.
Marking "wontfix".
Original comment by stevesou...@gmail.com
on 9 Jan 2013 at 8:16
I wasn't sure where the data was coming from: the HTTP response itself or an
analysis of resp-content-type. As it's the former, I agree with you that the
whole point of the archive is to record.
Count is in there as "ct" so "pjpeg" and "bmp" are still more relevant than
"svg". The rest, apart from favicons, is basically just noise.
For documentation, it would be nice to know how resp-content-type and mime-type
relate. For performance, moving from LIKE "%image%" to more explicit tests,
preferably on mime-type, would be faster.
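
To illustrate the difference between the two kinds of test, here is a minimal
sketch using Python's built-in sqlite3 module. The table name "requests", the
column names "mimeType" and "resp_content_type", the sample rows, and the list
of known image types are all assumptions made for the example, not the
archive's actual schema or data. The general point is that a pattern with a
leading wildcard usually cannot use an ordinary index, while an equality or IN
test on an indexed column can.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (mimeType TEXT, resp_content_type TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        ("image/jpeg", "image/jpeg"),
        ("image/jpeg", "image/jpeg; charset=binary"),
        ("image/pjpeg", "image/pjpeg"),
        ("image/bmp", "image/bmp"),
        ("image/svg+xml", "image/svg+xml"),
        ("text/html", "text/html; charset=utf-8"),
        ("image/2011/12/debbie ruston face", "image/2011/12/debbie ruston face"),
    ],
)
conn.execute("CREATE INDEX idx_requests_mime ON requests (mimeType)")

# Substring test: matches anything containing "image", including noise.
like_counts = conn.execute(
    "SELECT mimeType, COUNT(*) AS ct FROM requests "
    "WHERE resp_content_type LIKE '%image%' "
    "GROUP BY mimeType ORDER BY ct DESC, mimeType"
).fetchall()

# Explicit test: only an assumed list of known image types.
KNOWN_IMAGE_TYPES = ("image/jpeg", "image/pjpeg", "image/bmp", "image/svg+xml")
placeholders = ",".join("?" * len(KNOWN_IMAGE_TYPES))
explicit_counts = conn.execute(
    "SELECT mimeType, COUNT(*) AS ct FROM requests "
    f"WHERE mimeType IN ({placeholders}) "
    "GROUP BY mimeType ORDER BY ct DESC, mimeType",
    KNOWN_IMAGE_TYPES,
).fetchall()

print(like_counts)
print(explicit_counts)
```

The explicit query drops the irregular values by construction, so it suits
reporting on well-known types. The LIKE query is still the one to use when the
goal is to see what website owners actually sent.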
Original comment by charlie....@clark-consulting.eu
on 9 Jan 2013 at 8:50
Original issue reported on code.google.com by
charlie....@clark-consulting.eu
on 9 Jan 2013 at 11:08