praveenbankbazaar / httparchive

Automatically exported from code.google.com/p/httparchive

requests.mimeType contains inaccurate data #352

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
A simple query on image types indicates a fair number of inaccurately deduced 
mimeTypes:

select "mimeType", count(*) as ct from requests 
where "mimeType" like 'image/%'
group by "mimeType"
order by ct desc

(best run if mimeType has been indexed).

Some examples include
"image/*, image/gif", "image/2011/12/debbie ruston face"

We should be using a canonical list of mime types such as those listed by the 
IANA
http://www.iana.org/assignments/media-types/image
or the standard Apache list. This could be held in a lookup table, and the check 
would yield either a valid mime-type or "unknown".
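
As a rough illustration of the lookup-table idea, here is a minimal Python sketch. The set `KNOWN_IMAGE_TYPES` and the function `canonical_mime_type` are hypothetical names, not part of the httparchive code, and the set is only a small illustrative subset rather than the full IANA or Apache list.

```python
# Illustrative subset only (an assumption for this sketch); a real table
# would be loaded from the IANA registry or Apache's mime.types file.
KNOWN_IMAGE_TYPES = frozenset({
    "image/gif",
    "image/jpeg",
    "image/png",
    "image/svg+xml",
    "image/tiff",
})


def canonical_mime_type(raw):
    """Return the type if it is in the lookup table, else "unknown"."""
    if not raw:
        return "unknown"
    # Drop parameters such as "; charset=binary" and normalise case.
    candidate = raw.split(";", 1)[0].strip().lower()
    return candidate if candidate in KNOWN_IMAGE_TYPES else "unknown"
```

With this check, both examples quoted above ("image/*, image/gif" and "image/2011/12/debbie ruston face") would be recorded as "unknown", since neither is an exact entry in the table.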

I'm not sure where the detection of the mimeType occurs: in the DB code, or is it 
happening at WPT? It looks like batch_lib.importEntries() just stores 
something discovered upstream:

mysql_real_escape_string($content->{ 'mimeType' }) . "'")

Does content refer to response->content returned by fetchUrl()? 

Original issue reported on code.google.com by charlie....@clark-consulting.eu on 9 Jan 2013 at 11:08

GoogleCodeExporter commented 9 years ago
I've looked at this extensively. While there are irregular mimetypes, they 
comprise a very small number of samples in the crawl. (You requested count(*) 
in your query but didn't show that information. It should be very small.) 

The issue is garbage in - garbage out. If the website owner provides a 
nonstandard mimetype, then we have to do our best. I disagree that "unknown" 
would be better than "image/2011/12/debbie ruston face" for two reasons. One - 
the latter is a better approximation of the actual content. Two - we want to 
capture what website owners actually did - not what they should have done.

Marking "wontfix".

Original comment by stevesou...@gmail.com on 9 Jan 2013 at 8:16

GoogleCodeExporter commented 9 years ago
I wasn't sure where the data was coming from: the http-response or analysis of 
resp-content-type. As it's the former, I agree with you that the whole point of 
the archive is to record what was actually served.

Count is in there as "ct" so "pjpeg" and "bmp" are still more relevant than 
"svg". The rest, apart from favicons, is basically just noise. 

For documentation, it would be nice to know how resp-content-type and mime-type 
relate. For performance, moving from LIKE "%image%" to more explicit tests, 
preferably on mime-type, would be faster.
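
To make the difference concrete, here is a small sketch using Python's sqlite3 module as a stand-in for the real database. The table layout, the `requestid` column, the index name and the sample rows are all assumptions made up for illustration. A pattern with a leading wildcard generally cannot use a B-tree index on the column, whereas an exact match or IN list can.

```python
import sqlite3

# In-memory toy table; the schema and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE requests (requestid INTEGER PRIMARY KEY, "mimeType" TEXT)'
)
conn.execute('CREATE INDEX idx_requests_mimetype ON requests ("mimeType")')

SAMPLE = [
    "image/gif", "image/gif", "image/jpeg", "image/png", "image/pjpeg",
    "image/*, image/gif", "text/html", "application/javascript",
]
conn.executemany(
    'INSERT INTO requests ("mimeType") VALUES (?)', [(m,) for m in SAMPLE]
)

# Substring match: the leading % forces every row to be examined.
substring_count = conn.execute(
    """SELECT count(*) FROM requests WHERE "mimeType" LIKE '%image%'"""
).fetchone()[0]

# Explicit test: exact values that an index on mimeType can serve.
EXPLICIT = ("image/gif", "image/jpeg", "image/png")
placeholders = ", ".join("?" for _ in EXPLICIT)
explicit_count = conn.execute(
    f'SELECT count(*) FROM requests WHERE "mimeType" IN ({placeholders})',
    EXPLICIT,
).fetchone()[0]
```

Note that the two tests are not equivalent: the explicit list deliberately skips irregular values such as "image/pjpeg" or "image/*, image/gif", so the list has to name every type of interest.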

Original comment by charlie....@clark-consulting.eu on 9 Jan 2013 at 8:50