praveenbankbazaar / httparchive

Automatically exported from code.google.com/p/httparchive

viewsite.php select list is NOT accurate #359

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The select list in viewsite.php where the user can select the desired crawl 
includes ALL the crawls, regardless of whether or not the URL was included in 
that crawl. We used to have a more accurate select list using 
archiveLabelsForUrlSLOW but it was too slow. We could fix this by adding 
urlhash to the "pages" table. See related bug.

Original issue reported on code.google.com by stevesou...@gmail.com on 27 Jan 2013 at 11:46

GoogleCodeExporter commented 9 years ago
I just checked my implementation: it does not include any crawls in which the
site was absent. This means that there are no "gaps" in any of the charts. I am
not sure whether this should be corrected.

For an example see http://bayou.clark-consulting.eu/site/104183/

If I do, it would be with a lookup table of labels stored as dates (because
crawls are not publicly available) and an outer join. That should have a minimal
effect on performance. "label" could then be normalised out of pages.

Original comment by charlie....@clark-consulting.eu on 4 Mar 2013 at 7:33

GoogleCodeExporter commented 9 years ago
I've now added this to my implementation, e.g.
http://www.mamasnewbag.org/site/15970/ - this site appears in only two crawls,
but the charts cover the whole set of crawls.

The easiest way to do this is to add a view with the dates:

CREATE VIEW date_range AS SELECT DISTINCT label FROM pages
-- ORDER BY label DESC # when label is type DATE
;

For the select list you can simply restrict the query to the site in question:

SELECT label FROM pages
WHERE urlShort = ? 

For the results you can use an outer join against this view, which removes the
need to pad the result sets for charts:

SELECT date_range.label, pages.* FROM date_range
LEFT JOIN pages ON
(pages.label = date_range.label
AND
pages.urlShort = ?)

Sorting is still best accomplished by using the date type for label.
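
A minimal sketch of the view and the two queries above, run through Python's sqlite3 module so that it is self-contained. The cut-down table layout, the sample rows, and the function names crawl_labels and chart_rows are assumptions made for illustration. The real pages table has many more columns, and another database engine may behave differently.

```python
import sqlite3

# In-memory database with a cut-down, hypothetical "pages" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (
        pageid   INTEGER PRIMARY KEY,
        label    TEXT NOT NULL,   -- crawl date as ISO text, so it sorts as a date
        urlShort TEXT NOT NULL
    );
    INSERT INTO pages (label, urlShort) VALUES
        ('2013-01-01', 'http://a.example/'),
        ('2013-01-15', 'http://a.example/'),
        ('2013-02-01', 'http://a.example/'),
        ('2013-01-01', 'http://b.example/'),
        ('2013-02-01', 'http://b.example/');

    -- One row per crawl, derived from the pages table itself.
    CREATE VIEW date_range AS SELECT DISTINCT label FROM pages;
""")


def crawl_labels(conn, url):
    """Labels of the crawls that include the URL (for the select list)."""
    rows = conn.execute(
        "SELECT DISTINCT label FROM pages WHERE urlShort = ? ORDER BY label DESC",
        (url,),
    )
    return [row[0] for row in rows]


def chart_rows(conn, url):
    """One row per crawl; pageid is None where the URL was not crawled."""
    rows = conn.execute(
        """
        SELECT date_range.label, pages.pageid
        FROM date_range
        LEFT JOIN pages
          ON pages.label = date_range.label AND pages.urlShort = ?
        ORDER BY date_range.label
        """,
        (url,),
    )
    return rows.fetchall()


print(crawl_labels(conn, "http://b.example/"))
print(chart_rows(conn, "http://b.example/"))
```

The URL filter sits in the ON clause rather than in a WHERE clause. Moving it to WHERE would discard the NULL-padded rows and turn the outer join back into an inner join.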

Original comment by charlie....@clark-consulting.eu on 21 Mar 2013 at 11:26

GoogleCodeExporter commented 9 years ago
Added urlhash to the pages table and created an index on it, so we can now
search based on urlhash. Much faster and more accurate.

Original comment by stevesou...@gmail.com on 21 Jul 2013 at 7:58