praveenbankbazaar / httparchive

Automatically exported from code.google.com/p/httparchive

cleanup db schema #65

Closed - GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Now that we see how it's used, make some optimizations to the MySQL schema:
  - Do we need both url and urlShort in both pages and requests? We probably don't want text blobs in any table that's heavily used. 
  - Don't need "archive" field in pages.
  - Probably don't need "harfile" column in pages - we can construct this based on label and url.
  - urlHtml* in pages seems redundant. Replace it with a requestid?
  - Long term - we might want a separate table for each web site. Use merged tables for aggregate stats? (See the sketch below.)
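
For the last point, MyISAM's MERGE engine is presumably what's meant: identical per-site MyISAM tables unioned behind a single table for aggregate queries. A minimal sketch - the table names and the column subset are illustrative, not the actual schema:

-- per-site tables must share an identical definition
CREATE TABLE pages_site1 (wptid varchar(64), onLoad int unsigned) ENGINE=MyISAM;
CREATE TABLE pages_site2 (wptid varchar(64), onLoad int unsigned) ENGINE=MyISAM;
-- the MERGE table presents the union of the underlying tables
CREATE TABLE pages_all (wptid varchar(64), onLoad int unsigned)
  ENGINE=MERGE UNION=(pages_site1, pages_site2) INSERT_METHOD=LAST;

A query such as select avg(onLoad) from pages_all would then aggregate across every site's table.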

Original issue reported on code.google.com by stevesou...@gmail.com on 18 Nov 2010 at 12:02

GoogleCodeExporter commented 9 years ago
req_user_agent might be another candidate for normalization as it will change 
only as often as WebPageTest does updates. 17 distinct strings currently 
account for ~2.6 GB, or just about 25% of the size of the requests table.

select req_user_agent, count(*) from requests group by 1 order by 2 desc;
select sum(length(req_user_agent)) from requests;
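
A sketch of what that normalization could look like - the user_agents table and its column names are hypothetical, not part of the actual schema:

CREATE TABLE user_agents (
  `id` smallint unsigned NOT NULL AUTO_INCREMENT,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `value` (`value`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

-- seed the lookup table; requests would then store a 2-byte id
-- in place of the full string
insert into user_agents (value)
select distinct req_user_agent from requests;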

Original comment by jbyers on 20 Apr 2011 at 4:41

GoogleCodeExporter commented 9 years ago

Original comment by stevesou...@gmail.com on 18 Jul 2011 at 10:04

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Based on the comments I would drop redundant information from pages, make wptid 
the primary key and add a sites table. As long as the indices are respected 
this would reduce the size of the database and not impact query performance. 

CREATE TABLE `pages` (
  `createDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `label` date DEFAULT NULL,
  `wptid` varchar(64) NOT NULL,
  `wptrun` int(2) unsigned NOT NULL,
  `title` varchar(255) NOT NULL,
  `startedDateTime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `renderStart` int(10) unsigned DEFAULT NULL,
  `onContentLoaded` int(10) unsigned DEFAULT NULL,
  `onLoad` int(10) unsigned DEFAULT NULL,
  `PageSpeed` int(4) unsigned DEFAULT NULL,
  `rank` int(10) unsigned DEFAULT NULL,
  `reqTotal` int(4) unsigned NOT NULL,
  `reqHtml` int(4) unsigned NOT NULL,
  `reqJS` int(4) unsigned NOT NULL,
  `reqCSS` int(4) unsigned NOT NULL,
  `reqImg` int(4) unsigned NOT NULL,
  `reqFlash` int(4) unsigned NOT NULL,
  `reqJson` int(4) unsigned NOT NULL,
  `reqOther` int(4) unsigned NOT NULL,
  `bytesTotal` int(10) unsigned NOT NULL,
  `bytesHtml` int(10) unsigned NOT NULL,
  `bytesJS` int(10) unsigned NOT NULL,
  `bytesCSS` int(10) unsigned NOT NULL,
  `bytesImg` int(10) unsigned NOT NULL,
  `bytesFlash` int(10) unsigned NOT NULL,
  `bytesJson` int(10) unsigned NOT NULL,
  `bytesOther` int(10) unsigned NOT NULL,
  `numDomains` int(4) unsigned NOT NULL,
  `siteid` int(11) NOT NULL,
  PRIMARY KEY (`wptid`),
  UNIQUE KEY `label` (`label`,`siteid`),
  KEY `site` (`siteid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE `sites` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) NOT NULL,
  `added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `newindex` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Original comment by charlie....@clark-consulting.eu on 24 May 2012 at 10:22

GoogleCodeExporter commented 9 years ago
I've just run some tests using this proposed schema on the last year of data. 
The size of pages drops from 344 MB to 190 MB, with 8 MB for the ancillary 
`sites` table. The size of the index drops only slightly - to be expected - 
and `sites` has an index of 11 MB. I would expect the indices to 
significantly speed up any time- or site-based queries. 

Sample migration within the same DB to `pages_new`:

-- populate sites with each distinct URL and the date it was first seen
insert into sites (url, added)
select
  url,
  min(str_to_date(`pages`.`label`, '%M %d %Y')) as added
from pages
where url != ''
group by url;

-- migrate to the new pages table
insert into pages_new
  (createDate, label, wptid, wptrun, title, siteid,
   startedDateTime, renderStart, onContentLoaded, onLoad, PageSpeed,
   reqTotal, reqHtml, reqJS, reqCSS, reqImg, reqFlash, reqJson, reqOther,
   bytesTotal, bytesHtml, bytesJS, bytesCSS, bytesImg, bytesFlash,
   bytesJson, bytesOther, numDomains)
select
  from_unixtime(createDate) as createDate,
  str_to_date(`pages`.`label`, '%M %d %Y') as label,
  wptid,
  wptrun,
  title,
  sites.id,
  from_unixtime(startedDateTime) as startedDateTime,
  renderStart,
  onContentLoaded,
  onLoad,
  PageSpeed,
  reqTotal, reqHtml, reqJS, reqCSS, reqImg, reqFlash, reqJson, reqOther,
  bytesTotal, bytesHtml, bytesJS, bytesCSS, bytesImg, bytesFlash,
  bytesJson, bytesOther,
  numDomains
from pages
inner join sites on (sites.url = pages.url)
where wptid != '';

Original comment by charlie....@clark-consulting.eu on 24 May 2012 at 3:02

GoogleCodeExporter commented 9 years ago
BTW 27 rows with empty URLs get dropped.
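
One way to count the rows the migration skips - a hypothetical sanity check, not part of the original comment:

-- rows excluded by the empty-url and empty-wptid filters above
select count(*) from pages where url = '' or wptid = '';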

Original comment by charlie....@clark-consulting.eu on 24 May 2012 at 3:03

GoogleCodeExporter commented 9 years ago
Having checked out the source I've made an initial stab at profiling viewpage. 
Simply using my suggested schema significantly boosts speed:

|        2 | 1.08975000 | select * from pages where url='http://www.ibm.com/' and label='May 15 2012' |
|        3 | 0.00048200 | select * from pages_new inner join sites on (sites.id = pages_new.siteid) where sites.url = 'http://www.ibm.com/' and label = '2012-05-01' |
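
These timings are the kind produced by MySQL's built-in profiler; a minimal way to reproduce such a comparison (standard MySQL commands, not project code):

set profiling = 1;
select * from pages where url = 'http://www.ibm.com/' and label = 'May 15 2012';
select * from pages_new
  inner join sites on (sites.id = pages_new.siteid)
  where sites.url = 'http://www.ibm.com/' and label = '2012-05-01';
show profiles;  -- lists each statement with its query id and duration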

Original comment by charlie....@clark-consulting.eu on 27 May 2012 at 1:46

GoogleCodeExporter commented 9 years ago
Not directly related to the schema but 
`$hStats = getStats($gLabel, $gSlice, ($gbMobile ? "iphone" : "IE8"), $url);`
involves another call to the page query in getStatsDataForUrl()

It would make sense not to run the same query again but simply to pass in the 
existing result set. Maybe getStatsDataForRow()?

Original comment by charlie....@clark-consulting.eu on 27 May 2012 at 3:04

GoogleCodeExporter commented 9 years ago
Analogous to my schema, simply adding indices for url and label to `pages` 
significantly improves query speeds - by a factor of 100-200. As the maximum 
length for domain names is 253 characters, this can be used as the index 
length (253 + 11 for the scheme and www), or better still the column can be 
converted to a varchar of 264. As I think MyISAM enforces column length 
constraints by silently truncating, application logic should check lengths 
when rows are added; increasing the column length itself is non-destructive. 
The indices must be added to all tables that use these columns, ie. assume 
real foreign key support. This would make the micro-optimisation on the URL 
index redundant. That matters because ad-hoc queries without the foreign key, 
such as calculating max_age caching, otherwise involve significant 
calculations.
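
A sketch of that minimal change to the existing table - the index names are illustrative, and it assumes no stored url exceeds 264 characters:

alter table pages
  modify url varchar(264) not null,
  add index url_idx (url),
  add index label_idx (label);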

Original comment by charlie....@clark-consulting.eu on 29 May 2012 at 10:24

GoogleCodeExporter commented 9 years ago
This bug has captured some good thoughts and comments. I've started working on 
changing the schema so am closing this bug because it's too general. 

From the initial comments we DO need urlShort for now because it's in the 
index and has to be < 255 characters, but I hope to replace that later with a 
hash. We DO need the archive field, but I've removed harfile and urlHtml*.
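
One possible shape for that later hash-based replacement - purely illustrative; the column name and hash function are assumptions, not the project's plan:

-- a fixed-width hash can be indexed cheaply no matter how long the URL is
alter table pages add column urlhash binary(16);
update pages set urlhash = unhex(md5(url));
alter table pages add index urlhash_idx (urlhash);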

Right now I don't want to do any normalization as it complicates doing ad hoc 
queries as well as complicating downloading & setting up a replica. We already 
have a lot of tables that have to be installed (pages, requests, crawls, & 
stats). Also, my investigations show that, while there IS a huge amount of 
redundancy, normalizing the tables doesn't have a significant impact - e.g. 
it doesn't cut the size by 50% or more.

Original comment by stevesou...@gmail.com on 10 Jan 2013 at 8:49

GoogleCodeExporter commented 9 years ago
"Right now I don't want to do any normalization as it complicates doing ad hoc 
queries as well as complicating downloading & setting up a replica."

I have a real problem with this as I think not only is it untrue, it does not 
make sense: normalisation generally makes queries easier. People setting up 
local instances just run the schema and import the data by script so the number 
of tables is not relevant.

pages.urlShort can be dropped forthwith. A 255-character-wide index on 
pages.url is sufficient, though it is probably better to convert the column 
from text to varchar. Hash values are not needed. requests.url probably has 
to stay a blob because query strings can be almost unlimited. I'm not sure 
what requests.urlShort is for, except perhaps for domain and protocol 
analysis, which would both be better normalised.
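
For example - a hypothetical sketch; MySQL requires an explicit prefix length when indexing a text column, which MyISAM supports:

alter table pages drop column urlShort, add index url_prefix_idx (url(255));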

While there is potential for normalising pages, the elephant in the room is 
requests which cannot be significantly normalised. I agree with your decision 
to include run statistics in the pages table. I suppose that this might be 
considered denormalisation if requests were anything but a log. While data 
mining on requests is possible, creating more derived tables or views probably 
makes more sense.
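
A hypothetical example of such a derived view - it assumes requests carries a pageid column linking back to pages, and the domain extraction is illustrative only:

-- a view per analysis avoids hand-writing the same extraction in ad-hoc queries
create view request_domains as
select pageid,
       substring_index(substring_index(url, '/', 3), '/', -1) as domain
from requests;

MySQL views are not materialized, so frequently used aggregates may still be better off as derived tables.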

Original comment by charlie....@clark-consulting.eu on 10 Jan 2013 at 9:13