req_user_agent might be another candidate for normalization, as it changes
only when WebPageTest itself is updated. 17 distinct strings currently
account for ~2.6 GB, or roughly 25% of the size of the requests table:
select req_user_agent, count(*) from requests group by 1 order by 2 desc;
select sum(length(req_user_agent)) from requests;
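A minimal sketch of that normalization, assuming a hypothetical `user_agents` lookup table and `user_agent_id` column (both names are illustrative, not part of the current schema):
-- build a lookup table for the handful of distinct strings
CREATE TABLE `user_agents` (
  `id` smallint unsigned NOT NULL AUTO_INCREMENT,
  `ua` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `ua` (`ua`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO user_agents (ua)
SELECT DISTINCT req_user_agent FROM requests;

-- requests then carries a 2-byte id instead of the full string
ALTER TABLE requests ADD COLUMN user_agent_id smallint unsigned;
UPDATE requests r
INNER JOIN user_agents u ON u.ua = r.req_user_agent
SET r.user_agent_id = u.id;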
Original comment by jbyers
on 20 Apr 2011 at 4:41
Based on the comments I would drop redundant information from pages, make wptid
the primary key, and add a sites table. As long as the indices are respected,
this would reduce the size of the database without hurting query performance.
CREATE TABLE `pages` (
  `createDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `label` date DEFAULT NULL,
  `wptid` varchar(64) NOT NULL,
  `wptrun` int(2) unsigned NOT NULL,
  `title` varchar(255) NOT NULL,
  `startedDateTime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `renderStart` int(10) unsigned DEFAULT NULL,
  `onContentLoaded` int(10) unsigned DEFAULT NULL,
  `onLoad` int(10) unsigned DEFAULT NULL,
  `PageSpeed` int(4) unsigned DEFAULT NULL,
  `rank` int(10) unsigned DEFAULT NULL,
  `reqTotal` int(4) unsigned NOT NULL,
  `reqHtml` int(4) unsigned NOT NULL,
  `reqJS` int(4) unsigned NOT NULL,
  `reqCSS` int(4) unsigned NOT NULL,
  `reqImg` int(4) unsigned NOT NULL,
  `reqFlash` int(4) unsigned NOT NULL,
  `reqJson` int(4) unsigned NOT NULL,
  `reqOther` int(4) unsigned NOT NULL,
  `bytesTotal` int(10) unsigned NOT NULL,
  `bytesHtml` int(10) unsigned NOT NULL,
  `bytesJS` int(10) unsigned NOT NULL,
  `bytesCSS` int(10) unsigned NOT NULL,
  `bytesImg` int(10) unsigned NOT NULL,
  `bytesFlash` int(10) unsigned NOT NULL,
  `bytesJson` int(10) unsigned NOT NULL,
  `bytesOther` int(10) unsigned NOT NULL,
  `numDomains` int(4) unsigned NOT NULL,
  `siteid` int(11) NOT NULL,
  PRIMARY KEY (`wptid`),
  UNIQUE KEY `label` (`label`,`siteid`),
  KEY `site` (`siteid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `sites` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) NOT NULL,
  `added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `newindex` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Original comment by charlie....@clark-consulting.eu
on 24 May 2012 at 10:22
I've just run some tests using this proposed schema on the last year of data.
The size of pages drops from 344 MB to 190 MB, with 8 MB for the ancillary
`sites` table. The size of the index drops only slightly, which is to be
expected, and `sites` has an index of 11 MB. I would expect the indices to
significantly speed up any time- or site-based queries.
Sample migration within the same DB to `pages_new`:
-- populate sites with each distinct URL and the date it first appeared
insert into sites (url, added)
select
  url,
  min(str_to_date(`pages`.`label`,'%M %d %Y')) AS added
from pages
where url != ''
group by url;
-- migrate to the new pages table
insert into pages_new
(createDate,
  label,
  wptid,
  wptrun,
  title,
  siteid,
  startedDateTime,
  renderStart,
  onContentLoaded,
  onLoad,
  PageSpeed,
  reqTotal,
  reqHtml,
  reqJS,
  reqCSS,
  reqImg,
  reqFlash,
  reqJson,
  reqOther,
  bytesTotal,
  bytesHtml,
  bytesJS,
  bytesCSS,
  bytesImg,
  bytesFlash,
  bytesJson,
  bytesOther,
  numDomains)
select
  from_unixtime(createDate) as createDate,
  str_to_date(`pages`.`label`,'%M %d %Y') AS label,
  wptid,
  wptrun,
  title,
  sites.id,
  from_unixtime(startedDateTime) as startedDateTime,
  renderStart,
  onContentLoaded,
  onLoad,
  PageSpeed,
  reqTotal,
  reqHtml,
  reqJS,
  reqCSS,
  reqImg,
  reqFlash,
  reqJson,
  reqOther,
  bytesTotal,
  bytesHtml,
  bytesJS,
  bytesCSS,
  bytesImg,
  bytesFlash,
  bytesJson,
  bytesOther,
  numDomains
from pages
inner join sites on (sites.url = pages.url)
where wptid != '';
Original comment by charlie....@clark-consulting.eu
on 24 May 2012 at 3:02
BTW 27 rows with empty URLs get dropped.
Original comment by charlie....@clark-consulting.eu
on 24 May 2012 at 3:03
Having checked out the source, I've made an initial stab at profiling viewpage.
Simply using my suggested schema significantly boosts speed:
| Query_ID | Duration   | Query                                                 |
|        2 | 1.08975000 | select * from pages                                   |
|          |            |   where url='http://www.ibm.com/'                     |
|          |            |   and label='May 15 2012'                             |
|        3 | 0.00048200 | select * from pages_new                               |
|          |            |   inner join sites on (sites.id = pages_new.siteid)   |
|          |            |   where sites.url = 'http://www.ibm.com/'             |
|          |            |   and label = '2012-05-01'                            |
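The timings above look like output from MySQL's session profiler; a sketch of reproducing them:
SET profiling = 1;
select * from pages where url='http://www.ibm.com/' and label='May 15 2012';
select * from pages_new
inner join sites on (sites.id = pages_new.siteid)
where sites.url = 'http://www.ibm.com/' and label = '2012-05-01';
SHOW PROFILES;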
Original comment by charlie....@clark-consulting.eu
on 27 May 2012 at 1:46
Not directly related to the schema, but
`$hStats = getStats($gLabel, $gSlice, ($gbMobile ? "iphone" : "IE8"), $url);`
involves another call to the page query in getStatsDataForUrl(). It would make
sense not to run the same query again and simply to pass in the existing
resultset. Maybe getStatsDataForRow()?
Original comment by charlie....@clark-consulting.eu
on 27 May 2012 at 3:04
Analogous to my schema, simply adding indices for url and label to `pages`
significantly improves query speeds, by a factor of 100 - 200. As the maximum
length for domain names is 253 characters, that can be used as the index length
(253 + 11 for scheme and www), or better still the column can be converted to a
varchar of 264. As I think MyISAM enforces column length constraints by
silently truncating, application logic should check lengths when rows are
added; increasing the column length itself is non-destructive. The indices must
be added to all tables that use these columns, i.e. assume real foreign-key
support; this would make the micro-optimisation on the URL index redundant.
That matters because, without the foreign key, ad-hoc queries such as
calculating max_age caching involve significant calculations.
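A sketch of that change, assuming the column is widened rather than prefix-indexed (the index name is illustrative):
-- widen url and index it together with label, per the length reasoning above
ALTER TABLE pages MODIFY url varchar(264) NOT NULL;
ALTER TABLE pages ADD INDEX url_label (url, label);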
Original comment by charlie....@clark-consulting.eu
on 29 May 2012 at 10:24
This bug has captured some good thoughts and comments. I've started working on
changing the schema, so I'm closing this bug because it's too general.
From the initial comments: we DO need urlShort for now because it's in the
index and has to be < 255 characters, but I hope to replace that later with a
hash. We DO need the archive field, but I've removed harfile and urlHtml*.
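A hash-based replacement could look roughly like this; the `urlHash` column and the choice of md5 are assumptions for illustration, not the actual plan:
-- store a fixed-width digest of the full url and index that instead of urlShort
ALTER TABLE pages ADD COLUMN urlHash binary(16) NOT NULL;  -- assumed column name
UPDATE pages SET urlHash = unhex(md5(url));
ALTER TABLE pages ADD INDEX urlHash (urlHash);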
Right now I don't want to do any normalization as it complicates doing ad hoc
queries as well as complicating downloading & setting up a replica. We already
have a lot of tables that have to be installed (pages, requests, crawls, &
stats). Also, my investigations show that, while there IS a huge amount of
redundancy, normalizing the tables doesn't have a significant impact - e.g. it
doesn't cut the size by 50% or more.
Original comment by stevesou...@gmail.com
on 10 Jan 2013 at 8:49
"Right now I don't want to do any normalization as it complicates doing ad hoc
queries as well as complicating downloading & setting up a replica."
I have a real problem with this, as I think it is not only untrue but does not
make sense: normalisation generally makes queries easier. People setting up
local instances just run the schema and import the data by script, so the
number of tables is not relevant.
pages.urlShort can be dropped forthwith. A 255-character wide index on
pages.url is sufficient, though it is probably better to go from text to
varchar. Hash values are not needed. requests.url probably has to stay a blob
because query strings can be almost unlimited. I'm not sure what
requests.urlShort is for, except perhaps domain and protocol analysis, both of
which would be better normalised.
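If the column stays text, a prefix index is the alternative; a hedged sketch (the index name is illustrative):
-- drop the redundant column and index a 255-character prefix of url directly
ALTER TABLE pages DROP COLUMN urlShort;
ALTER TABLE pages ADD INDEX url_prefix (url(255));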
While there is potential for normalising pages, the elephant in the room is
requests, which cannot be significantly normalised. I agree with your decision
to include run statistics in the pages table; I suppose this might be
considered denormalisation if requests were anything but a log. While data
mining on requests is possible, creating more derived tables or views probably
makes more sense.
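One possible shape for such a derived table, assuming hypothetical requests columns (`pageid`, `respSize`) purely for illustration:
-- a per-page rollup that ad-hoc queries can hit instead of the raw log
CREATE TABLE request_summary AS
SELECT pageid,
  count(*) AS reqs,
  sum(respSize) AS bytes
FROM requests
GROUP BY pageid;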
Original comment by charlie....@clark-consulting.eu
on 10 Jan 2013 at 9:13
Original issue reported on code.google.com by
stevesou...@gmail.com
on 18 Nov 2010 at 12:02