Normalize mentions of the October data set

Yes, absolutely.

We need to say whether the sample is probabilistic or non-probabilistic. It's non-probabilistic, because we don't know how many web pages there are on the Web, so we cannot generalize from it. However, the sample size (n = 78k) is more than adequate for an exploratory analysis (cf. [1]).
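As a rough yardstick for "more than adequate", here is a minimal sketch of Cochran's sample-size formula for a large population, n0 = z^2 * p * (1 - p) / e^2. The inputs are my assumptions, not figures from the data set: 95% confidence (z = 1.96), maximum variability (p = 0.5), and a 1% margin of error. The formula assumes simple random sampling, which this sample is not, so it only gives a sense of scale and does not license generalization.

```python
import math


def cochran_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.01) -> int:
    """Cochran's formula for a large population: n0 = z^2 * p * (1 - p) / e^2.

    z: z-score for the confidence level (1.96 for 95%; assumed here)
    p: estimated proportion (0.5 is the most conservative choice; assumed)
    e: desired margin of error (0.01 = 1%; assumed)
    """
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


# Sample size mentioned above (n = 78k).
sample_size = 78_000

n0 = cochran_sample_size()
print(f"Cochran n0 = {n0}, sample = {sample_size}, ratio = {sample_size / n0:.1f}x")
```

Under these assumptions n0 comes out on the order of 9,600, so 78k is several times larger than what the formula would ask for.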

Selection bias: the pages were selected by Alexa's ranking algorithm, so we need to understand how they ended up with this list and whether it's representative of "the world" (i.e., are all countries represented in the set, etc.). There may also be language bias. We don't need to investigate this, just acknowledge it.

We know some of the data may be bad if processed with grep. I think that's about it, or at least good enough to start.

[1] Reference: Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: John Wiley & Sons.
