w3c-webmob / installable-webapps

Use cases and requirements for installable web apps

Strengthen research with google API #26

Open marcoscaceres opened 10 years ago

marcoscaceres commented 10 years ago

We can probably draw on the following to strengthen the findings. It's not as accurate, but it covers a much larger data set, so it could be used to back up what we find.

http://git.macropus.org/meta-tag-usage/

marcoscaceres commented 10 years ago

(it's also not verifiable)

ernesto-jimenez commented 10 years ago

Wouldn't it be better to limit ourselves to more accurate and verifiable sources?

I would rather write a quick crawler that downloads a website, extracts the key information we want, and discards the HTML. That would save a lot of space and could be done easily (rough sketch at the end of this comment).

The uncompressed October dataset is 5.9GB, while all the CSVs I've generated for webdevdata-reports are 567MB:

```
~/webmob-reports% du -sh webdevdata-latest/
5.9G        webdevdata-latest/
~/webmob-reports% du -sh csv_out
567M    csv_out
~/webmob-reports% wc -l csv_out/*
 1933179 csv_out/all_tags.csv
  527432 csv_out/link_tags.csv
  275799 csv_out/link_tags_stylesheet.csv
  125825 csv_out/link_tags_stylesheet_media.csv
  326287 csv_out/meta_tags.csv
    1816 csv_out/meta_tags_application_names.csv
   15926 csv_out/meta_tags_viewport.csv
  641462 csv_out/script_tags.csv
 3847726 total
```
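
To make the crawler idea concrete, here's a minimal sketch (Python, standard library only) of what I have in mind: fetch each home page, keep only the `<meta>`/`<link>` tags, and write them straight to CSV so the raw HTML never has to be stored. The file names and function names are just placeholders, not anything we've agreed on.

```python
"""Minimal sketch of a targeted crawler: keep only <meta>/<link> tags,
discard the HTML. URL list format and output layout are assumptions."""
import csv
import sys
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TagExtractor(HTMLParser):
    """Collects <meta> and <link> start tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag in ("meta", "link"):
            self.rows.append((tag, dict(attrs)))


def crawl(url_file, out_csv):
    """Read one URL per line from url_file, write extracted tags to out_csv."""
    with open(url_file) as urls, open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "tag", "attributes"])
        for url in (line.strip() for line in urls if line.strip()):
            try:
                req = Request(url, headers={"User-Agent": "webmob-crawler"})
                html = urlopen(req, timeout=10).read().decode("utf-8", "replace")
            except Exception as exc:  # keep going on network/decoding errors
                print(f"skipped {url}: {exc}", file=sys.stderr)
                continue
            parser = TagExtractor()
            parser.feed(html)
            for tag, attrs in parser.rows:
                attr_str = " ".join(f'{k}="{v or ""}"' for k, v in attrs.items())
                writer.writerow([url, tag, attr_str])


if __name__ == "__main__":
    crawl(sys.argv[1], sys.argv[2])  # e.g. crawl("top-sites.txt", "tags.csv")
```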
marcoscaceres commented 10 years ago

On Thursday, November 28, 2013 at 11:44 PM, Ernesto Jiménez wrote:

> Wouldn't it be better to limit ourselves to more accurate and verifiable sources?

I don't think we should limit ourselves. We are able to provide verifiable results, which is great - but as a secondary source that can show use at "web scale", it certainly helps strengthen our argument. It gives an indication of the reach of a given feature beyond our dataset (even if it's unverifiable). Having said that, I strongly agree that we should not use it as a primary source, since we don't know what each search result from Google actually means (though we could look that up).

> I would rather do a quick crawler that downloads a website, extracts the key information we want and discards the HTML. That would save a lot of space and could be done easily. The uncompressed October dataset is 5.9GB while all the CSVs I've generated for webdevdata-reports (https://github.com/ernesto-jimenez/webdevdata-reports) are 567MB: […]

That could be quite an efficient way of doing this. If we know exactly what we are looking for, then we could broaden our search, especially if we could split the task amongst a cluster of computers. We could then easily cover the top 1,000,000 sites if each machine downloaded 100,000 home pages in a very targeted way.
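
To make the split concrete, something as simple as the following would do, assuming a plain-text file of the top 1,000,000 hosts and a round-robin assignment (the file name and the scheme are just assumptions on my part, not a worked-out design):

```python
"""Rough sketch of splitting the top-1,000,000 list across a cluster:
machine i of N takes every Nth URL, so ten machines each end up with
roughly 100,000 home pages."""
from itertools import islice


def shard(url_file, machine_index, num_machines):
    """Yield only the URLs assigned to this machine (round-robin split)."""
    with open(url_file) as urls:
        stripped = (line.strip() for line in urls if line.strip())
        # Take every num_machines-th URL, starting at this machine's offset.
        yield from islice(stripped, machine_index, None, num_machines)


if __name__ == "__main__":
    # e.g. machine 3 of 10 working through a hypothetical top-1m.txt
    for url in shard("top-1m.txt", 3, 10):
        print(url)  # in practice: feed into the crawler sketched earlier
```

Each machine would then feed its share of URLs into the kind of extractor Ernesto describes above, and we'd just concatenate the resulting CSVs.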