mozilla / prowac

INACTIVE - http://mzl.la/ghe-archive - Progressive Web App Crawler
Mozilla Public License 2.0
Consider open crawl data #58

Closed digitarald closed 2 years ago

digitarald commented 8 years ago

Skimmed over both to assess what kind of data they could provide:

https://commoncrawl.org/

http://httparchive.org/

We need to do some more research, but httparchive might be a good alternative to having Prowac crawl sites itself.

digitarald commented 8 years ago

@zalun can you take a look as well?

zalun commented 8 years ago

httparchive

sample queries: https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/

digitarald commented 8 years ago

httparchive

… The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized.

By design, it contains only request/response details and WebPagetest results.

Looks like neither is a good fit; they would not even give us an edge in priming the database.

miketaylr commented 8 years ago

There's also http://webdevdata.org/, but it's just HTTP requests and HTML responses for home pages.
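To illustrate what home-page HTML alone can reveal, here is a minimal sketch that scans one saved HTML response for two web-app hints: a `<link rel="manifest">` and the iOS `apple-mobile-web-app-capable` meta tag. The names `PwaHintParser` and `find_pwa_hints` are hypothetical, and the choice of signals is an assumption; this is not how Prowac or webdevdata process pages.

```python
from html.parser import HTMLParser


class PwaHintParser(HTMLParser):
    """Collects hints that a page advertises web-app capabilities."""

    def __init__(self):
        super().__init__()
        self.manifest_href = None   # href of the first <link rel="manifest">
        self.ios_capable = False    # apple-mobile-web-app-capable == "yes"

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link":
            rels = attr.get("rel", "").lower().split()
            if "manifest" in rels and self.manifest_href is None:
                self.manifest_href = attr.get("href") or None
        elif tag == "meta":
            name = attr.get("name", "").lower()
            content = attr.get("content", "").lower()
            if name == "apple-mobile-web-app-capable" and content == "yes":
                self.ios_capable = True


def find_pwa_hints(html):
    """Return the hints found in one HTML document (hypothetical helper)."""
    parser = PwaHintParser()
    parser.feed(html)
    parser.close()
    return {"manifest": parser.manifest_href, "ios_capable": parser.ios_capable}
```

A scan like this only says whether a page declares a manifest; it cannot check the manifest contents or whether a service worker is registered, which would still require fetching the site.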

digitarald commented 8 years ago

On top of that, webdevdata.org has only about 8000 entries.

marcoscaceres commented 8 years ago

On top of that, webdevdata.org has only about 8000 entries.

That's ~80,000. The law of diminishing returns kicks in pretty quickly after that. Admittedly, our (webdevdata's) dataset is grossly out of date now, but a new run could be performed. I previously produced the following report on iOS "PWAs" from that old dataset.