servo / servo-warc-tests

Test Servo on Web Archive snapshots of real web sites
Mozilla Public License 2.0

Tracking issue for adding web archives for the Alexa top 25 sites #37

Closed by asajeffrey 5 years ago

asajeffrey commented 6 years ago

The Alexa top web sites are at: https://www.alexa.com/topsites. Once #8 lands, we'll be tracking the top 10; it would be nice to have the top 25. The missing ones are:

There are instructions for playing and recording web archives for Servo at https://github.com/servo/servo-warc-tests/blob/master/README.md.

The list of archives used for Servo performance testing is at https://github.com/servo/servo-warc-tests/blob/master/ARCHIVES

Please help out by recording web archives for us!

You can do that by going to one of the issues and assigning yourself.

tigercosmos commented 6 years ago

I think it would be better to write an automated script to fetch archives for the Alexa top 500.

asajeffrey commented 6 years ago

The problem is that there's a bit of hand-curating involved: I check the archives by eye before committing them. At some point we might want to scale up to full automation, but I'm not sure we're quite there yet.

tigercosmos commented 6 years ago

Moz's top 500 list would also be a good choice.

tigercosmos commented 6 years ago

Alexa Internet creates a list of the top 1,000,000 sites on the web. It's updated daily. http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
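A rough sketch of how that list could be used to pick which sites still need archives, while leaving the recording and by-eye checking as manual steps. It assumes each row of the CSV is `rank,domain` and that the zip contains a single member named `top-1m.csv`; both are assumptions about the file, not something confirmed in this thread. The function names `top_domains` and `read_top_1m` and the `skip` parameter are hypothetical, made up for illustration.

```python
import csv
import io
import zipfile


def top_domains(csv_text, n=25, skip=()):
    """Return the first n domains from `rank,domain` CSV text.

    Domains listed in `skip` (for example, sites that already have
    an archive) are left out and do not count towards n.
    """
    skip = set(skip)
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(domains) >= n:
            break
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        domain = row[1].strip()
        if domain in skip:
            continue
        domains.append(domain)
    return domains


def read_top_1m(zip_path):
    """Read the CSV text from a previously downloaded top-1m.csv.zip.

    Assumes the archive member is named "top-1m.csv".
    """
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read("top-1m.csv").decode("utf-8")
```

With the zip saved locally, something like `top_domains(read_top_1m("top-1m.csv.zip"), n=25, skip=already_archived)` would give the next 25 candidates, where `already_archived` is whatever set of domains you have collected by hand.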

data-pup commented 6 years ago

Hello! I can work on some of these. I can start with #40 and #42.

pradyunsg commented 5 years ago

FWIW, PRs have been submitted for all the domains listed here.

asajeffrey commented 5 years ago

Yes, I built up a bit of a backlog there, but this is now done. Thanks everyone!