mozilla / fx-private-relay

Keep your email safe from hackers and trackers. Make an email alias with 1 click, and keep your address to yourself.
https://relay.firefox.com

Scrape top X Alexa sites? #654

Open pdehaan opened 3 years ago

pdehaan commented 3 years ago

Wondering if there is a clever way to use Selenium/WebDriver to visit the sites on the https://www.alexa.com/topsites list, take screenshots with and without the add-on installed, and see how the images differ.

The one catch seems to be that we'd have to visit each site on the Alexa list and find its signup form so we can see the Relay icon in action. Manually going to each Alexa top site and finding the signup page where it prompts you for an email address would be pretty tedious.
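One way to cut down the manual hunting might be a rough pre-filter that scans a page's HTML for inputs that look like email fields. The sketch below is only a heuristic under stated assumptions: `findEmailInputs` is a hypothetical helper (not part of Relay or any library), it uses a regex scan rather than a real HTML parser, and it treats a missing `type` as `text`, since some sites use plain text inputs for email addresses.

```js
// Hypothetical helper: return the <input> tags in an HTML string that look
// like email fields. A regex scan is a rough heuristic, not an HTML parser,
// so attribute values containing ">" or unusual quoting can fool it.
function findEmailInputs(html) {
  const inputs = html.match(/<input\b[^>]*>/gi) || [];

  return inputs.filter((tag) => {
    // Read one attribute value (double-quoted, single-quoted, or unquoted).
    const attr = (name) => {
      const pattern = new RegExp(
        `\\s${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)'|([^\\s>]+))`,
        "i"
      );
      const m = tag.match(pattern);
      return m ? (m[1] || m[2] || m[3] || "").toLowerCase() : "";
    };

    const type = attr("type") || "text"; // inputs default to type=text
    if (type === "email") return true;
    if (type !== "text") return false;

    // For plain text inputs, fall back to hints in other attributes.
    return ["name", "id", "placeholder", "autocomplete"].some((n) =>
      attr(n).includes("email")
    );
  });
}

// Usage: the first input is kept, the password input is skipped.
const sampleHtml =
  '<input type="text" name="login_email"><input type="password" name="pw">';
console.log(findEmailInputs(sampleHtml));
```

A page where this returns an empty list is not necessarily free of signup forms (the form may be rendered by script or live on another URL), so it would only narrow the list of pages worth checking by hand.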

pdehaan commented 3 years ago

```js
const alexa = require("alexa-top-sites");

main();

// Print a Markdown table skeleton with one row per Alexa top site.
async function main() {
  try {
    let { sites } = await alexa.global();
    // Reduce each site URL to its hostname.
    sites = sites.map((site) => new URL(site).hostname);

    console.log("| # | HOSTNAME | LOGIN | NOTES |\n|---|----------|-------|-------|");
    sites
      // .sort() // Sort by hostname to group similar domains.
      .forEach((site, idx) => console.log(`| ${idx + 1}. | ${site} |  |  |`));
  } catch (err) {
    console.error(err);
    process.exitCode = 1;
  }
}
```
| # | HOSTNAME | LOGIN | NOTES |
|---|----------|-------|-------|
| 1. | google.com | ?? |  |
| 2. | youtube.com |  | see [1] above |
| 3. | baidu.com | ?? |  |
| 4. | qq.com | ?? |  |
| 5. | sohu.com | ?? |  |
| 6. | facebook.com | ?? |  |
| 7. | taobao.com | ?? |  |
| 9. | 360.cn | ?? |  |
| 10. | jd.com | ?? |  |
| 11. | yahoo.com | login |  |
| 12. | amazon.com | login |  |
| 13. | wikipedia.org | login |  |
| 14. | sina.com.cn | ?? |  |
| 15. | weibo.com | ?? |  |
| 16. | reddit.com | login |  |
| 17. | live.com | login |  |
| 18. | zoom.us | login |  |
| 19. | netflix.com | login |  |
| 20. | xinhuanet.com | ?? |  |
| 21. | okezone.com | -- | input type=text |
| 22. | microsoft.com | login |  |
| 23. | instagram.com | ?? |  |
| 24. | vk.com | ?? |  |
| 25. | office.com | login |  |
| 26. | alipay.com | ?? |  |
| 27. | myshopify.com | login |  |
| 28. | csdn.net | ?? |  |
| 29. | yahoo.co.jp |  | see [11] above |
| 30. | bongacams.com | -- | input type=text |
| 31. | twitch.tv | ?? |  |
| 32. | panda.tv | ?? |  |
| 33. | zhanqi.tv | ?? |  |
| 34. | google.com.hk | -- | see [1] above |
| 35. | bing.com | login |  |
| 36. | naver.com | login |  |
| 37. | aliexpress.com | ?? |  |
| 38. | ebay.com | ?? |  |
| 39. | china.com.cn | ?? |  |
| 40. | microsoftonline.com | ?? |  |
| 41. | amazon.in |  | see [12] above |
| 42. | tianya.cn | ?? |  |
| 43. | stackoverflow.com | login |  |
| 44. | twitter.com | ?? |  |
| 45. | tribunnews.com | ?? |  |
| 46. | amazon.co.jp | login |  |
| 47. | google.co.in |  | see [1] above |
| 48. | chaturbate.com | login |  |

pdehaan commented 3 years ago

The always excellent @jrbenny35 helped me debug the https://github.com/pdehaan/wdio-firefox-addon-test proof-of-concept repo, which uses webdriver.io to launch Firefox w/ a signed XPI from AMO.

Still need to do some cleanup and maybe see if I can switch from geckodriver to a Firefox Nightly so I don't have to use signed XPIs.

birdsarah commented 3 years ago

We gathered ~400 HTML samples from top sites and labeled the email fields for the ML email field detection that's now in Relay.

Some or all of those samples could perhaps be used. What are you trying to accomplish with the screenshot diff?

pdehaan commented 3 years ago

> What are you trying to accomplish with the screenshot diff?

I think we had some early issues w/ the add-on icon appearing in weird places, or being larger than expected. I was wondering whether we could do a screenshot diff of the pre-add-on and post-add-on pages and check that the only difference is the new icon, rather than something else. But I haven't looked at image diffing tools lately, and they were a bit of a hot mess last time I tried.

Maybe we don't even need the diffing though. Maybe just a screenshot of a page with the add-on installed and a quick manual look to make sure it doesn't look offensively weird™.
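If the diffing route were ever revisited, the "only difference is the new icon" check might not need a full image diffing tool. A minimal sketch, assuming both screenshots have already been decoded into same-sized raw RGBA buffers (the decoding step is not shown, and `diffBoundingBox` and `diffIsInside` are hypothetical names, not from any library): compute the bounding box of every changed pixel and test whether it stays inside the area where the icon is expected.

```js
// Hypothetical helper: bounding box of all pixels that differ between two
// same-sized RGBA buffers (4 bytes per pixel, row-major order).
// Returns null when no channel differs by more than `tolerance`.
function diffBoundingBox(before, after, width, height, tolerance = 0) {
  if (before.length !== width * height * 4 || after.length !== before.length) {
    throw new Error("Both buffers must be width * height * 4 bytes long");
  }
  let minX = width, minY = height, maxX = -1, maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      let changed = false;
      for (let c = 0; c < 4; c++) {
        if (Math.abs(before[i + c] - after[i + c]) > tolerance) changed = true;
      }
      if (changed) {
        minX = Math.min(minX, x);
        minY = Math.min(minY, y);
        maxX = Math.max(maxX, x);
        maxY = Math.max(maxY, y);
      }
    }
  }
  if (maxX < 0) return null;
  return { x: minX, y: minY, width: maxX - minX + 1, height: maxY - minY + 1 };
}

// True when the changed area (if any) lies entirely inside `region`.
function diffIsInside(box, region) {
  return (
    box === null ||
    (box.x >= region.x &&
      box.y >= region.y &&
      box.x + box.width <= region.x + region.width &&
      box.y + box.height <= region.y + region.height)
  );
}

// Usage with a tiny 4x3 "screenshot": change one pixel at (2, 1).
const shotWidth = 4, shotHeight = 3;
const shotBefore = new Uint8ClampedArray(shotWidth * shotHeight * 4);
const shotAfter = shotBefore.slice();
shotAfter[(1 * shotWidth + 2) * 4] = 255; // red channel of pixel (2, 1)
const changedBox = diffBoundingBox(shotBefore, shotAfter, shotWidth, shotHeight);
console.log(changedBox, diffIsInside(changedBox, { x: 2, y: 0, width: 2, height: 3 }));
```

A non-zero `tolerance` would probably be needed in practice, since anti-aliasing and font rendering can shift pixel values slightly between runs. Pages with ads or animations would still produce noisy diffs, which is one more argument for the quick manual look.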

pdehaan commented 3 years ago

Ref: https://github.com/mozilla/fx-private-relay/issues/571 shows some early (fixed?) issues w/ Google and Facebook and LastPass. Or https://github.com/mozilla/fx-private-relay/issues/352 with Norton Password Manager.

Of course, testing compatibility w/ other installed add-ons is going to be a lot trickier. But maybe we could adapt the wdio screenshot thing above [that I completely forgot about] to preinstall some other Firefox add-ons and regenerate a new set of images to verify.

But probably the most important task would be for somebody to identify a list of potential AMO add-ons/categories which might cause mid-air collisions with our add-on (if we're both trying to inject icons into the same physical space and time).
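Once a candidate list exists, the "same physical space" part could be checked mechanically. A minimal sketch, assuming a test can obtain each injected icon's on-page rectangle as `{ x, y, width, height }` (for example, the shape that `getBoundingClientRect()` returns in the page); `rectsOverlap` is a hypothetical helper and the coordinates below are made up for illustration:

```js
// Hypothetical helper: true when two rectangles share any area.
// Rectangles that only touch along an edge do not count as overlapping.
function rectsOverlap(a, b) {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Usage: two 16x16 icons, the second offset by 8px in both directions.
const relayIcon = { x: 300, y: 100, width: 16, height: 16 };
const otherIcon = { x: 308, y: 108, width: 16, height: 16 };
console.log(rectsOverlap(relayIcon, otherIcon));
```

This only catches geometric collisions. Two add-ons fighting over the same input's padding or background styles would need a different check.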