Investigate differences between HIBP site and Firefox Monitor scan

pdehaan commented 6 years ago

If I Ctrl+F the https://haveibeenpwned.com/account/test@example.com page for "Compromised data:", I get 61 results, matching the page's "61 breached sites" (plus 48 pastes).

If I scan the monitor.firefox.com site for test@example.com, and search for "Compromised data:", I get 68 results (which is 7 more than the HIBP site).

lesleyjanenorton commented 6 years ago

Interesting! I managed to not even notice the HIBP 'pastes' before. Was there a decision at some point not to cover or address pastes? I watched a few of the user tests yesterday and at least one of the testers said they were confused about some of the breaches associated with their email -- I wonder if perhaps we are capturing pastes and calling them breaches?

lesleyjanenorton commented 6 years ago

Either way, not sure what is causing the discrepancy in reported breaches/pastes. Will do some digging.

lesleyjanenorton commented 6 years ago

curiouser and curiouser...

Missing from my localhost instance when scanning 'test@example.com':

2844Breaches
8tracks
CashCrate
Ticketfly
TrickSpamBotnet
MailRu

Also of note, the order is in some cases totally different.

pdehaan commented 6 years ago

I'm not proud of this, but it may work as a starting point:

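const got = require("got");
const { JSDOM } = require("jsdom");

// Scrape the breach names from the HIBP account page, using the logo file names.
async function hibp() {
  const dom = await JSDOM.fromURL("https://haveibeenpwned.com/account/test@example.com", {});
  const images = dom.window.document.querySelectorAll("div.pwnedWebsite .pwnLogo");
  // Reduce each logo URL to its bare file name, e.g. ".../Zoosk.png" -> "Zoosk".
  return Array.from(images).map(img => img.src.replace(/.*\/(.*?)\.(svg|png)$/i, "$1"));
}

// Submit a scan to Firefox Monitor and scrape the breach names from the results page.
async function firefoxMonitor() {
  const options = {
    form: true,
    // Presumably the SHA-1 hash of test@example.com, in uppercase hex.
    body: {emailHash:"567159D622FFBB50B11B0EFD307BE358624A26EE"}
  };
  const res = await got.post("https://monitor.firefox.com/scan", options);
  const dom = new JSDOM(res.body.toString());
  const images = dom.window.document.querySelectorAll(".image-wrap img");
  // Logo paths here are relative, e.g. "img/logos/Zoosk.png" -> "Zoosk".
  return Array.from(images).map(img => img.src.replace(/^img\/logos\/(.*?)\.(svg|png)$/i, "$1"));
}

async function main() {
  const _hibp = await hibp();
  const _monitor = await firefoxMonitor();

  // Names present in one list but missing from the other.
  const hibpNotMonitor = _hibp.filter(name => !_monitor.includes(name));
  const monitorNotHIBP = _monitor.filter(name => !_hibp.includes(name));

  console.log("HIBP, but not Monitor:", hibpNotMonitor.join(", "));
  console.log("Monitor, but not HIBP:", monitorNotHIBP.join(", "));
}

// Report network or parsing failures instead of leaving an unhandled rejection.
main().catch(err => {
  console.error(err);
  process.exitCode = 1;
});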

Output:

$ node index

HIBP, but not Monitor: Gaadi, Yatra
Monitor, but not HIBP: AshleyMadison, Badoo, Fling, FreedomHostingII, JustDate, Mate1, TheFappening, VTightGel, Zoosk
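For anyone repeating the comparison with a different address, here is a minimal sketch of how an emailHash value like the one in the script could be produced. It assumes the value is a plain SHA-1 of the address in uppercase hex, which the thread does not confirm; `emailToHash` is a hypothetical helper name, and any normalization Monitor may apply (trimming, lowercasing) is not covered.

```javascript
const crypto = require("crypto");

// Hypothetical helper: SHA-1 digest of the address as uppercase hex.
// Assumption: this matches the format the /scan endpoint expects.
function emailToHash(email) {
  return crypto.createHash("sha1").update(email).digest("hex").toUpperCase();
}

console.log(emailToHash("test@example.com"));
```

If the assumption holds, the printed value should match the emailHash used in the script.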

But per our Slack conversations this morning, the entries in the "Monitor, but not HIBP" list above all seem to correspond to the "IsSensitive": true results...

$ curl https://haveibeenpwned.com/api/v2/breaches | jq '.[] | select(.IsSensitive==true) | .Title'

"Adult Friend Finder"
"Ashley Madison"
"Badoo"
"Beautiful People"
"Bestialitysextaboo"
"Brazzers"
"CrimeAgency vBulletin Hacks"
"Eroticy"
"Fling"
"Florida Virtual School"
"Freedom Hosting II"
"Fridae"
"Fur Affinity"
"HongFire"
"Justdate.com"
"Mate1.com"
"Muslim Match"
"Naughty America"
"Non Nude Girls"
"Rosebutt Board"
"The Candid Board"
"The Fappening"
"V-Tight Gel"
"xHamster"
"YouPorn"
"Zoosk"
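The correspondence could also be checked programmatically rather than by eye. The sketch below splits a list of scraped names by the `IsSensitive` flag of the HIBP breach model. `splitBySensitivity` is a hypothetical helper, `sampleBreaches` is a small hand-written sample rather than real API output, and it assumes the scraped logo names line up with the `Name` field (the jq output above prints `Title`, which can differ, e.g. "Justdate.com" vs. "JustDate").

```javascript
// Hand-written sample in the shape of the HIBP breach model (not real API output).
const sampleBreaches = [
  { Name: "AshleyMadison", Title: "Ashley Madison", IsSensitive: true },
  { Name: "Zoosk", Title: "Zoosk", IsSensitive: true },
  { Name: "Yatra", Title: "Yatra", IsSensitive: false },
];

// Hypothetical helper: partition names by whether the matching breach is flagged sensitive.
function splitBySensitivity(names, breaches) {
  const sensitive = new Set(
    breaches.filter(b => b.IsSensitive).map(b => b.Name)
  );
  return {
    sensitive: names.filter(name => sensitive.has(name)),
    other: names.filter(name => !sensitive.has(name)),
  };
}

const result = splitBySensitivity(["AshleyMadison", "Zoosk", "Yatra"], sampleBreaches);
console.log("Sensitive:", result.sensitive.join(", "));
console.log("Other:", result.other.join(", "));
```

Feeding it the real "Monitor, but not HIBP" list and the full breaches response would show whether anything lands in `other`.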

groovecoder commented 6 years ago

Am I right that this was caused by the sensitive breaches and spam lists? If so, that was fixed in https://github.com/mozilla/blurts-server/pull/235, right?

nhnt11 commented 6 years ago

Pretty sure this is now understood and we are deliberate in the way we filter breaches now. I'd close this, but maybe we should wait for @lesleyjanenorton or @pdehaan to confirm.

pdehaan commented 6 years ago

Yeah, we're now deliberate in showing different results versus the HIBP site, since we're not showing the unverified breaches (see https://github.com/mozilla/blurts-server/pull/235#issuecomment-406045403 for more context).

I think we're OK to close this issue, unless somebody still has specific concerns.
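As a rough illustration of what "deliberate" filtering could look like, here is a sketch that filters breach objects on the `IsVerified`, `IsSensitive`, and `IsSpamList` flags from the HIBP breach model. This is not the blurts-server implementation: `filterBreaches`, its option names, and the sample entries are all hypothetical, and which flags Monitor actually filters on is described in the linked PR, not here.

```javascript
// Hypothetical filter: hide sensitive, unverified, and spam-list entries unless opted in.
function filterBreaches(breaches, options = {}) {
  const {
    includeSensitive = false,
    includeUnverified = false,
    includeSpamLists = false,
  } = options;
  return breaches.filter(b =>
    (includeSensitive || !b.IsSensitive) &&
    (includeUnverified || b.IsVerified) &&
    (includeSpamLists || !b.IsSpamList)
  );
}

// Made-up entries, one per flag combination of interest.
const sample = [
  { Name: "VerifiedBreach", IsVerified: true, IsSensitive: false, IsSpamList: false },
  { Name: "SensitiveBreach", IsVerified: true, IsSensitive: true, IsSpamList: false },
  { Name: "UnverifiedBreach", IsVerified: false, IsSensitive: false, IsSpamList: false },
  { Name: "SpamListEntry", IsVerified: true, IsSensitive: false, IsSpamList: true },
];

console.log(filterBreaches(sample).map(b => b.Name));
console.log(filterBreaches(sample, { includeSensitive: true }).map(b => b.Name));
```

Keeping each flag as a separate option makes it easy to reproduce either site's view when comparing results.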

mozilla / blurts-server

Investigate differences between HIBP site and Firefox Monitor scan #233
