This moves much of the chunking logic originally implemented for `sourceupdate.py` to a chunking iterator on the `sources_db.Sources` class, where it is available to the other run modes that will get chunking soon. The new methods have unit test coverage.
The way that the chunking iterator stores its state is a bit ham-fisted. It works, but I have a feeling that there is a more elegant, thread-safe way. I'm open to suggestions.
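For illustration, here is a minimal sketch of one way such a chunking iterator could keep its position behind a lock. The names (`Sources`, `rows`, `chunk_size`, `iter_chunks`) are assumptions for the example, not necessarily the actual `sources_db.Sources` API:

```python
import threading


class Sources:
    """Hypothetical stand-in for sources_db.Sources holding host rows."""

    def __init__(self, rows, chunk_size=1000):
        self.rows = rows
        self.chunk_size = chunk_size
        self._lock = threading.Lock()  # guards the shared cursor below
        self._pos = 0

    def iter_chunks(self):
        """Yield successive, non-overlapping chunks of rows.

        The cursor is advanced under the lock, so several consumers can
        each hold their own generator without ever receiving the same chunk.
        """
        while True:
            with self._lock:
                if self._pos >= len(self.rows):
                    return
                start = self._pos
                self._pos += self.chunk_size
            yield self.rows[start:start + self.chunk_size]
```

Keeping the cursor on the instance is what makes the state feel awkward: a second full iteration depends on resetting shared state. One alternative would be to hand out an object that owns its own position, which may be the more elegant shape the description is asking about.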
How much testing have you done on this? Have you kicked off a full run, to make sure it completes to the end without error? I see no obvious issues but that would be our first indicator whether this works as well as what we have now. If it runs fine, I'm good with landing this.
> How much testing have you done on this? Have you kicked off a full run, to make sure it completes to the end without error?
I tried a number of different sizes; the maximum I fully ran was 20k, I believe. I kicked off a full 500k run to verify that it selects the expected chunking parameters, but didn't let it finish. I think it is safe to infer from this that it works for larger numbers, too.
There is also unit testing.
I've begun a full update of the top sites list, so once it completes, I'll have more confidence and can approve this.
I am still testing this, but ran into some oddness.
- First attempt: `srcupdate` with the default 100k hosts was fine.
- Second attempt: I passed in `-l 500000` and it ran to completion, but said that it couldn't build the full 500k list and only found around 460k secure hosts.
- Third attempt: I passed in `-l 600000` and the tool seemed to be hung, perhaps half way through the host list.

I'm still playing with it to try to learn more. However, at a minimum I'd expect it to run to the end of the top sites list. I'd also expect it to produce a list of around 540k secure hosts, which is what we currently have today.
I'll update when I get a few more runs completed.
> Second attempt: I passed in `-l 500000` and it ran to completion, but said that it couldn't build the full 500k list and only found around 460k secure hosts.
Right. That's to be expected when the error rate is more than 50%: about 460k secure hosts out of the full top-1M list is an error rate of roughly 54%. The conclusion is that either there are simply not enough SSL-enabled hosts in the top 1M, or the network connection of your test machine is somewhat unstable.
> Third attempt: I passed in `-l 600000` and the tool seemed to be hung, perhaps half way through the host list.
That's odd. The result and behavior should be precisely the same as with `-l 500000`. Did you use `--debug` by any chance?
Btw, you can pass `-n 1` or `-n 2` to limit retries. With an error rate of 50%+, the speed-up will be substantial.
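For example (a hypothetical invocation; the exact command prefix may differ from your setup), a run like `tlscanary srcupdate -l 500000 -n 1` would combine the 500k limit with the lower retry count.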
> I'd also expect it to produce a list of around 540k secure hosts, which is what we currently have today.
I'm not sure that we can realistically expect a more or less identical result here. There are a couple of things to consider, I think.
Excellent points, thank you.
I ran an update with a limit of 500k, and as expected, got 443k secure sites.
The full update took almost 20 hours. I don't remember this taking so long before.
Otherwise, this is OK, but I'm going to run `srcupdate` with the release branch of TLS Canary and see if that takes the same amount of time.
I ran the same update with the current, non-chunked Canary, and the time is not much different.
Approved for landing.
This is an incremental change towards #44. Should be RTL.