This moves much of the chunking logic originally implemented for `sourceupdate.py` to a chunking iterator on the `sources_db.Sources` class, where it is available to the other run modes that will get chunking soon. The new methods have unit test coverage.
The way that the chunking iterator stores its state is a bit ham-fisted. It works, but I have a feeling that there is a more elegant, thread-safe way. I'm open to suggestions.
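For illustration, here is a minimal sketch of one way such a chunking iterator could keep its position behind a lock. The names (`Sources`, `rows`, `chunk_size`, `iter_chunks`) are assumptions for the example, not necessarily the actual `sources_db.Sources` API:

```python
import threading


class Sources:
    """Hypothetical stand-in for sources_db.Sources holding host rows."""

    def __init__(self, rows, chunk_size=1000):
        self.rows = rows
        self.chunk_size = chunk_size
        self._lock = threading.Lock()  # guards the shared cursor below
        self._pos = 0

    def iter_chunks(self):
        """Yield successive, non-overlapping chunks of rows.

        The cursor is advanced under the lock, so several consumers can
        each hold their own generator without ever receiving the same chunk.
        """
        while True:
            with self._lock:
                if self._pos >= len(self.rows):
                    return
                start = self._pos
                self._pos += self.chunk_size
            yield self.rows[start:start + self.chunk_size]
```

Keeping the cursor on the instance is what makes the state feel awkward: a second full iteration depends on resetting shared state. One alternative would be to hand out an object that owns its own position, which may be the more elegant shape the description is asking about.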
How much testing have you done on this? Have you kicked off a full run, to make sure it completes to the end without error? I see no obvious issues but that would be our first indicator whether this works as well as what we have now. If it runs fine, I'm good with landing this.
> How much testing have you done on this? Have you kicked off a full run, to make sure it completes to the end without error?
I tried a number of different sizes; the maximum I fully ran was 20k, I believe. I kicked off a full 500k run to verify that it selects the expected chunking parameters, but didn't let it finish. I think it is safe to infer from this that it works for larger numbers, too.
There is also unit testing.
I've begun a full update of the top sites list, so once it completes, I'll have more confidence and can approve this.
I am still testing this, but ran into some oddness.
- First attempt: `srcupdate` with the default 100k hosts was fine.
- Second attempt: I passed in `-l 500000` and it ran to completion, but said that it couldn't build the full 500k list and only found around 460k secure hosts.
- Third attempt: I passed in `-l 600000` and the tool seemed to be hung, perhaps half way through the host list.

I'm still playing with it to try to learn more. However, at a minimum I'd expect it to run to the end of the top sites list. I'd also expect it to produce a list of around 540k secure hosts, which is what we currently have today.
I'll update when I get a few more runs completed.
> Second attempt: I passed in `-l 500000` and it ran to completion, but said that it couldn't build the full 500k list and only found around 460k secure hosts.
Right. That's to be expected when the error rate is more than 50%: about 460k secure hosts out of the full top-1M list is an error rate of roughly 54%. The conclusion is that either there are simply not enough SSL-enabled hosts in the top 1M, or the network connection of your test machine is somewhat unstable.
> Third attempt: I passed in `-l 600000` and the tool seemed to be hung, perhaps half way through the host list.
That's odd. The result and behavior should be precisely the same as with `-l 500000`. Did you use `--debug` by any chance?
Btw, you can pass `-n 1` or `-n 2` to limit retries. With an error rate of 50%+, the speed-up will be substantial.
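For example (a hypothetical invocation; the exact command prefix may differ from your setup), a run like `tlscanary srcupdate -l 500000 -n 1` would combine the 500k limit with the lower retry count.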
> I'd also expect it to produce a list of around 540k secure hosts, which is what we currently have today.
I'm not sure that we can realistically expect a more or less identical result here. There are a couple of things to consider, I think.
Excellent points, thank you.
I ran an update with a limit of 500k, and as expected, got 443k secure sites.
The full update took almost 20 hours. I don't remember this taking so long before.
Otherwise, this is OK, but I'm going to run `srcupdate` with the release branch of TLS Canary and see if that takes the same amount of time.
I ran the same update with the current, non-chunked Canary, and the time is not much different.
Approved for landing.
This is an incremental change towards #44. Should be RTL.