qa(torsf): figure out proper configuration to help snowflake devs collect useful data

This issue is about getting feedback from Snowflake developers on torsf. The super brief problem statement is that we're seeing tons of bootstrap issues when running on mobile. We'll consider this issue done when we have discussed the problem with Snowflake developers and figured out the best way to configure torsf in production.

The structure of this issue is the following:

* problem statement
* configurations
* discussion
* measurements
* questions for snowflake developers

(Sadly, I did not manage to compress the information further.)

Problem statement

The torsf experiment bootstraps tor using Snowflake (the logic is at torsf.go#107). We start tor with command line options telling it to use as pluggable transport the ooniprobe client itself listening for SOCKS5 connections on a port (the mechanism is at tor.go#58). The port will forward traffic using Snowflake (the mechanism is at ptx.go#210).

We adopted a torsf configuration where we use rendezvous with the broker and we use a new, temporary tor datadir every time, thus performing a cold bootstrap.

With this configuration, we're having significant bootstrap timeout issues on mobile.

We've seen that changing the configuration makes the bootstrap more likely to succeed. It is unclear whether changing this configuration is leading us to produce useful results, though.

Hence, we need input from Snowflake developers to understand how to proceed.

Configurations

Let us call rendezvous the current configuration because it performs a rendezvous with the broker endpoint URL (we tested both "https://snowflake-broker.torproject.net.global.prod.fastly.net/" and "https://snowflake-broker.torproject.net/", which is the correct one? I suppose the first one for circumvention reasons, but maybe I'm missing something here?). Three other configurations of torsf are possible.

The first alternative configuration is AMP. In this configuration we use the AMP cache instead of the rendezvous.

The second alternative configuration uses the rendezvous and uses a persistent directory for the tor data directory. This means that the first bootstrap is going to be cold. Subsequent bootstraps will have a (sometimes partial) cache of micro-descriptors already stored on the disk. As a result, tor would need to exchange significantly less information in order to bootstrap. Given Snowflake's bandwidth constraints this seems to converge faster (we'll see the data later).

The final alternative configuration uses AMP and a persistent tor data directory.

To recap:

| Name | snowflake mechanism | tor cache on disk |
| --- | --- | --- |
| rendezvous | rendezvous with broker | temporary directory |
| amp | using amp | temporary directory |
| cache | rendezvous with broker | persistent directory |
| amp+cache | using amp | persistent directory |
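To make the two independent knobs explicit, the four configurations can be sketched as data. This is only an illustration: the type and field names (`Config`, `Mechanism`, `PersistentDataDir`) are hypothetical and do not come from the torsf code.

```go
package main

import "fmt"

// Mechanism is a hypothetical name for how the client reaches the broker.
type Mechanism string

const (
	Broker Mechanism = "rendezvous with broker"
	AMP    Mechanism = "using amp"
)

// Config describes one torsf configuration from the recap table.
type Config struct {
	Name              string
	Mechanism         Mechanism
	PersistentDataDir bool // false means a fresh temporary directory each run
}

// allConfigs lists the four combinations of the two knobs.
func allConfigs() []Config {
	return []Config{
		{Name: "rendezvous", Mechanism: Broker, PersistentDataDir: false},
		{Name: "amp", Mechanism: AMP, PersistentDataDir: false},
		{Name: "cache", Mechanism: Broker, PersistentDataDir: true},
		{Name: "amp+cache", Mechanism: AMP, PersistentDataDir: true},
	}
}

func main() {
	for _, c := range allConfigs() {
		fmt.Printf("%-10s mechanism=%q persistent=%v\n", c.Name, c.Mechanism, c.PersistentDataDir)
	}
}
```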

Discussion

The choice of whether to use AMP or the rendezvous may have an impact on the bootstrap time (we'll see measurements soon) and certainly has an implication in terms of censorship circumvention. The cache is most likely if not certainly making the bootstrap faster because tor needs to fetch less data over the (bandwidth constrained?) Snowflake.

The key question however is what are we measuring? Do we want to measure the total time tor takes to bootstrap from scratch when using Snowflake? Do we want to measure whether tor would bootstrap with Snowflake given a cache?

When asking internally this question, we were conscious that choosing to use a cache will certainly be a problem in terms of making any statement regarding the bootstrap time.

Measurements

We tested torsf on Desktop and on mobile. The original issue describing the measurements is https://github.com/ooni/probe/issues/1917. In this issue I'll try to just summarize the most relevant results of analyzing the measurements.

Our repeated desktop measurements (40 repetitions) are summarized in the following table:

| Configuration | Median bootstrap time |
| --- | --- |
| rendezvous | 57.0 s |
| cache | 7.8 s |
| amp | 95.0 s |
| amp+cache | 9.5 s |

So, I would conclude from this data that the cache really makes a significant difference (of course, once it's filled), while AMP may have slightly worse performance, though its results are still in the domain of "comparable" results.

Mobile measurements, though, are far more problematic. Here's a table with results on Android:

| Configuration | Bootstrap timeout | Number of timeouts | Number of runs |
| --- | --- | --- | --- |
| rendezvous | 600 s | 10 | 10 |
| rendezvous | 900 s | 3 | 4 |

What is interesting, if we read the logcat, is that tor says "Delaying directory fetches: No running bridges". I think this could mean that tor will try continuing the bootstrap at a later time. So, I think that after this message the bootstrap should be considered failed. Now, the obvious question to ask to Snowflake developers is whether this assumption is true.

Interestingly, with caching enabled, I got these results:

| Run | Bootstrap time |
| --- | --- |
| 1 | 10.9 s |
| 2 | 17.9 s |
| 3 | 12.6 s |
| 4 | 4.8 s |
| 5 | 2.5 s |
| 6 | 24.2 s |
| 7 | 16.9 s |
| 8 | 2.47 s |
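For comparison with the desktop medians, the eight cached runs above can be reduced to a median as well. The sketch below only does the arithmetic on the values in the table; the helper name `median` is mine.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of xs without modifying the input slice.
// It returns 0 for an empty slice.
func median(xs []float64) float64 {
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	n := len(sorted)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// cachedRuns holds the Android bootstrap times (in seconds) from the table.
var cachedRuns = []float64{10.9, 17.9, 12.6, 4.8, 2.5, 24.2, 16.9, 2.47}

func main() {
	fmt.Printf("median cached bootstrap time: %.2f s\n", median(cachedRuns))
}
```

With an even number of runs the median is the mean of the two middle values, here 10.9 s and 12.6 s, so roughly 11.75 s.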

(I also tried to put the temporary cache in the app-specific directory rather than in the temporary per-app space, under the assumption that the temporary area was too slow, but actually nothing really changed.)

As an extra data point: an OONI user who helped us test these patches, @yeganathan18, reported that the rendezvous configuration was bootstrapping more frequently than it did for us (3 times out of 7) in measurements he ran in India. This result was quite puzzling to me, since I did not expect to see variability depending on the geographic location and I would have expected this person to see mostly timeouts like I did. (Should I have expected it?)

Other measurements from other countries, though, confirmed that our default configuration does not often bootstrap with a 600 s timeout. OTOH, those measurements also show the cache helping a lot.

Questions for Snowflake developers

1. Do these mobile performance results, with and without caching, match your experience?
2. Is it correct to say that after tor says "Delaying directory fetches: No running bridges" it's basically game over and the bootstrap will not converge until tor decides to try handshaking again? (And this until is certainly longer than the maximum time we're willing to wait for an interactive OONI experiment?)
3. Do you think we should be measuring by default using AMP or using the rendezvous mechanism? That is, which data point would be more useful to you? Should we do both together? Should we choose at random? (Of course, I think we should also include data about the mechanism being used in the measurement, otherwise it's pointless)
4. Assuming the answer to question 1 is that these results we see are expected, what is the most useful measurement we can implement for you? Is it more useful to know that the Snowflake-assisted bootstrap times out often or is it more useful to know that we could bootstrap using Snowflake although the cache makes the bootstrap time more difficult to compare?

(FTR I've mentioned this issue in the Snowflake issue tracker: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40097)

Hey! This is really amazing work, thanks for the detailed writeup. I'm excited for when the test results eventually come in and what it can tell us about Snowflake reachability! I have several comments and a summary of answers to your 4 enumerated questions below.

Let us call rendezvous the current configuration [snip]. Three other configurations of torsf are possible.

This is just a slight nitpick on wording:

We've been using rendezvous method to refer to different ways of contacting the broker. The current configuration in Tor Browser and the default configuration you're presenting here uses domain fronting as the rendezvous method. AMP cache is an alternative rendezvous method. So I'd recommend the following naming schemes for the different configurations you've shown here:

domain fronting
amp
domain fronting + cache
amp + cache

(we tested both "https://snowflake-broker.torproject.net.global.prod.fastly.net/" and "https://snowflake-broker.torproject.net/", which is the correct one? I suppose the first one for circumvention reasons, but maybe I'm missing something here?).

Just took a look at how you're using this. Your client configuration is essentially:

ClientConfig{
    BrokerURL: "https://snowflake-broker.torproject.net.global.prod.fastly.net/",
    FrontDomain: "cdn.sstatic.net",
}

This is correct and it's what Tor Browser is configured to use.

The BrokerURL here isn't seen by the censor, it's included inside the TLS encrypted HTTP request to the front domain. The reason for using the fastly URL is to have traffic redirected to the right place. Our account at fastly has https://snowflake-broker.torproject.net.global.prod.fastly.net/ set up to forward traffic to https://snowflake-broker.torproject.net/.
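To illustrate the split described above, here is a minimal sketch of a domain-fronted request in Go. It is not the Snowflake client code: the helper name `newFrontedRequest` is hypothetical, the POST method and empty body are just placeholders, and the request is only built, never sent.

```go
package main

import (
	"fmt"
	"net/http"
)

// newFrontedRequest builds a request whose network destination (DNS lookup,
// TCP connection, TLS SNI) is frontDomain, while the Host header carried
// inside the encrypted HTTP request names the broker host.
func newFrontedRequest(brokerURL, frontDomain string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, brokerURL, nil)
	if err != nil {
		return nil, err
	}
	req.Host = req.URL.Host    // what the CDN sees after decrypting TLS
	req.URL.Host = frontDomain // what an on-path observer sees
	return req, nil
}

func main() {
	req, err := newFrontedRequest(
		"https://snowflake-broker.torproject.net.global.prod.fastly.net/",
		"cdn.sstatic.net",
	)
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	fmt.Println("connects to:", req.URL.Host)
	fmt.Println("Host header:", req.Host)
}
```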

The cache is most likely if not certainly making the bootstrap faster because tor needs to fetch less data over the (bandwidth constrained?) Snowflake. The key question however is what are we measuring? Do we want to measure the total time tor takes to bootstrap from scratch when using Snowflake? Do we want to measure whether tor would bootstrap with Snowflake given a cache? When asking internally this question, we were conscious that choosing to use a cache will certainly be a problem in terms of making any statement regarding the bootstrap time.

This is a great question and you're right that caching on the client side will decrease the bootstrap time. It will also be hard to differentiate the measurements between first time clients and clients making cached connections.

In my opinion, it's best to start small and eventually work our way up to more complex measurements if they are necessary. I would lean towards caching and focusing on learning about outright blocks of snowflake first before moving on to performance measurements. We have some ongoing work to improve snowflake performance and assess this using onionperf instances from various vantage points. While OONI would be a great resource in measuring Snowflake performance on mobile networks, we have some lower hanging fruit that we'd like to learn first and this can be learned without a full Tor bootstrap:

* Snowflake censorship attempts have started to pop up in recent months and it would be great to prioritize learning about where and how these are happening.
* A significant part of the Snowflake performance cost is initiating the connection with a Snowflake proxy. Users on mobile networks are particularly prone to having restrictive NAT configurations and it can take up to a few minutes to be matched with a working Snowflake proxy, depending on availability. With timestamped log messages from our new event channel we can learn how long this process is taking.

It's also the case that most users will be using cached tor states. So performance measurements with the cached state will still be interesting from that perspective.

What is interesting, if we read the logcat, is that tor says "Delaying directory fetches: No running bridges". I think this could mean that tor will try continuing the bootstrap at a later time. So, I think that after this message the bootstrap should be considered failed. Now, the obvious question to ask to Snowflake developers is whether this assumption is true.

Here's the line in question:

01-28 15:19:22.401 20606 20948 E GoLog   : Jan 28 15:19:22.000 [notice] Delaying directory fetches: No running bridges

This is not actually an error and shouldn't be related to the bootstrap problem. This is a side effect of the firewalling that snowflake does on its OR port. The bridge directory authority does an OR port reachability test when bridges join the network, and if the OR port is reachable, it will assign a 'running' flag to it. We frequently firewall this port for bridges that we do not want to hand out over BridgeDB or to make them less susceptible to probing attacks. It shouldn't actually interfere with the functionality of the bridge, but it does cause core tor to print out these messages.
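In practice this suggests that a client should not treat this notice as a failure signal and should instead track tor's "Bootstrapped NN%" progress lines. A minimal sketch of that idea, with helper names that are mine rather than anything in torsf:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// bootstrapRe matches the percentage in tor's "Bootstrapped NN%" log lines.
var bootstrapRe = regexp.MustCompile(`Bootstrapped (\d+)%`)

// bootstrapProgress extracts the bootstrap percentage from a tor log line.
// It returns false when the line is not a bootstrap progress line.
func bootstrapProgress(line string) (int, bool) {
	m := bootstrapRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

// isBenignNotice reports whether the line is the "No running bridges"
// notice, which per the explanation above does not indicate a failure.
func isBenignNotice(line string) bool {
	return strings.Contains(line, "Delaying directory fetches: No running bridges")
}

func main() {
	lines := []string{
		"Jan 28 15:19:22.000 [notice] Delaying directory fetches: No running bridges",
		"Jan 28 15:19:30.000 [notice] Bootstrapped 100% (done): Done",
	}
	for _, line := range lines {
		if isBenignNotice(line) {
			fmt.Println("ignoring benign notice")
			continue
		}
		if pct, ok := bootstrapProgress(line); ok {
			fmt.Println("bootstrap progress:", pct)
		}
	}
}
```

The second log line in `main` is an example written for this sketch, not taken from the logcat.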

Mobile measurements, though, are far more problematic. Here's a table with results on Android:

[snip]

Interestingly, with caching enabled, I got these results:

The results of mobile clients with a full uncached tor bootstrap are surprising to me as well. I wouldn't have expected the difference between cached and uncached bootstraps to be this extreme. What version of tor are you using here? It's possible you're running into a bug where bootstraps will hang indefinitely if done without a bridge fingerprint. I'm not sure this is the issue but it's worth digging into a bit.

This result was quite puzzling to me, since I did not expect to see variability depending on the geographic location and I would have expected this person to see mostly timeouts like I did. (Should I have expected it?)

We have noticed a variation in performance due to geographic location and also due to the NAT/networking setup of the client. This is something we're still trying to understand and map out but yes we can expect there to be considerable variation between devices at the moment.

Now for the summary answers to your four questions:

Do these mobile performance results, with and without caching, match your experience?

All of the results look reasonable and expected to me except the mobile uncached results. I think it's worth doing some debugging and digging into that a bit more if you're willing.

Is it correct to say that after tor says "Delaying directory fetches: No running bridges" it's basically game over and the bootstrap will not converge until tor decides to try handshaking again? (And this until is certainly longer than the maximum time we're willing to wait for an interactive OONI experiment?)

No, see my comment above: this is an unrelated side effect of firewalling the OR port at the bridge.

Do you think we should be measuring by default using AMP or using the rendezvous mechanism? That is, which data point would be more useful to you? Should we do both together? Should we choose at random? (Of course, I think we should also include data about the mechanism being used in the measurement, otherwise it's pointless)

It would be really useful to us to do bootstraps using both the domain fronting method and the AMP cache method. We might add more rendezvous methods in the future and at that point it would be useful to test those as well!

Assuming the answer to question 1 is that these results we see are expected, what is the most useful measurement we can implement for you? Is it more useful to know that the Snowflake-assisted bootstrap times out often or is it more useful to know that we could bootstrap using Snowflake although the cache makes the bootstrap time more difficult to compare?

I would rank the usefulness of different measurements as follows:

1. Is the tested snowflake configuration blocked, i.e., can a (cached) tor bootstrap happen?
2. If it is blocked, where was it blocked? For example, was the client able to get an assigned proxy? Did the assigned proxies just not work? Did the client fail to connect to STUN servers even before the connection with the broker?
3. What are the time results for the various connection attempts? When did the client get assigned snowflake(s)? When did it successfully connect to the snowflakes?
4. Bootstrap connection time. As stated above, I think we should do cached bootstraps for now. These will be useful enough, and we have other ways of doing more full performance breakdowns. We can always change this later but for now cached will be great :)
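One way to read the first two items of this ranking is as a question about the earliest stage that failed. The sketch below encodes that reading; the `Progress` type, its fields, and the stage labels are hypothetical and only meant to show the ordering (STUN, then broker, then proxy, then tor bootstrap).

```go
package main

import "fmt"

// Progress records how far a single run got. All fields are hypothetical.
type Progress struct {
	STUNReachable  bool // could the client talk to a STUN server?
	ProxyAssigned  bool // did the broker hand out a proxy?
	ProxyConnected bool // did a connection to an assigned proxy work?
	Bootstrapped   bool // did the (cached) tor bootstrap complete?
}

// failureStage returns the earliest stage that did not succeed, or the
// empty string when the whole run succeeded.
func failureStage(p Progress) string {
	switch {
	case !p.STUNReachable:
		return "stun"
	case !p.ProxyAssigned:
		return "broker"
	case !p.ProxyConnected:
		return "proxy"
	case !p.Bootstrapped:
		return "bootstrap"
	default:
		return ""
	}
}

func main() {
	run := Progress{STUNReachable: true, ProxyAssigned: true}
	fmt.Printf("failed at stage: %q\n", failureStage(run))
}
```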

Let me know if I can clarify anything more! It's exciting to see this all come together!

Hey! This is really amazing work, thanks for the detailed writeup. I'm excited for when the test results eventually come in and what it can tell us about Snowflake reachability! I have several comments and a summary of answers to your 4 enumerated questions below.

Thanks a lot for your detailed reply! 🙂

I'll reply inline to your comment and explain what changes we implemented thanks to the insights it provided.

Here's a quick summary of the most important points and still-open questions:

* we (= OONI) should try and figure out why the uncached bootstrap performance was so poor
* we have started writing code for using the "event channel" (thanks for that!) in torsf, we'll report back!
* do I understand correctly that at this stage the most useful thing for torsf to do is to choose one of "amp" and "domain fronting" rendezvous at random, rather than always using "domain fronting"?

See below for more detailed answers.

Let us call rendezvous the current configuration [snip]. Three other configurations of torsf are possible.

This is just a slight nitpick on wording:

We've been using rendezvous method to refer to different ways of contacting the broker. The current configuration in Tor Browser and the default configuration you're presenting here uses domain fronting as the rendezvous method. AMP cache is an alternative rendezvous method. So I'd recommend the following naming schemes for the different configurations you've shown here:

* domain fronting
* amp
* domain fronting + cache
* amp + cache

Thanks for educating me about the correct terminology! I have updated my mental model, the implementation and the spec to use the suggested vocabulary.

(Since what the experiment actually does is set tor's DataDirectory (which in turn, by default, contains the CacheDirectory), I have also changed the terminology to say datadir rather than cache.)

While there, to ease experimentation, I made it possible in https://github.com/ooni/probe-cli/pull/683 to select whether to use "amp" or "domain_fronting" and whether to enable/disable a "persistent data dir".

The defaults we're using now are, respectively, "domain fronting" and enabling a "persistent data dir", as you recommended. Though, the possibility of changing these values w/ settings opens up the possibility of running further experiments (among which, one to clarify why tor does not often bootstrap on mobile).

(we tested both "https://snowflake-broker.torproject.net.global.prod.fastly.net/" and "https://snowflake-broker.torproject.net/", which is the correct one? I suppose the first one for circumvention reasons, but maybe I'm missing something here?).

Just took a look at how you're using this. Your client configuration is essentially:

ClientConfig{
    BrokerURL: "https://snowflake-broker.torproject.net.global.prod.fastly.net/",
    FrontDomain: "cdn.sstatic.net",
}

This is correct and it's what Tor Browser is configured to use.

Thanks for ensuring that our config is correct!

The BrokerURL here isn't seen by the censor, it's included inside the TLS encrypted HTTP request to the front domain. The reason for using the fastly URL is to have traffic redirected to the right place. Our account at fastly has https://snowflake-broker.torproject.net.global.prod.fastly.net/ set up to forward traffic to https://snowflake-broker.torproject.net/.

Got it, thanks for clarifying how this works!

The cache is most likely if not certainly making the bootstrap faster because tor needs to fetch less data over the (bandwidth constrained?) Snowflake. The key question however is what are we measuring? Do we want to measure the total time tor takes to bootstrap from scratch when using Snowflake? Do we want to measure whether tor would bootstrap with Snowflake given a cache? When asking internally this question, we were conscious that choosing to use a cache will certainly be a problem in terms of making any statement regarding the bootstrap time.

This is a great question and you're right that caching on the client side will decrease the bootstrap time. It will also be hard to differentiate the measurements between first time clients and clients making cached connections.

In my opinion, it's best to start small and eventually work our way up to more complex measurements if they are necessary. I would lean towards caching and focusing on learning about outright blocks of snowflake first before moving on to performance measurements.

Thanks for helping us to choose the right approach here!

We have some ongoing work to improve snowflake performance and assess this using onionperf instances from various vantage points. While OONI would be a great resource in measuring Snowflake performance on mobile networks, we have some lower hanging fruit that we'd like to learn first and this can be learned without a full Tor bootstrap:
* Snowflake censorship attempts have started to pop up in recent months and it would be great to prioritize learning about where and how these are happening.

* A significant part of the Snowflake performance cost is initiating the connection with a Snowflake proxy. Users on mobile networks are particularly prone to having restrictive NAT configurations and it can take up to a few minutes to be matched with a working Snowflake proxy, depending on availability. With timestamped log messages from [our new event channel](https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/67) we can learn how long this process is taking.

Awesome to see the "event channel" being implemented! I have started sketching out how OONI could use this functionality to collect these events here https://github.com/ooni/probe-cli/pull/685. It's still incomplete, but I'll report back once I've finished coding and I am able to start using it. Thanks to you all for making this change!

(I'd rather not release a version of OONI pinning to a commit, though, so I'd rather wait for the "event channel" to appear inside a release before including this functionality into an OONI release.)

It's also the case that most users will be using cached tor states. So performance measurements with the cached state will still be interesting from that perspective.

Right, that's actually a good point in terms of capturing the average user experience! Didn't think about this!

What is interesting, if we read the logcat, is that tor says "Delaying directory fetches: No running bridges". I think this could mean that tor will try continuing the bootstrap at a later time. So, I think that after this message the bootstrap should be considered failed. Now, the obvious question to ask to Snowflake developers is whether this assumption is true.

Here's the line in question:

01-28 15:19:22.401 20606 20948 E GoLog   : Jan 28 15:19:22.000 [notice] Delaying directory fetches: No running bridges

This is not actually an error and shouldn't be related to the bootstrap problem. This is a side effect of the firewalling that snowflake does on its OR port. The bridge directory authority does an OR port reachability test when bridges join the network, and if the OR port is reachable, it will assign a 'running' flag to it. We frequently firewall this port for bridges that we do not want to hand out over BridgeDB or to make them less susceptible to probing attacks. It shouldn't actually interfere with the functionality of the bridge, but it does cause core tor to print out these messages.

Understood! Thank you for shedding light on the true meaning of such an error message!

Mobile measurements, though, are far more problematic. Here's a table with results on Android:

[snip]

Interestingly, with caching enabled, I got these results:

The results of mobile clients with a full uncached tor bootstrap are surprising to me as well. I wouldn't have expected the difference between cached and uncached bootstraps to be this extreme. What version of tor are you using here?

AFAICT from https://github.com/ooni/go-libtor's README, we're using tor@d06bcf7672, authored on 2021-11-08.

Judging from tor's tag history, this should be between tor 0.4.6.7 and tor 0.4.6.8.

It's possible you're running into a bug where bootstraps will hang indefinitely if done without a bridge fingerprint. I'm not sure this is the issue but it's worth digging into a bit.

Absolutely!

I think it may be worth upgrading to the latest stable version of tor.

This result was quite puzzling to me, since I did not expect to see variability depending on the geographic location and I would have expected this person to see mostly timeouts like I did. (Should I have expected it?)

We have noticed a variation in performance due to geographic location and also due to the NAT/networking setup of the client. This is something we're still trying to understand and map out but yes we can expect there to be considerable variation between devices at the moment.

Understood, thank you!

Now for the summary answers to your four questions:

Do these mobile performance results, with and without caching, match your experience?

All of the results look reasonable and expected to me except the mobile uncached results. I think it's worth doing some debugging and digging into that a bit more if you're willing.

Yes! Thanks a lot for confirming this was unexpected. I believe it's clear this is an oddity to look into.

Is it correct to say that after tor says "Delaying directory fetches: No running bridges" it's basically game over and the bootstrap will not converge until tor decides to try handshaking again? (And this until is certainly longer than the maximum time we're willing to wait for an interactive OONI experiment?)

No, see my comment above: this is an unrelated side effect of firewalling the OR port at the bridge.

🙏

Do you think we should be measuring by default using AMP or using the rendezvous mechanism? That is, which data point would be more useful to you? Should we do both together? Should we choose at random? (Of course, I think we should also include data about the mechanism being used in the measurement, otherwise it's pointless)

It would be really useful to us to do bootstraps using both the domain fronting method and the AMP cache method. We might add more rendezvous methods in the future and at that point it would be useful to test those as well!

On this note, would it be reasonable if we choose one bootstrap method at random, then? The improved implementation I added in https://github.com/ooni/probe-cli/pull/683 uses "domain fronting" by default, but perhaps it seems more useful to you Snowflake developers if we randomize the bootstrap type?

Assuming the answer to question 1 is that these results we see are expected, what is the most useful measurement we can implement for you? Is it more useful to know that the Snowflake-assisted bootstrap times out often or is it more useful to know that we could bootstrap using Snowflake although the cache makes the bootstrap time more difficult to compare?

I would rank the usefulness of different measurements as follows:

Is the tested snowflake configuration blocked, i.e., can a (cached) tor bootstrap happen?

If it is blocked, where was it blocked? For example, was the client able to get an assigned proxy? Did the assigned proxies just not work? Did the client fail to connect to STUN servers even before the connection with the broker?

What are the time results for the various connection attempts? When did the client get assigned snowflake(s)? When did it successfully connect to the snowflakes?

Bootstrap connection time. As stated above, I think we should do cached bootstraps for now. These will be useful enough, and we have other ways of doing more full performance breakdowns. We can always change this later but for now cached will be great :)

Thanks a lot for clearly ranking all the measurements by their utility, that's super useful!

Let me know if I can clarify anything more! It's exciting to see this all come together!

No, everything was super clear, thanks!

do I understand correctly that at this stage the most useful thing for torsf to do is to choose one of "amp" and "domain fronting" rendezvous at random, rather than always using "domain fronting"?

Actually, I would say that the domain fronting option is the most useful. Right now AMP cache is just a backup and isn't recommended as a configuration anywhere. So if we have to make a choice, I'd only test domain fronting and we'll update if needed.

(I'd rather not release a version of OONI pinning to a commit, though, so I'd rather wait for the "event channel" to appear inside a release before including this functionality into an OONI release.)

Done! Should be in v2.1.0

do I understand correctly that at this stage the most useful thing for torsf to do is to choose one of "amp" and "domain fronting" rendezvous at random, rather than always using "domain fronting"?

Actually, I would say that the domain fronting option is the most useful. Right now AMP cache is just a backup and isn't recommended as a configuration anywhere. So if we have to make a choice, I'd only test domain fronting and we'll update if needed.

Awesome, thanks for clarifying!

(I'd rather not release a version of OONI pinning to a commit, though, so I'd rather wait for the "event channel" to appear inside a release before including this functionality into an OONI release.)

Done! Should be in v2.1.0

Thanks a lot! I've created a new issue for tracking this enhancement: https://github.com/ooni/probe/issues/2017

The other remaining open issue is to figure out the long bootstrap time on mobile w/o persistent datadir, for which I opened a new issue at https://github.com/ooni/probe/issues/2018.
