^ fyi @status-im/status-core @rachelhamlin @cammellos @jakubgs
@oskarth which version of the app are you running? could you please add it to the description
@cammellos done
Side note: I also searched for 'bandwidth' in open issues and couldn't find a relevant one, which is a bit surprising given that it's a very common user complaint, anecdotally.
It's actually on my mind, but not captured, you're right @oskarth. To justify that slightly, we haven't had mental bandwidth to focus on much outside of SNT utility, multi-account, keycard and bugs this year—until now. So bandwidth will be a topic during our Oct 15 planning session (discuss post TK today).
> User feedback not making its way into concrete problem descriptions?
The issue of capturing user feedback is something that I very much hope to prioritize now that @andremedeiros is coming onboard to help with the dev process.
> b) bandaid to help with adoption and traction of Status the app, w/o as strong metadata/decentralization guarantees, a la Infura-for-chat (basically what we have already with mailserver)
What kind of sacrifice are we willing to make here? Let's discuss in janitors.
> It's actually on my mind, but not captured, you're right @oskarth. To justify that slightly, we haven't had mental bandwidth to focus on much outside of SNT utility, multi-account, keycard and bugs this year—until now. So bandwidth will be a topic during our Oct 15 planning session (discuss post TK today).
Yeah, that's fair. I think it's a larger systemic issue though, as users' feedback doesn't make its way into GHIs. Perhaps because it's too intimidating? Or they give feedback in other forums and then there's a lack of follow-up? Something re the community link is missing here, not quite sure what. cc @jonathanbarker @j-zerah FYI.
> What kind of sacrifice are we willing to make here? Let's discuss in janitors.
I'd like this to be an open discussion, but we can bring it up there as well.
Also worth noting: the version used for the bandwidth tests is still listening to the old shared topic, which will be disabled for v1. From the bandwidth tests https://docs.google.com/spreadsheets/d/13kffxZaPnvULoy5Qh5sZSI2551KusCLJWUcdhKyobkE/edit#gid=0, that version is ~6 times more bandwidth hungry than v1 (94 MB vs 15 MB), although the benchmarks are to be taken with a pinch of salt. Currently working on having them automated, so we can better tune and record them.
> Currently working on having them automated, so we can better tune and record them.
Commendable effort, @cammellos! What does this involve and how hard would it be to get this to run as part of the test suite?
A thought on the UI side: we already implemented 'fetch messages' to allow for more user-controlled bandwidth use. We could expand this to channels (cc @errorists), after exploring other options to save bandwidth while retaining all functionality.
Regarding feedback, I'll check in #statusphere / the ambassadors channel to see if they recognize the issue. As reliability and notifications, which are crucial for on-the-go use, have developed a bad rep, it could also be that the majority of our user base relies on wifi / at-home experimentation with Status. Just a theory.
@andremedeiros We are only going to test status-protocol-go, as automating the testing of status-react is much harder (so we won't be testing mailserver interactions until that code is ported).
The strategy I am following is to have two clients (or more, for now just two) interact with each other for a specified amount of time/messages. Both clients will be dockerized and run through docker-compose; at the end of the tests, metrics for each container can be collected with `docker stats`.
We can probably easily get it into the test suite (status-protocol-go); the only dependencies would be docker/compose and golang. It would take some more time to make it a red/green test (it's more of a benchmark). We also don't have isolated network conditions for now, so results depend on the overall traffic of the network, but we can take that into account when measuring.
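For illustration, the two-client setup described above could look roughly like this as a compose file (the image name, flags and duration are hypothetical placeholders, not the real harness):

```yaml
version: "3"
services:
  client-a:
    image: status-protocol-go-bench   # hypothetical benchmark image
    container_name: client-a
    command: ["--peer", "client-b", "--duration", "10m"]
  client-b:
    image: status-protocol-go-bench
    container_name: client-b
    command: ["--peer", "client-a", "--duration", "10m"]
```

At the end of a run, `docker stats --no-stream client-a client-b` prints per-container CPU, memory and NET I/O.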
That makes perfect sense, @cammellos. Thank you.
> Or they give feedback in other forums and then there's a lack of follow-up? Something re the community link is missing here, not quite sure what. cc @jonathanbarker @j-zerah FYI.
re: capturing user/community feedback - have we considered an "engagement survey" type mechanism for our main community users? Similar to the one for core contributors, but focused on feedback they have for Status products, features, etc.
We have had those in the past and it surely is time to bring them back! They were always a bit of a one-off, never a solid mechanism that better balances effort and output.
Regarding bandwidth, a quick poll in #statusphere brought no alarming response from 3 active community members/contributors, all estimating a monthly 1 GB going to Status. Not to say that it's not a problem :)
One plan: have a separate type of public chat that isn't as private (with respect to traffic analysis). You could just have it as an option and have some UI element that notes what kind of public channel it is.
This could also be the default, and then we can set up relays that allow people to communicate in these at lower bandwidth cost (similar to @jakubgs's bridge).
Here's a rough tool to check for bandwidth: https://github.com/status-im/status-protocol-go-bandwidth-test
> One plan: have a separate type of public chat that isn't as private (with respect to traffic analysis). You could just have it as an option and have some UI element that notes what kind of public channel it is.
I don't believe the issue is due to public chats (we had high usage even without joining any public chat in the previous bandwidth tests); it's mainly due to discovery topics, as you receive messages not sent to you. In a public chat the chance is lower (there's a chance that your bloom filter matches some other topic, but it's probably not huge). So unless we completely bypass Whisper I'm not sure we can optimize those much, but it's worth having a look.
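To put a rough number on that parenthetical, here is a back-of-the-envelope check assuming the standard whisperv6 parameters (512-bit bloom filter, 3 bits set per topic); the topic counts are arbitrary examples:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const bits, perTopic = 512.0, 3.0
	for _, topics := range []float64{1, 5, 20} {
		// expected fraction of bits set after inserting `topics` topics
		p := 1 - math.Pow(1-perTopic/bits, topics)
		// a random foreign topic matches iff all 3 of its bits are set
		fmt.Printf("%2.0f topics: false-positive chance ≈ %.5f%%\n",
			topics, 100*math.Pow(p, perTopic))
	}
}
```

Even at 20 subscribed topics this comes out around 0.1%, which supports the "probably not huge" intuition.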
Have we ever tried playing with dynamic tuning of the bloom filters based on user preference? It's basically a sliding scale of how much you poll for, based on how much information you want to give the server you're asking. If a user doesn't care about that, they can at least minimize the amount of "extra stuff" they're getting.
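A minimal sketch of what that sliding scale could look like, assuming `TopicToBloom`/`BloomFilterSize` from whisperv6; the decoy policy and the `maxDecoys` bound are made up for illustration, not an existing API:

```go
package main

import (
	"crypto/rand"
	"fmt"

	whisper "github.com/status-im/whisper/whisperv6"
)

// advertisedBloom merges the bloom bits of the real topics plus
// privacy*maxDecoys random decoy topics, so a peer cannot tell which
// subset of the filter we actually poll for.
func advertisedBloom(real []whisper.TopicType, privacy float64) []byte {
	const maxDecoys = 32 // hypothetical upper bound at privacy = 1.0
	topics := append([]whisper.TopicType{}, real...)
	for i := 0; i < int(privacy*maxDecoys); i++ {
		var t whisper.TopicType
		_, _ = rand.Read(t[:]) // random decoy topic
		topics = append(topics, t)
	}
	bloom := make([]byte, whisper.BloomFilterSize)
	for _, t := range topics {
		for i, b := range whisper.TopicToBloom(t) {
			bloom[i] |= b
		}
	}
	return bloom
}

func main() {
	chat := whisper.BytesToTopic([]byte("my-chat"))
	// privacy 0.0 = exact topics, 1.0 = maximally padded filter
	fmt.Printf("%x\n", advertisedBloom([]whisper.TopicType{chat}, 0.5))
}
```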
That's a fairly big change: it means we would be fundamentally changing how Whisper works. Say you don't even use a bloom filter but pass just a list of topics (provides no darkness, but the best bandwidth): you still have an issue with the shared topic (currently each user is assigned to a random bucket based on their pk, n = 5000).
We also have a personal topic that can be used instead of the partitioned one, which is the user's pk, but at that point any darkness is gone, so it makes little sense to use Whisper.
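A toy illustration of the bucketing just described; the `contact-discovery-%d` naming follows the Status discovery-topic scheme, but the exact derivation (and the final hash down to a 4-byte Whisper topic) is simplified here:

```go
package main

import (
	"fmt"
	"math/big"
)

const nPartitions = 5000

// partitionTopic picks the recipient's discovery bucket from their
// public key modulo the number of partitions.
func partitionTopic(pubKey []byte) string {
	bucket := new(big.Int).Mod(new(big.Int).SetBytes(pubKey), big.NewInt(nPartitions))
	return fmt.Sprintf("contact-discovery-%d", bucket)
}

func main() {
	pk := []byte{0x02, 0xaa, 0xbb, 0xcc, 0xdd} // hypothetical key bytes
	fmt.Println(partitionTopic(pk))
}
```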
I think we need to understand a bit better where the consumption is coming from. Is it coming from extra messages that you don't care about? Is it coming from the fact that you receive multiple copies of each message? Or is it just Whisper overhead?
Once we understand the dynamics better we can see what we can do and where it's best to optimize, imo.
Thoroughly agreed.
> I think we need to understand a bit better where the consumption is coming from. Is it coming from extra messages that you don't care about? Is it coming from the fact that you receive multiple copies of each message? Or is it just Whisper overhead?
We have quite granular Whisper metrics that can answer most of these questions: https://github.com/status-im/whisper/blob/master/whisperv6/metrics.go. For example, we have `envelopeAddedCounter` and `envelopeNewAddedCounter`, whose difference tells us how many duplicates we receive, or `envelopeErrNoBloomMatchCounter`, which tells us the number of messages not matching the bloom filter.
What we would need to do is expose them in the app, because as far as I know they are used exclusively by statusd running on our servers.
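As a sketch of how those questions could be answered from the existing counters, assuming the `whisper/...` metric names registered in whisperv6/metrics.go and that go-ethereum metrics collection is enabled:

```go
package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/metrics"
)

// count reads a counter from the default registry; unknown or
// disabled metrics simply report zero.
func count(name string) int64 {
	if c, ok := metrics.DefaultRegistry.Get(name).(metrics.Counter); ok {
		return c.Count()
	}
	return 0
}

func main() {
	added := count("whisper/envelopeAdded")       // every envelope accepted
	newAdded := count("whisper/envelopeNewAdded") // first-time envelopes only
	noBloom := count("whisper/envelopeErrNoBloomMatch")

	if added > 0 {
		fmt.Printf("duplicates: %d (%.1f%% of received envelopes)\n",
			added-newAdded, 100*float64(added-newAdded)/float64(added))
	}
	fmt.Printf("rejected for bloom mismatch: %d\n", noBloom)
}
```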
Many open source projects, like Firefox, collect stats and send them to centralized servers only if a user agrees to do so. Maybe we can have a similar strategy? It should be opt-in, of course.
We had a short meeting today about the bandwidth testing and I've noted down some things:

* Add more granular metrics for `status-protocol-go` and `whisper` to measure
* Find a reporting format, preferably fed to Prometheus
* Run tests for various volumes to check complexity
* Run tests periodically to measure improvements/regressions

I will start working on those probably next week, as I have to finish some other stuff.
> Find a reporting format, preferably fed to Prometheus
I'm not sure I would recommend pull-based tools for load testing, unless these load tests will be fairly long. Also, because Prometheus gets data periodically, it can miss some fluctuations which might be interesting for us. Maybe writing to InfluxDB? Having all data points can also be an advantage.
I did consider InfluxDB too; we can see what works better. I'd agree that a push rather than pull scheme would work better for benchmarks.
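A minimal sketch of the push approach with the InfluxDB 1.x Go client; the database, measurement, tag and field names here are made up for the example:

```go
package main

import (
	"log"
	"time"

	client "github.com/influxdata/influxdb1-client/v2"
)

func main() {
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	bp, _ := client.NewBatchPoints(client.BatchPointsConfig{
		Database:  "bandwidth",
		Precision: "ns",
	})

	// one point per observed envelope, pushed as it happens rather
	// than scraped on an interval
	pt, _ := client.NewPoint(
		"envelopes",
		map[string]string{"node": "client-a"},     // tags
		map[string]interface{}{"size_bytes": 512}, // fields
		time.Now(),
	)
	bp.AddPoint(pt)

	if err := c.Write(bp); err != nil {
		log.Fatal(err)
	}
}
```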
Discuss post: https://discuss.status.im/t/fixing-whisper-for-great-profit/1419
Theoretical model numbers: https://htmlpreview.github.io/?https://github.com/vacp2p/research/blob/master/whisper_scalability/report.html
Waku mode draft: https://github.com/status-im/specs/pull/54
@jakubgs any luck with the above?
Also, it'd be great if we could figure out where other traffic might be coming from, i.e. things that aren't captured by the above model. For example, I remember some benchmark saying we spend 20% of traffic on Infura, which seems insane but makes sense given the lack of transaction indexing (?). This means it might become the bottleneck with Waku mode in place, which would hint at attacking the indexing problem, e.g. with something algorithmic like @yenda was working on, or indexing a la thegraph, as @bgits suggested.
@oskarth on a new account there would only be a handful of calls to Infura afaik. The heavy stuff is only when there are transactions to recover.
Here's an update on the current state of my work on this:
> Add more granular metrics for `status-protocol-go` and `whisper` to measure
>
> * `status-go` version from the App

Regarding the `status-go` version: currently I'm not sure how to fix the version issue. It should be fixed in `status-go`, but to figure out how to do that correctly I'll have to talk to Adam.
> Find a reporting format, preferably fed to Prometheus

I also looked at `pushgateway` for Prometheus as an alternative, but that is still dependent on the Prometheus pull rate/interval and would not represent the real-time creation of the metrics generated by the benchmark.
> * Run tests for various volumes to check complexity
> * Run tests periodically to measure improvements/regressions
After investigating the status-protocol-go-bandwidth-test package by Andrea, I don't think there's anything wrong with his simple approach of just spawning the processes with his `run.sh` script. Though it might be a bit nicer if we used something like Supervisord with the `numprocs` setting, or systemd with instantiated services, to orchestrate multiple processes in a more manageable way.
According to Adam, the best way to collect these metrics would be to subscribe to the `envelopeFeed`:

https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/whisper.go#L178-L182

and listen for the `EventEnvelopeReceived` event:

https://github.com/status-im/whisper/blob/39d4d0a14f/whisperv6/events.go#L19

This would allow me to collect envelope metrics (size, numbers) in InfluxDB without having to modify the `whisper` repo itself.
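A sketch of that approach, assuming the `SubscribeEnvelopeEvents` API and `EventEnvelopeReceived` event type linked above; the actual InfluxDB write is left as a comment:

```go
package collector

import (
	"log"

	whisper "github.com/status-im/whisper/whisperv6"
)

// recordEnvelopes logs every received envelope until the subscription
// is closed; a real collector would push a data point instead.
func recordEnvelopes(w *whisper.Whisper) {
	events := make(chan whisper.EnvelopeEvent, 100)
	sub := w.SubscribeEnvelopeEvents(events)
	defer sub.Unsubscribe()

	for {
		select {
		case e := <-events:
			if e.Event == whisper.EventEnvelopeReceived {
				// write envelope size/topic to InfluxDB here
				log.Printf("envelope %s from peer %s", e.Hash.Hex(), e.Peer)
			}
		case <-sub.Err():
			return
		}
	}
}
```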
I've added a `Topic` attribute to `Envelope` in https://github.com/status-im/whisper/pull/38.
According to the Network Metrics section of the Docker docs:

> Network metrics are not exposed directly by control groups. There is a good explanation for that: network interfaces exist within the context of network namespaces. The kernel could probably accumulate metrics about packets and bytes sent and received by a group of processes, but those metrics wouldn't be very useful. You want per-interface metrics (because traffic happening on the local `lo` interface doesn't really count). But since processes in a single `cgroup` can belong to multiple network namespaces, those metrics would be harder to interpret: multiple network namespaces means multiple `lo` interfaces, potentially multiple `eth0` interfaces, etc.; so this is why there is no easy way to gather network metrics with control groups.
So as an alternative they propose creating iptables rules:

> IPtables (or rather, the netfilter framework for which iptables is just an interface) can do some serious accounting. For instance, you can setup a rule to account for the outbound HTTP traffic on a web server:
>
> ```
> $ iptables -I OUTPUT -p tcp --sport 80
> ```
>
> There is no `-j` or `-g` flag, so the rule just counts matched packets and goes to the following rule.
>
> Later, you can check the values of the counters, with:
>
> ```
> $ iptables -nxvL OUTPUT
> ```
This can be an issue, since access to iptables requires `root` privileges.
An alternative is "Interface-Level Counters":

> Since each container has a virtual Ethernet interface, you might want to check directly the TX and RX counters of this interface.
It just requires some juggling to get the data. Assuming that `$CID` is the ID of the container we want:

```
TASKS=/sys/fs/cgroup/devices/docker/$CID/tasks
PID=$(head -n 1 $TASKS)
mkdir -p /var/run/netns
ln -sf /proc/$PID/ns/net /var/run/netns/$CID
ip netns exec $CID netstat -i
```
This should get us output like this:

```
admin@mail-01.do-ams3.eth.test:~ % sudo ip netns exec $CID netstat -i
Kernel Interface table
Iface   MTU   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500  365190      0      0      0  305317      0      0      0 BMRU
lo    65536       0      0      0      0       0      0      0      0 LRU
```
Which gives us things like:

* `(RX|TX)-OK` - Packets received/sent correctly.
* `(RX|TX)-ERR` - Packets received/sent but with incorrect checksum.
* `(RX|TX)-DRP` - Packets dropped because of full buffer.
* `(RX|TX)-OVR` - Packets dropped due to exceeding TTL or other timing reason.

Now, packets are nice and all but we don't know their sizes, so that doesn't give us actual bandwidth.
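One possible way around that (my assumption, not something settled in the thread): inside the container's network namespace the kernel does expose byte counters under /sys/class/net/<iface>/statistics/, which `netstat -i` omits. A tiny reader, to be run via the same `ip netns exec $CID` juggling as above:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// cumulative byte counters maintained by the kernel per interface
	for _, name := range []string{"rx_bytes", "tx_bytes"} {
		data, err := os.ReadFile("/sys/class/net/eth0/statistics/" + name)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("eth0 %s: %s\n", name, strings.TrimSpace(string(data)))
	}
}
```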
Problem
As a user with a limited data plan, I want the bandwidth usage to be substantially lower, so that I can use Status on 3G/4G with a limited data plan.
Details
On iOS, the current period under Cellular data shows the following for similar apps:
Note that all apps outside of Status have attachments and images in them.
To calibrate for usage, here are the corresponding numbers from Screen Time for the last 7 days:
Note that in Line and Signal I'm not in any public channels, but in Telegram I'm in several that are a lot more noisy than Status.
Compared to Telegram, Line and Signal, this means we currently consume 10-20x more bandwidth, without attachments. As a user, this is an unacceptable experience.
Implementation
As a somewhat representative user, but one with a limited data plan, I care more about this than about cover traffic/metadata protection.
Acceptance Criteria
Bandwidth usage reduced 10-20x so that it is within a factor of three of comparable apps, like Telegram, Line and Signal.
Notes
In light of the current financial situation, timeline, and growing the core app user base, it might be the case that we partition the problem in two:
a) Continue long-term 'fundamental' research in conjunction with other projects to develop a better alternative (Block.Science/Swarm/Nym/libp2p)
b) bandaid to help with adoption and traction of Status the app, w/o as strong metadata/decentralization guarantees, a la Infura-for-chat (basically what we have already with mailserver)
Side note: I also searched for 'bandwidth' in open issues and couldn't find a relevant one, which is a bit surprising given that it's a very common user complaint, anecdotally. User feedback not making its way into concrete problem descriptions? cc @rachelhamlin @hesterbruikman
Future Steps
Replace Whisper.