status-im / status-go

The Status module that consumes go-ethereum
https://status.im
Mozilla Public License 2.0
724 stars 244 forks

Investigate temperature impact factors during blockchain sync #432

Closed divan closed 5 years ago

divan commented 6 years ago

Problem

Initial blockchain synchronization is known to be CPU expensive and causes phone (over)heating. This is very undesirable, as it immediately turns users away from the app and drains the battery.

We need to find a way to trade off this high CPU usage during initial synchronization for something else - longer sync time or slower disk I/O, for example. As a minimum, we want to reduce CPU usage to sane defaults; as a maximum, we want to be able to reason about exact numbers and configure this tradeoff dynamically (a user with the latest phone model who is charging would prefer a "fast and CPU-expensive sync", whilst a user with an older phone on battery would prefer a "slower and CPU-friendly sync").

Implementation

As the blockchain synchronization code resides in go-ethereum, we need to profile the sync process and figure out which CPU-bound parts can be safely traded for longer time, and the possible outcomes/drawbacks of doing so.

Acceptance Criteria

Random notes

Temperature is believed to be strongly linked to CPU usage, but as the final acceptance criterion is still a low temperature during sync, it makes sense to measure it directly (using the same phone across two app releases). Instead of testing temperature manually by hand, one can use a cheap infrared camera like the Flir One to measure the exact temperature :)

tiabc commented 6 years ago

I think this is not easily achievable, and the scope of this issue is really huge.

Can we reduce it to just figuring out which specific sections of go-ethereum code have the biggest impact on CPU utilization and thus phone heating? By the way, it may not only be CPU: go-ethereum stores everything on disk, and disk operations may have a significant impact as well.

The idea is to identify such places and understand where to move next, creating new issues and parallelising the work. Does that make sense?

adambabik commented 6 years ago

LES/2 might be helpful to reduce CPU usage. go-ethereum team merged the first PR related to it just a few days ago: https://github.com/ethereum/go-ethereum/pull/14970

divan commented 6 years ago

@tiabc yes, let's narrow this issue to investigation only. Renamed the issue.

3esmit commented 6 years ago

I also suffer from this issue on a Galaxy Note 4.

divan commented 6 years ago

I started with the naïve assumption of a CPU–temperature link and did some profiling on my laptop. It quickly showed that most of the CPU time during blockchain sync is spent writing blocks into leveldb in the InsertHeaderChain() function.

I introduced a small delay in this call, effectively making the function sleep before executing the expensive write:

+    time.Sleep(100 * time.Microsecond) // throttle: sleep before the expensive write
     if err := writeHeader(header); err != nil {
         return i, err
     }

Then I measured CPU usage and the total time of synchronizing from scratch.

Here are some preliminary results from my laptop (it has 4 cores, so 400% CPU usage is max):

| Delay | Mean CPU | Time to sync |
|------:|---------:|-------------:|
| 0     | 229%     | 88.52s       |
| 50μs  | 176%     | 153.98s      |
| 100μs | 153%     | 163.13s      |
| 150μs | 125%     | 222.04s      |
| 200μs | 129%     | 354.77s      |
| 250μs | 134%     | 461.80s      |
| 300μs | 122%     | 533.14s      |
| 350μs | 110%     | 570.09s      |
| 400μs | 93%      | 526.46s      |
| 450μs | 83%      | 503.66s      |
| 500μs | 83%      | 570.26s      |
| 550μs | 78%      | 589.08s      |
| 600μs | 77%      | 616.33s      |
| 1ms   | 58%      | 629.36s      |

[3D chart: delay vs mean CPU vs time to sync]

Looks like the inverse relationship is pretty clear here, but I'm going to collect more data to pinpoint the exact function.

Here are detailed graphs of CPU usage for three cases: no delay, 100μs, and 1ms.

[CPU usage graphs: no delay, 100μs, 1ms]


Next steps


Code used for collecting the data and making the graphs: https://gist.github.com/divan/55a55240ea553ae47517733f9fc24a48

themue commented 6 years ago

Great work. I'm currently following your pprof label hint from yesterday.

adambabik commented 6 years ago

Graphs! I love them, great job!

The first thing I noticed is that it can consume more than 100% CPU, which indicates it uses multiple cores. Could you try changing runtime.GOMAXPROCS(runtime.NumCPU()) to runtime.GOMAXPROCS(1) in cmd/statusd/main.go and rerun your measurements? Hopefully, we could set it to 1 during syncing and switch back to runtime.NumCPU() afterwards to decrease CPU usage. I hope it won't negatively impact app performance...

Also, have you tried with the newest develop? We have pretty new CHT root and it should be much quicker than 88.52s.

divan commented 6 years ago

@adambabik yes, it uses 4 cores on my machine, so maximum CPU value is 400%.

Interesting suggestion, I'll play with it as well. Still, I suspect that "processing and writing 2048 block headers to disk as fast as possible" is the major CPU-burning factor here.

I ran these tests on the feature/cli-api branch, so it lacks your CHT update, but that's even better, because this timeframe looks good for statistically significant measurements of the delay impact.

I ran more measurements, inserting delays ranging from 50 to 650 microseconds in 50μs steps, updated the original comment, and here are two 2D charts (the same data as the 3D chart above):

[2D charts: delay vs time to sync, delay vs mean CPU]

0xc1c4da commented 6 years ago

nice work @divan. I'm curious what CPU usage looks like under heavy Whisper traffic? But perhaps that's better suited for another issue.

divan commented 6 years ago

@jarradh no, we don't. I don't think we've even simulated a high-traffic scenario yet. But I do have some ideas for how to simulate it, and yes, that'd be a separate issue.

adambabik commented 6 years ago

To simulate this, we need an account that sends a lot of Whisper messages one after another. It would be great to collect some data with different PoW values, because we currently use 0.001, and messages with such a low PoW can only be passed around by our geth nodes run with custom settings. I believe we'd like to bump it to at least 0.2 (which is the default value) so that most geth nodes could pass our messages around, right? It would be interesting to see how that affects CPU usage.

status-github-bot[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

status-github-bot[bot] commented 5 years ago

This issue has been automatically closed. Please re-open if this issue is important to you.