trezor / trezord-go

:smiling_imp: Trezor Communication Daemon (written in Go)
GNU Lesser General Public License v3.0
241 stars 146 forks source link

fix: close libusb config description memory leak #254

Closed aloisklink closed 1 year ago

aloisklink commented 2 years ago

The config descriptors from the libusb c-library are not being properly closed by the golang code.

On my PC, when testing with MetaMask, this causes an approximately 10 MB per hour memory leak. This memory leak is not detected via golang's pprof, as it's in the c-library.

Closing these config descriptors using defer seems to fix this memory leak in my testing.

This seems to fix #223 (at least for me!). trezord-go now seems to pretty stable at around 50 MB.

More info

In order to find this memory leak, I first added a pprof API call to trezord-go, which I used to get performance data on the golang heap and co-routine (let me know if you'd be interested in a PR for this too).

What I found was that according to pprof, there was no memory leak.

However, according to Linux, trezord-go was still leaking ~10MB/h, so I assumed the memory leak wasn't in golang, but in the underlying C libraries, so I tried running trezord-go through valgrind to see if I could spot a memory leak there.

valgrind --leak-check=full --show-leak-kinds=all --verbose ./trezord-go

After playing around with my Trezor a bit, I pressed CTRL+C and got the following results:

==905747== 80 bytes in 2 blocks are indirectly lost in loss record 301 of 566
==905747==    at 0x483B723: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==905747==    by 0x483E017: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==905747==    by 0x7F55C6: parse_interface (descriptor.c:206)
==905747==    by 0x7F55C6: parse_configuration (descriptor.c:439)
==905747==    by 0x7F55C6: raw_desc_to_config (descriptor.c:467)
==905747==    by 0x7F63A0: libusb_get_config_descriptor (descriptor.c:629)
==905747==    by 0x7FE1E2: _cgo_0328dca10925_Cfunc_libusb_get_config_descriptor (cgo-gcc-prolog:565)
==905747==    by 0x4641DF: runtime.asmcgocall (/usr/lib/go-1.13/src/runtime/asm_amd64.s:655)
==905747==    by 0x46074B: runtime.(*mheap).alloc.func1 (/usr/lib/go-1.13/src/runtime/mheap.go:1093)
==905747==    by 0x40543: ???
==905747==    by 0x439F3F: ??? (/usr/lib/go-1.13/src/runtime/proc.go:1080)
==905747==    by 0x462893: runtime.rt0_go (/usr/lib/go-1.13/src/runtime/asm_amd64.s:220)
...
==905747== 128 bytes in 2 blocks are still reachable in loss record 326 of 566
==905747==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==905747==    by 0x7F57C4: parse_interface (descriptor.c:298)
==905747==    by 0x7F57C4: parse_configuration (descriptor.c:439)
==905747==    by 0x7F57C4: raw_desc_to_config (descriptor.c:467)
==905747==    by 0x7F62A0: libusb_get_active_config_descriptor (descriptor.c:584)
==905747==    by 0x7FE12E: _cgo_0328dca10925_Cfunc_libusb_get_active_config_descriptor (cgo-gcc-prolog:504)
==905747==    by 0x4641DF: runtime.asmcgocall (/usr/lib/go-1.13/src/runtime/asm_amd64.s:655)
==905747==    by 0x46074B: runtime.(*mheap).alloc.func1 (/usr/lib/go-1.13/src/runtime/mheap.go:1093)
==905747==    by 0x12B7: ???
==905747==    by 0x439F3F: ??? (/usr/lib/go-1.13/src/runtime/proc.go:1080)
==905747==    by 0x462893: runtime.rt0_go (/usr/lib/go-1.13/src/runtime/asm_amd64.s:220)

There were a couple of other memory leaks reported, but all that I saw had the lines libusb_get_config_descriptor or libusb_get_active_config_descriptor in them.

Figure_1

Edit: Updated graph with more results

prusnak commented 2 years ago

Wow, that's some stellar investigation there! Thank you!

@tsusanka can we please test this in QA to see the change does not introduce any regressions?

tsusanka commented 2 years ago

Great stuff, Alois, thank you! I've passed this on to QA, they have other tasks at the moment, so this might take a while, but we'll definitely circle back.

vdovhanych commented 2 years ago

Could be related to the reports of suite using a lot of memory when left running. fe. https://github.com/trezor/trezor-suite/issues/3214

prusnak commented 2 years ago

trezor/trezor-suite#3214

That issue is unrelated to this one as can be seen here (trezord.exe only occupies 34,3 MB):

aloisklink commented 2 years ago

Wow, that's some stellar investigation there! Thank you! Great stuff, Alois, thank you!

No worries! Thanks for making this open-source so I had the chance to hunt down the leak myself! It was my first time looking at golang too, it does seem a lot nicer than C-code.

And no worries on slow QA. The memory leak is pretty slow, and with modern Windows forcing automatic reboots for updates, it probably only affects Linux users that don't turn off their PCs enough.

Let me know if you want me to add a CHANGELOG.md entry or anything; there's a minor typo in my commit title (should be descriptor, not description), so I don't mind editing my commit.

https://github.com/trezor/trezor-suite/issues/3214

That issue is unrelated to this one as can be seen here

Although https://github.com/trezor/trezor-suite/issues/3214 seems to mainly be about the suite, some of the comments look related:

bosomt commented 2 years ago

@aloisklink thank you for your help.

We already received new version for testing and I already tested 3 versions on Windows 10, macOS ARM M1 and KUbuntu 22.04. So far I have not found any problems with new version. Devices are detected, discovery is working as expected, transactions are signed.

@aloisklink can you please suggest any Linux tool that can track memory consumption ? So I can compare current public version and your version values, ideally with graph results.

And one more question. Was this mem leak Linux specific or it was present on all operating systems ?

aloisklink commented 2 years ago

@aloisklink can you please suggest any Linux tool that can track memory consumption ? So I can compare current public version and your version values, ideally with graph results.

I actually struggled to find one! Most of the stuff you find online (e.g. htop) only tracks memory consumption as a % of total RAM.

For the graph above, I used ps_mem, which can be installed with pip3 install ps_mem.

Then, I used to following bash loop to log memory usage into a CSV file. I had Metamask running in the background, and it seems to call OPTIONS /listen HTTP/1.1 and POST /listen HTTP/1.1 every 5 minutes, but I'm not sure if that's related. (maybe speeding up these requests will make the leak faster?)

while true; do
  # warning, only works if you started Trezord with ./trezord-go (without `sudo`)
  echo "$(date --iso-8601=seconds),$(ps_mem -p "$(pgrep trezord-go)" --total)" | tee --append mem-usage.csv
  sleep 15m
done

I then made a simple Python script to graph the CSV files:

graph.py memory leak graph script ```python import datetime import csv import numpy as np def load_data(csv_file): with open(csv_file) as file1: memleak = [{ "date": datetime.datetime.fromisoformat(x["datetime"]), "memory_usage": int(x["memory_usage"]) } for x in csv.DictReader(file1, fieldnames=("datetime", "memory_usage"))] data = { "date": np.array([x["date"] - memleak[0]["date"] + datetime.datetime.fromisoformat("1970-01-01T00:00:00") for x in memleak]), "memory_usage": [x["memory_usage"] for x in memleak], } return data memleak = load_data("test-2022-08-06T10:49:52+01:00/mem-usage.csv") # fixed = load_data("test-2022-07-29T04:23:20+01:00/mem-usage.csv") fixed = load_data("test-2022-07-30T23:49:05+01:00/mem-usage.csv") import matplotlib.pyplot as plt fig, axs = plt.subplots(1) axs.plot("date", "memory_usage", data=memleak, label="memleak") axs.plot("date", "memory_usage", data=fixed, label="fixed") axs.legend() import matplotlib.dates as mdates axs.xaxis_date() axs.xaxis.set_major_formatter(mdates.DateFormatter("Day %d")) axs.set_xlabel("Time (start time not accurate)") from matplotlib.ticker import EngFormatter formatter1 = EngFormatter(unit="B", places=1, sep="\N{THIN SPACE}") # U+2009 axs.yaxis.set_major_formatter(formatter1) axs.set_ylabel("trezord-go memory usage (from Linux)") plt.show() ```

Was this mem leak Linux specific or it was present on all operating systems ?

I only use Linux, so I can't confirm if this leak exists on MacOS/Windows. But from looking at the code, I think the memory leak is on all operating systems. Also, I found these comments from Windows/MacOS users that seem to indicate a memory leak in trezord:

My testing methodology

The leak is pretty small, ~10MB per hour for me, so you'd probably have to run tests overnight/over multiple days to spot it :)

My testing methodology was to start ./trezord-go, start logging memory usage with the above bash script, then open up Metamask and sign some stuff using my Trezor T, then leave Metamask open for days (I also increased the Auto-Lock Timer of Metamask to a big number, which is unsecure if you don't use a hardware wallet).

You should see some log entries every 5 minutes that look like 127.0.0.1 - - [25/Feb/2022:08:23:03 +0000] "OPTIONS /listen HTTP/1.1"

If you're testing on Windows, please make sure to temporarily disable Automatic Windows Updates so it's doesn't reboot your machine overnight (although on new versions of Windows, it's super complicated to disable!)

tsusanka commented 1 year ago

Preliminary testing has not found any issues with this whatsoever and it also seems like no memory is leaking 🎉 . We will continue testing this for a longer period of time, but we will merge it now as it seems good to go. We need to solve some build issues (https://github.com/trezor/trezord-go/pull/258), so we will test altogether then again before releasing.

Thank you for a great contribution!

aloisklink commented 1 year ago

Preliminary testing has not found any issues with this whatsoever and it also seems like no memory is leaking

Awesome sauce :rocket:

Still todo is write a CHANGELOG.md entry, but I'll let you guys figure out that out, since I don't know if this is a Windows/MacOS issue too!

Thank you for a great contribution!

And thank you for supporting open-source and for supporting Linux as an OS :smile:

STew790 commented 1 year ago

QA OK We tested ver 31 and 32 on two separate notebooks with Linux running for about 2 days and no visible memory leak was found. Also as mentioned above (https://github.com/trezor/trezord-go/pull/254#issuecomment-1235351785), we are using the ver 32 on our testing machines (for almost a month now) and no memory leaks were found as well.

Thanks for the contribution! :rocket: