kolide / launcher

Osquery launcher, autoupdater, and packager
https://kolide.com/launcher
Other
505 stars 100 forks source link

USB hub / sound issues while launcher is running (Ubuntu) #1300

Open sechmann opened 1 year ago

sechmann commented 1 year ago
1. What version of launcher are you using (launcher --version)?
    % /usr/local/kolide-k2/bin/launcher --version
    {"caller":"main.go:30","msg":"Launcher starting up","revision":"7948f44625ae0dbbae1f278d652c7f53b8d79e8b","severity":"info","ts":"2023-08-24T08:02:57.30863322Z","version":"1.0.8"}
    launcher - version 1.0.8
      branch:       HEAD
      revision:     7948f44625ae0dbbae1f278d652c7f53b8d79e8b
      build date:   2023-05-08
      build user:    (runner)
      go version:   go1.20.4

2. What operating system are you using?
    Ubuntu 23.04
3. What did you do?
    Installed / ran the launcher
4. What did you expect to happen?
    Nothing
5. What happened instead?
    See below

I don't really know how these issues relate to each other, but i've experienced this on 2 different (Lenovo) laptops now.

After i install the Kolide launcher i immediately start experiencing issues with an external USB sound card (connected through a USB dock).

dmesg spits out the following lines each time the audio is crackling.

[186469.997814] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[186469.997820] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000127187e00 trb-start 0000000127187e10 trb-end 0000000127187e10 seg-start 0000000127187000 seg-end 0000000127187ff0
[186470.164813] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[186470.164820] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000127188880 trb-start 0000000127188890 trb-end 0000000127188890 seg-start 0000000127188000 seg-end 0000000127188ff0
[186470.289827] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[186470.289833] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000127187060 trb-start 0000000127187070 trb-end 0000000127187070 seg-start 0000000127187000 seg-end 0000000127187ff0

sudo systemctl stop launcher.kolide-k2.service solves the issue immediately.

Plugging the USB sound card directly to the laptop also solves the issue, however there seems to be something weird with USB docks + Kolide launcher.

directionless commented 1 year ago

That's pretty weird all right. I'm hard pressed to understand how it's happening, but I don't know much about the underlying linux usb systems.

If you're up for experimenting, launcher runs a couple of commands to determine what users to launch desktop menus as. I wonder if they're doing stuff that gets in the way. Launcher does something like:

  1. For each user
  2. loginctl show-user $uid --value --property=Sessions
  3. for each session
  4. loginctl show-session $session --value --property=Type
sechmann commented 1 year ago

Agreed on the weirdness :sweat_smile:. I'm up for experimenting.

The commands you're mentioning give the following output (and running them does not cause any sound issues, if that was what you were wondering).

% for u in $(cat /etc/passwd | cut -d ':' -f 3); do
        for s in $(loginctl show-user $u --value --property=Sessions); do
                echo "user ($u) session ($s): $(loginctl show-session "$s" --value --property=Type)"
        done
done
Failed to get user: User ID 0 is not logged in or lingering
Failed to get user: User ID 1 is not logged in or lingering
Failed to get user: User ID 2 is not logged in or lingering
Failed to get user: User ID 3 is not logged in or lingering
Failed to get user: User ID 4 is not logged in or lingering
Failed to get user: User ID 5 is not logged in or lingering
Failed to get user: User ID 6 is not logged in or lingering
Failed to get user: User ID 7 is not logged in or lingering
Failed to get user: User ID 8 is not logged in or lingering
Failed to get user: User ID 9 is not logged in or lingering
Failed to get user: User ID 10 is not logged in or lingering
Failed to get user: User ID 13 is not logged in or lingering
Failed to get user: User ID 33 is not logged in or lingering
Failed to get user: User ID 34 is not logged in or lingering
Failed to get user: User ID 38 is not logged in or lingering
Failed to get user: User ID 39 is not logged in or lingering
Failed to get user: User ID 42 is not logged in or lingering
Failed to get user: User ID 65534 is not logged in or lingering
Failed to get user: User ID 998 is not logged in or lingering
Failed to get user: User ID 997 is not logged in or lingering
Failed to get user: User ID 100 is not logged in or lingering
Failed to get user: User ID 101 is not logged in or lingering
Failed to get user: User ID 996 is not logged in or lingering
Failed to get user: User ID 102 is not logged in or lingering
Failed to get user: User ID 103 is not logged in or lingering
Failed to get user: User ID 995 is not logged in or lingering
Failed to get user: User ID 104 is not logged in or lingering
Failed to get user: User ID 105 is not logged in or lingering
Failed to get user: User ID 106 is not logged in or lingering
Failed to get user: User ID 107 is not logged in or lingering
Failed to get user: User ID 108 is not logged in or lingering
Failed to get user: User ID 109 is not logged in or lingering
Failed to get user: User ID 110 is not logged in or lingering
Failed to get user: User ID 111 is not logged in or lingering
Failed to get user: User ID 112 is not logged in or lingering
Failed to get user: User ID 113 is not logged in or lingering
Failed to get user: User ID 114 is not logged in or lingering
Failed to get user: User ID 115 is not logged in or lingering
Failed to get user: User ID 116 is not logged in or lingering
Failed to get user: User ID 117 is not logged in or lingering
Failed to get user: User ID 118 is not logged in or lingering
Failed to get user: User ID 119 is not logged in or lingering
Failed to get user: User ID 120 is not logged in or lingering
Failed to get user: User ID 121 is not logged in or lingering
Failed to get user: User ID 122 is not logged in or lingering
Failed to get user: User ID 123 is not logged in or lingering
user (1000) session (2): wayland
sechmann commented 1 year ago

I tried enabling debug logging to see if i could correlate the sound issues with any specific event from the launcher logs.

After some minutes of tailing the log there seems to always be a crackling noise preceeding the PublishLogs method (by the amount of time the log says it took).

(don't mind the time difference in the logs - dmesg get's out of sync when laptop is suspended)

i hear a crack, dmesg:
[Sat Aug 26 02:21:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:21:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9b280 trb-start 0000000123d9b290 trb-end 0000000123d9b290 seg-start 0000000123d9b000 seg-end 0000000123d9bff0
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9ba60 trb-start 0000000123d9ba70 trb-end 0000000123d9ba70 seg-start 0000000123d9b000 seg-end 0000000123d9bff0
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9c740 trb-start 0000000123d9c750 trb-end 0000000123d9c750 seg-start 0000000123d9c000 seg-end 0000000123d9cff0

about 1 sec later i see this in the launcher log:
Aug 25 09:40:58 vegar-t14 launcher[873523]: {"caller":"teelogger.go:25","err":null,"errcode":"","logType":"string","log_count":1,"message":"","method":"PublishLogs","reauth":false,"session_pid":873523,"severity":"debug","took":"686.816355ms","ts":"2023-08-25T07:40:58.786047513Z","uuid":"-snip just in case-"}
I hear another crack, dmesg:
[Sat Aug 26 02:21:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:21:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9b280 trb-start 0000000123d9b290 trb-end 0000000123d9b290 seg-start 0000000123d9b000 seg-end 0000000123d9bff0
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9ba60 trb-start 0000000123d9ba70 trb-end 0000000123d9ba70 seg-start 0000000123d9b000 seg-end 0000000123d9bff0
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:22:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9c740 trb-start 0000000123d9c750 trb-end 0000000123d9c750 seg-start 0000000123d9c000 seg-end 0000000123d9cff0
[Sat Aug 26 02:23:41 2023] xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 5 comp_code 1
[Sat Aug 26 02:23:41 2023] xhci_hcd 0000:00:14.0: Looking for event-dma 0000000123d9c100 trb-start 0000000123d9c110 trb-end 0000000123d9c110 seg-start 0000000123d9c000 seg-end 0000000123d9cff0

3 seconds later this pops up in launcher log:
Aug 25 09:42:01 vegar-t14 launcher[873523]: {"caller":"teelogger.go:25","err":null,"errcode":"","logType":"string","log_count":1,"message":"","method":"PublishLogs","reauth":false,"session_pid":873523,"severity":"debug","took":"3.172595693s","ts":"2023-08-25T07:42:01.278641556Z","uuid":"-snip just in case-"}

The issues can also (rarely) happen with no logs being printed from the kolide launcher. However they stop completely when i stop the launcher so I think it's at least correlated.

directionless commented 1 year ago

Hrm. Mostly I'm guessing based on what feels like it might be right.

Under the hood, launcher is running those loginctl commands every few seconds. Can you wrap your for loop with something that runs it ever 2 seconds?

Log sending is interesting. There are a couple steps in there:

  1. Osquery generates logs
  2. and sends them over a thrift socket to launcher
  3. launcher writes them into a boltdb
  4. another go routine reads them out of boltdb
  5. batches them
  6. and sends them to our SaaS

It's a big of a hack, and I would not suggest running this way, but thinking about how to test things... You adjust the interval launcher tries to send logs at. This is the logging_interval option. I don't remember what the default is, but it should accept anything between 5 seconds and 10 minutes.

sechmann commented 1 year ago

Running the loginctl in a loop does not produce any issues here.

I also tried running it with GNU parallel (spawning 100 processes) and 1000 iterations per loop. In other words i ran the loop 100 000 times. System load at 55 and all 12 cores at 100% for 15 minutes - no sound issues at all.

Tried a few different logging_interval settings. Setting it to 10m made the sound cracks less frequent, and they always happened when logging occurred.

Setting it to 10s made the sound cracks more frequent, but they did not always happen at the same time as logging. Also, quite often i got no logs for several minutes (during that time cracking sounds still occurred). The RequestConfig method however seems to always produce a large(r) crackling noise.

directionless commented 1 year ago

Wow, that sounds like a pretty solid test of the loginctl theory. Thank you

I'm at a loss, since the code path in question isn't near any kind of dbus or linux things. It's got socket activity, network activity, CPU, and disk. But linux should handle those... I suspect we'll need to try to reproduce it and poke some things.

You mentioned Ubuntu 23.04, I assume it's current on the packages? Which kernel package?

Were you running launcher prior to 23.04?

sechmann commented 1 year ago

Were you running launcher prior to 23.04?

Yes

I am currently experiencing the same sound cracks on a Dell laptop with Ubuntu 22.04, using the same USB hub and sound card. Both systems are current on the packages.

Kernels in use: 23.04:

% uname -r
6.2.0-31-generic

22.04:

% uname -r
6.0.0-1012-oem
trapgate commented 11 months ago

I'm experiencing this problem too, on a Lenovo Carbon X1 running a current version of Arch. The launcher version is slightly newer:

launcher - version 1.1.2
  branch:   HEAD
  revision:     ebbf392d7867bc8427840b8d5de9ccf629c8647e
  build date:   2023-10-04
  build user:   (runner)
  go version:   go1.21.1

Like the original reporter, there's only a single user logged in, and it's a wayland session. I'm also running pipewire, which may be different from the current state of the audio stack on Ubuntu. I don't see any 'dmesg' output when the crackling occurs.

I was guessing that the crackling might be caused by buffer underruns on the audio device, so I tried increasing the buffer size using pw-metadata -n settings $dev-id clock.force-quantum $increasingly-large-values, but this didn't have any effect on the crackling.

RebeccaMahany commented 11 months ago

Just a quick update that I haven't been able to reproduce this yet -- neither the crackling nor the corresponding dmesg logs. I'm on a HP laptop running Ubuntu 23.04 (kernel is 6.2.0-34-generic), using a logitech H340 USB headset, since we'd previously seen this issue or similar reported with that model of headset plugged directly into the laptop. Going to try to pick up a USB hub + sound card next to see if that lets me reproduce the issue.

directionless commented 11 months ago

The code relating to log sending is pretty simple. It starts around https://github.com/kolide/launcher/blob/59ad0a639927f3cb838188b9d92bc381f0836566/pkg/osquery/extension.go#L647

Start a bboltdb transaction, ingest logs into an array, send the array over the network to our servers. I don't think there's anything unusual in those. I expect bboltdb is a bunch of file io, and network is network.

chaudum commented 11 months ago

I experience the same crackling on a Lenovo ThinkPad X1 Carbon Gen 9 running Debian with kernel 6.5 and Sway window manager.

The external USB audio interface is a Motu M4.

$ /usr/local/kolide-k2/bin/launcher -version
{"caller":"main.go:30","msg":"Launcher starting up","revision":"d30592bbcec072fd6988d4e2346b1765f705084b","severity":"info","ts":"2023-10-19T07:11:56.876186415Z","version":"0.11.22"}
{"caller":"main.go:33","msg":"Launcher starting up","revision":"ebbf392d7867bc8427840b8d5de9ccf629c8647e","severity":"info","ts":"2023-10-19T07:11:56.889465862Z","version":"1.1.2"}
launcher - version 1.1.2
  branch:   HEAD
  revision:     ebbf392d7867bc8427840b8d5de9ccf629c8647e
  build date:   2023-10-04
  build user:    (runner)
  go version:   go1.21.1
$ uname -r
6.5.0-1-amd64
$ pipewire --version
pipewire
Compiled with libpipewire 0.3.82
Linked with libpipewire 0.3.82

I don't see any messages in dmesg, however, the crackling happens when CPU usage of the kolide process spikes - which happens irregularly, but can be observed when running top. I assume there is some OSquery that interrupts other system processes, even though they have higher rt priority (such as audio processes).

directionless commented 11 months ago

I assume there is some OSquery that interrupts other system processes, even though they have higher rt priority (such as audio processes).

I'm not sure what would do that.

sechmann commented 7 months ago

Would just like to say that I'm still experiencing this, and it's quite annoying. Any chance to get this fixed?

RebeccaMahany commented 7 months ago

Hi @sechmann, I'm sorry for the delay on the follow-up for this. Could I get the exact brand/model for your USB dock and external USB sound card? I haven't been able to reproduce this issue yet, but I get that it's super annoying and would like to get it resolved!

sechmann commented 6 months ago

@RebeccaMahany It's this one: https://www.aten.com/global/en/products/usb-solutions/docks-and-switches/us434/