google / grr

GRR Rapid Response: remote live forensics for incident response
https://grr-doc.readthedocs.io/
Apache License 2.0

GRR Client Crashes "Serialized message too large" #978

Open bprykhodchenko opened 2 years ago

bprykhodchenko commented 2 years ago

Environment

Describe the issue
When I do a Memory Dump of all processes except GRR, it works fine for some time, but at some point I get this message:

CRITICAL:2022-05-30 11:50:07,252 fleetspeak_client:117] Fatal error occurred:
Traceback (most recent call last):
  File "site-packages\grr_response_client\fleetspeak_client.py", line 111, in _RunInLoop
  File "site-packages\grr_response_client\fleetspeak_client.py", line 209, in _SendOp
  File "site-packages\grr_response_client\fleetspeak_client.py", line 176, in _SendMessages
  File "site-packages\fleetspeak\client_connector\connector.py", line 144, in Send
  File "site-packages\fleetspeak\client_connector\connector.py", line 154, in _SendImpl
ValueError: Serialized message too large, size must be at most 2097152, got 2323650

So it doesn't like the message size. Now the question: where can this limit be increased?

ITPro17 commented 2 years ago

Yes, I know the solution.

  1. You need to downgrade the kernel of your OS to 16
  2. Remove the database you have. Instead, install the Hadoop cluster.
  3. Connect GRR to Hadoop

It should fix the issue.

max-vogler commented 2 years ago

Thanks for your report. This looks like a legit issue on the GRR client side, we'll look into it. Increasing this limit on the client side likely creates more problems on the server side, so changing the chunking logic or similar is probably the way forward.

bprykhodchenko commented 2 years ago

> Thanks for your report. This looks like a legit issue on the GRR client side, we'll look into it. Increasing this limit on the client side likely creates more problems on the server side, so changing the chunking logic or similar is probably the way forward.

So I tried to decrease the chunk size to 2000000, which is less than the agent is able to receive, and the same issue occurred:

CRITICAL:2022-06-01 10:20:24,761 fleetspeak_client:117] Fatal error occurred:
Traceback (most recent call last):
  File "site-packages\grr_response_client\fleetspeak_client.py", line 111, in _RunInLoop
  File "site-packages\grr_response_client\fleetspeak_client.py", line 209, in _SendOp
  File "site-packages\grr_response_client\fleetspeak_client.py", line 176, in _SendMessages
  File "site-packages\fleetspeak\client_connector\connector.py", line 144, in Send
  File "site-packages\fleetspeak\client_connector\connector.py", line 154, in _SendImpl
ValueError: Serialized message too large, size must be at most 2097152, got 2579672

So it is definitely something to be fixed in GRR.

mbushkov commented 2 years ago

Ok, so what happens here is pretty interesting. The issue, most definitely, happens on the client side and has nothing to do with how the server database is set up.

When working through Fleetspeak, the GRR client runs as a subprocess of the Fleetspeak client. They communicate through shared file descriptors. When the GRR client wants to send a message to its server, it sends that message to the Fleetspeak client on the same machine through the shared fd. Now, the Fleetspeak client has a hard message size limit of 2 MB: https://github.com/google/fleetspeak/blob/93b2b9a40808306722875abbd5434af4634c6531/fleetspeak/src/client/channel/channel.go#L32
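
For illustration only, here is a minimal sketch (with assumed names, not the actual Fleetspeak source) of the kind of size check the client connector performs before writing a message to the shared fd:

```python
# A minimal sketch of a pre-send size check, assuming a 2 MiB hard limit like
# the one enforced in fleetspeak/client_connector/connector.py.
# The names here (MAX_SIZE, _send_impl) are illustrative, not the real
# Fleetspeak API.
MAX_SIZE = 2 * 1024 * 1024  # 2097152 bytes


def _send_impl(pipe, serialized_message: bytes) -> None:
    if len(serialized_message) > MAX_SIZE:
        raise ValueError(
            "Serialized message too large, size must be at most "
            f"{MAX_SIZE}, got {len(serialized_message)}")
    # Simplified length-prefixed write to the shared file descriptor.
    pipe.write(len(serialized_message).to_bytes(4, "little"))
    pipe.write(serialized_message)
```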

The issue happens because GRR tries to send a message that's bigger than 2 MB. There's a dedicated check for this in the GRR client's Fleetspeak connector code (MAX_SIZE is set to 2 MB): https://github.com/google/fleetspeak/blob/master/fleetspeak_python/fleetspeak/client_connector/connector.py#L151

GRR should be careful enough to chunk the messages. Not sure why chunking failed in this case - will investigate further.

@bprykhodchenko Could you please specify the exact flow arguments you used to reproduce the issue?

mbushkov commented 2 years ago

I looked at the YaraProcessDump client action. It dumps the memory to disk and then sends back a data structure with information about all the processes: https://github.com/google/grr/blob/master/grr/client/grr_response_client/client_actions/memory.py#L767

What this means: if the result proto is larger than 2 MB in serialized form, the client action will fail. If the machine has a lot of memory and a lot of processes, then growing over 2 MB is likely. We need to look into either chunking this data structure or raising the message size limit.
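
As a rough sketch of the chunking direction (the record type, the safety margin, and the helper name below are assumptions, not GRR's actual protos or API), splitting one oversized result into several smaller messages could look like this:

```python
# Sketch of chunking a large list of serialized per-process records into
# several batches, each staying below the 2 MiB Fleetspeak message limit.
from typing import Iterable, Iterator, List

MAX_SIZE = 2 * 1024 * 1024
SAFETY_MARGIN = 64 * 1024  # leave headroom for envelope/metadata


def chunk_records(records: Iterable[bytes],
                  limit: int = MAX_SIZE - SAFETY_MARGIN) -> Iterator[List[bytes]]:
    """Groups serialized records so each group stays under `limit` bytes."""
    batch: List[bytes] = []
    batch_size = 0
    for record in records:
        if batch and batch_size + len(record) > limit:
            yield batch
            batch, batch_size = [], 0
        batch.append(record)
        batch_size += len(record)
    if batch:
        yield batch
```

Each yielded batch would then be wrapped in its own response message, so no single serialized message exceeds the connector's limit.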

bprykhodchenko commented 2 years ago

Hello, as for your question: I just ran the YARA Memory Dump from the UI; I do not use the CLI to pass specific command-line arguments.

As for the solution,

  1. I was changing the chunk size to something smaller than the client can "eat" (in the flow parameters), but I was running into the same issue.
  2. Should I wait for a fixed version, OR
  3. Should I download the source code, change MAX_SIZE in the connector.py file to, say, 4 MB, and install the server from source?

mbushkov commented 2 years ago

A few comments:

  1. The issue is related to how many processes you dump at the same time with a single flow. The GRR client tries to send a data structure with a map of memory regions to the server, and if this data structure is too big, you get the failure. One workaround is to run 2 flows with process regexes: one matching processes with names from a to k, and the other matching processes with names from l to z (see the sketch after this list). That will likely help.
  2. The right fix is to make the YaraProcessDump client action chunk its output. I will look into this next week - unfortunately, I can't provide an ETA until I start working on it.
  3. Changing MAX_SIZE on the GRR side is only part of the solution. The 2 MB limit is also hardcoded on the Fleetspeak client side. Fleetspeak is written in Go and is shipped with GRR in binary form (see https://pypi.org/project/fleetspeak-client-bin/). You'd need to recompile the fleetspeak-client Go binary and replace the fleetspeak-client-bin package in order for the fix to work. It's not exactly straightforward, but if you're feeling adventurous, you can try it. Here's a pointer to the relevant place in the Fleetspeak code: https://github.com/google/fleetspeak/blob/master/fleetspeak/src/client/channel/channel.go#L48
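
To make the workaround in point 1 concrete (a sketch only; the regexes and process names are made up for illustration, and process_regex behaviour should be checked against your GRR version), the idea is simply to split the process name space so each flow returns a smaller result:

```python
import re

# Hypothetical process_regex values for splitting one memory-dump flow into
# two, so each flow's result structure stays well under the 2 MiB limit.
FLOW_1_REGEX = r"(?i)^[a-k]"   # processes whose names start with a-k
FLOW_2_REGEX = r"(?i)^[l-z]"   # processes whose names start with l-z

process_names = ["chrome.exe", "explorer.exe", "lsass.exe", "svchost.exe"]
print([p for p in process_names if re.match(FLOW_1_REGEX, p)])
# ['chrome.exe', 'explorer.exe']
print([p for p in process_names if re.match(FLOW_2_REGEX, p)])
# ['lsass.exe', 'svchost.exe']
```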