vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Dedupe events - high memory usage #20883

Open luk-ada opened 1 month ago

luk-ada commented 1 month ago


Problem

Hello

I'm streaming GCP logs from Pub/Sub and ingesting them into Logscale + Vector for log-to-metrics transformation.

It seems that Vector is not handling the Pub/Sub source very well, and I see quite a lot of duplicates, which is not acceptable to the customer. The Pub/Sub load ranges from 5k/s to 200k/s. I have two Vectors:

  1. Source: PubSub > Transformation: Dedupe (cache 5 000 000 messages) > Sink: vector
  2. Source: Vector > Transformation: Dedupe (cache 15 000 000 messages) > Sinks: humio_logs, vector

The first Vector is using ~2.5-3 GB of memory. The second is using 15 GB of RAM and slowly growing all the time.
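For reference, the first Vector's pipeline is roughly the following (a sketch only, not the exact configuration; the gcp_pubsub project and subscription values are placeholders):

sources:
  pubsub:
    type: gcp_pubsub
    project: my-gcp-project        # placeholder
    subscription: my-subscription  # placeholder
transforms:
  dedupe:
    type: dedupe
    inputs:
      - pubsub
    cache:
      num_events: 5000000
    fields:
      match: ["message_id"]
sinks:
  downstream_vector:
    type: vector
    inputs:
      - dedupe
    address: http://***:1500
    acknowledgements:
      enabled: true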

Is de-duplication working correctly? I'm using fields.match on message_id, which is a string of 17 digits (17 bytes), so according to the Memory Utilization Estimation it should use about 0.255 GB:

length("11497906994447355") == 17

17 bytes * 15,000,000 * 1e-9 = 0.255 GB

I'm sharing the configuration for the second Vector.

Configuration

api:
  address: 0.0.0.0:8686
  enabled: true
  playground: false
data_dir: /vector-data-dir
acknowledgements:
  enabled: true
sources:
  vector:
    type: vector
    address: 0.0.0.0:1500
  internal_streams_metrics_source:
    type: internal_metrics
transforms:
  dedupe:
    type: dedupe
    inputs:
      - vector
    cache:
      num_events: 15000000
    fields:
      match: ["message_id"]
sinks:
  internal_streams_metrics_sink:
    address: 0.0.0.0:9000
    default_namespace: service
    inputs:
      - internal_streams_metrics_source
    type: prometheus_exporter
    acknowledgements:
      enabled: false
  logscale_logs:
    type: humio_logs
    inputs:
      - dedupe
    endpoint: "http://***:8080"
    token: ${***}
    index: gcp
    event_type: gcp-parser
    encoding:
      codec: json
    acknowledgements:
      enabled: true
    batch:
      max_bytes: 1500000
      max_events: 1500
      timeout_secs: 1
    buffer:
      max_events: 10000
      type: memory
      when_full: block
    compression: none
  vector_metrics:
    type: vector
    inputs:
      - dedupe
    address: http://***:1500
    acknowledgements:
      enabled: true

Version

vector 0.39.0 (x86_64-unknown-linux-musl 73da9bb 2024-06-17 16:00:23.791735272)

Debug Output

No response

Example Data

Example message_id values:

11497906994447360 11497906994447359 11497906994447358 11497906994447357 11497906994447356 11497906994447355 11497906994447354 11497906994447353 11497906994447352 11497906994447351

Additional Context

No response

References

No response

jszwedko commented 1 month ago

@luk-ada thanks for this report.

To have a baseline to compare with, could you try running both Vectors without the dedupe transform and observing the memory use? I'd like to understand how much the transform may be adding over that baseline.
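For example, a sketch of that change against the configuration you shared (only the inputs keys are shown; the rest of each sink stays the same):

sinks:
  logscale_logs:
    inputs:
      - vector   # instead of: dedupe
  vector_metrics:
    inputs:
      - vector   # instead of: dedupe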

As you noted, the overhead for the keys should be relatively small for your key field, message_id. For 17-byte keys it should be about 85 MB for 5 million messages and about 255 MB for 15 million messages, for just the internal state store of the dedupe transform. There is some static overhead per key, but it should be on the order of a few bytes. I'm suspicious that the additional memory use is elsewhere in the pipeline rather than in the dedupe transform itself, but it is possible that there is a bug.
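Spelled out: 17 bytes * 5,000,000 * 1e-9 ≈ 0.085 GB (85 MB) and 17 bytes * 15,000,000 * 1e-9 ≈ 0.255 GB (255 MB), counting only the key bytes themselves.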

luk-ada commented 1 month ago

@jszwedko thank you for the response. Below you can see the RAM usage before dedupe was enabled. In this case we are talking about the first Vector, which was configured with the Pub/Sub source and the Logscale sink.

image

I restarted the second Vector on Friday; it is currently using ~10 GB of memory and still slowly, constantly growing.

image

PS: please ignore the usage above the limit - those are pod restarts.

jszwedko commented 1 month ago

Thanks @luk-ada, that is interesting. It does make it seem like the dedupe transform is causing a large increase in memory usage. Nothing jumped out when quickly reviewing the code.

I think one (or both) of two things could be helpful:

luk-ada commented 1 month ago

Hi @jszwedko

I've prepared a simple setup to reproduce the issue; please check the attached zip. It looks like 15M messages use around 5 GB of RAM. I used WSL 1 on Windows 10; my results are below.

Memory usage at 2.6M, 5.4M, 11.5M, 16M, 20M, and 25M messages, plus a run with no deduplication: images

Config + generator + vector.sh to run the setup and reproduce:

.
├── data
│   └── file
├── generate.py   # simple generator written by Copilot
├── log
├── vector.sh
└── vector.yaml

vector-dedupe.zip
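For reference, a minimal configuration along these lines reproduces the setup (a sketch that may differ in details from the attached vector.yaml; the file path, JSON decoding, and blackhole sink are assumptions):

sources:
  input_file:
    type: file
    include:
      - /path/to/data/file   # adjust to the data/file from the layout above
    decoding:
      codec: json            # assuming generate.py writes JSON lines with a message_id field
transforms:
  dedupe:
    type: dedupe
    inputs:
      - input_file
    cache:
      num_events: 15000000
    fields:
      match: ["message_id"]
sinks:
  out:
    type: blackhole
    inputs:
      - dedupe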

jszwedko commented 1 month ago

Thanks for putting this reproduction together! I haven't had a chance to look at it yet, but it should help with reproducing the issue and identifying the reason for the increased memory usage.