vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Better support for large S3 batches #3829

Open lukesteensen opened 4 years ago

lukesteensen commented 4 years ago

The current iteration of our S3 sink is built for sending small-to-medium size objects relatively frequently. This is apparent in a few aspects of the config:

This is in contrast with some other tools like fluentd that will create one file per hour by default. While the different approaches have their pros and cons, it would be a benefit if Vector smoothly supported the type of behavior users may be accustomed to from those other tools.

While users could technically just increase Vector's batch sizes and timeouts to approach the desired behavior, the implementation is not designed for it and it's not clear how well it would perform. There are a few things we could explore doing that would potentially help performance and resource usage:

These each add a significant degree of complexity (mostly around state management across restarts), but they would decouple memory use from the desired S3 object size and likely provide smoother operation.
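
For reference, the "just increase Vector's batch sizes and timeouts" approach mentioned above might look roughly like the sketch below (option names follow the aws_s3 sink docs; the input, bucket, and values are placeholders), with the caveat that an entire batch is held in memory until it flushes:

    [sinks.hourly_s3_archive]
      type = "aws_s3"
      inputs = ["app_logs"]              # placeholder upstream component
      bucket = "example-archive-bucket"  # placeholder bucket
      region = "us-east-1"
      key_prefix = "logs/%Y/%m/%d/%H/"   # one prefix per hour, fluentd-style
      compression = "gzip"
      # Flush roughly once per hour; everything accumulated in that window
      # stays in memory until the batch is written.
      batch.timeout_secs = 3600
      batch.max_bytes = 1073741824       # ~1 GiB cap so memory use is still bounded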

jszwedko commented 3 years ago

Missed this when I opened https://github.com/timberio/vector/issues/8787. Copying my thoughts from there over here:

Currently we build up entire batches of events before sending them all at once to AWS. This works well with our current architecture (in particular the BatchedSink), but it typically introduces the need to use a disk buffer for end-to-end guarantees if the sink is linked to a source that holds resources open while awaiting acknowledgement.

For example, if you are using the http source with the aws_s3 sink, a batch timeout of 300s, and end-to-end acknowledgements enabled, you could be holding HTTP requests open for up to 300s while the events from those requests sit in a batch inside Vector. That would typically not be tenable.
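
A minimal sketch of that scenario, with placeholder names and the acknowledgement option spelled per the current docs:

    [sources.in_http]
      type = "http"
      address = "0.0.0.0:8080"

    [sinks.out_s3]
      type = "aws_s3"
      inputs = ["in_http"]
      bucket = "example-bucket"           # placeholder bucket
      region = "us-east-1"
      batch.timeout_secs = 300            # a batch can accumulate for up to 5 minutes
      acknowledgements.enabled = true     # the HTTP request stays open until its events are acknowledged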

Instead, if we used multipart uploads, I think we could upload each batch in smaller pieces, with a timeout or size threshold smaller than the overall batch thresholds.

fpytloun commented 2 years ago

We are facing a very similar issue, so instead of creating a new one, I'll extend this one.

We have very high cardinality when constructing key_prefix, and we are flushing into per-minute buckets. Even when disk buffering is used, batches are built up in memory, which can easily result in an OOM.

With a setup like the one below it works fine, since it flushes fast enough and memory usage is bounded by batch.max_bytes, but it produces far too many small objects, which in turn introduces extreme overhead for consumers.

    [sinks.out_kafka_access_s3]
      type = "aws_s3"
      inputs = ["remap_kafka_access"]
      bucket = "access-logs"
      region = "us-west-2"
      key_prefix = "topics/{{ _topic }}/year=%Y/month=%m/day=%d/hour=%H/minute=%M/"
      # Write at least every minute
      batch.timeout_secs = 60
      request.concurrency = 2000
      request.rate_limit_num = 2000
      buffer.type = "memory"
      buffer.max_events = 3000    # default 500 with memory buffer
      #buffer.type = "disk"
      #buffer.max_size = 3000
      # NOTE: Setting a batch size limit is crucial for memory usage. With
      # high key_prefix cardinality, e.g. access logs where we write per
      # topic and have 3k+ topics, size it roughly as:
      #   batch.max_bytes = mem_limit_MB / (topics / vector_instances)
      batch.max_bytes = 1049000  # default 10MB
      encoding.codec = "ndjson"
      compression = "gzip"
      filename_extension = "json.gz"
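
As a purely hypothetical reading of that sizing rule (numbers invented): a 3 GB batch-memory budget on a single instance handling 3000 topic prefixes gives 3000 MB / (3000 / 1) = 1 MB per batch, i.e. the same order of magnitude as the ~1 MB batch.max_bytes used above.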

My proposal is to avoid in-memory batching and instead prepare batches on disk as part of disk buffering. That would allow uploading once a minute, and neither cardinality nor object size would matter. (https://github.com/vectordotdev/vector/issues/10392)

Multipart uploads might possibly solve it as well, but I think the first option will scale better in this case.

jszwedko commented 2 years ago

πŸ‘ thanks @fpytloun . Disk-based batching is a neat idea that I don't think has been considered yet.

fpytloun commented 2 years ago

I created a wrapper for s3sync, using the file sink with an exec source: https://gist.github.com/fpytloun/155a975e1491d39a2b71647ca923a11a

It seems to be working pretty well and solves the issues with high cardinality, etc.
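
The shape of the workaround, roughly (the input, paths, and the sync command are placeholders; the gist contains the actual wrapper), is a file sink spooling batches to disk plus a scheduled exec source that runs the uploader:

    [sinks.to_disk]
      type = "file"
      inputs = ["remap_kafka_access"]    # placeholder input
      path = "/var/lib/vector/spool/topics/{{ _topic }}/%Y%m%d%H%M.ndjson"
      encoding.codec = "ndjson"

    [sources.s3_sync]
      type = "exec"
      mode = "scheduled"
      scheduled.exec_interval_secs = 60
      # Placeholder command; the gist's wrapper invokes s3sync to upload
      # completed files to the bucket and remove them locally afterwards.
      command = ["/usr/local/bin/s3sync-wrapper.sh", "/var/lib/vector/spool"]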

tim-klarna commented 2 years ago

@fpytloun that's really neat!

fpytloun commented 2 years ago

This was closed automatically; it should probably be reopened until it's implemented?

spencergilbert commented 2 years ago

Indeed πŸ˜…

max3-05 commented 1 year ago

The issue seems pretty old. Is there any hope that it will be resolved? Thank you!

ee07b415 commented 8 months ago

Hi, we saw some behavior around S3 batching that I'm not sure is related to this issue, with this config:

  my_sink_id:
    type: aws_s3
    batch:
      max_bytes: 41943040
      max_events: 1000000
    buffer:
      type: "memory"
      max_events: 1000000

I was hoping to see at least 1M rows per log object sent to S3, but in the end each object always contained fewer than 50k rows, and if I set the buffer max_events to less than 50k, the output file shrinks to match. So there seems to be another limit in the batch/buffer settings that keeps the files we write to S3 from growing beyond ~50k rows.

jszwedko commented 8 months ago

> Hi, we saw some behavior around S3 batching that I'm not sure is related to this issue, with this config:
>
>     my_sink_id:
>       type: aws_s3
>       batch:
>         max_bytes: 41943040
>         max_events: 1000000
>       buffer:
>         type: "memory"
>         max_events: 1000000
>
> I was hoping to see at least 1M rows per log object sent to S3, but in the end each object always contained fewer than 50k rows, and if I set the buffer max_events to less than 50k, the output file shrinks to match. So there seems to be another limit in the batch/buffer settings that keeps the files we write to S3 from growing beyond ~50k rows.

Is it possible that you are hitting one of the other batch limits: either the byte limit you have configured or the default timeout of 300s? Also, the default key_prefix partitions objects per day, so it will write one file per day at minimum, regardless.
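
To make that interplay concrete, a sketch along these lines shows where each limit comes in (values illustrative; the hourly prefix is just one way to override the per-day default):

    [sinks.my_sink_id]
      type = "aws_s3"
      inputs = ["some_input"]            # placeholder input
      bucket = "example-bucket"
      region = "us-east-1"
      # A batch is flushed as soon as the FIRST of these limits is reached:
      batch.max_bytes = 41943040         # ~40 MB of event data
      batch.max_events = 1000000
      batch.timeout_secs = 300           # default timeout; lower it for more frequent, smaller objects
      # Objects are also partitioned by key_prefix; an hourly prefix like this
      # produces one partition per hour instead of the per-day default.
      key_prefix = "date=%F/hour=%H/"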

fcoelho commented 8 months ago

I wanted to mention that I had a similar experience when setting up the S3 sink, but it seemed to be mostly related to https://github.com/vectordotdev/vector/issues/10020. With batch.max_bytes set to 50M, I was able to see Vector's resident memory grow and shrink by the configured 50 MB, but each batch in S3 was only around ~13 MB uncompressed. I have far fewer than 50k events in each batch, so the comment above is most likely hitting the max_bytes limit first.

vladimirfx commented 8 months ago

> I wanted to mention that I had a similar experience when setting up the S3 sink, but it seemed to be mostly related to #10020. With batch.max_bytes set to 50M, I was able to see Vector's resident memory grow and shrink by the configured 50 MB, but each batch in S3 was only around ~13 MB uncompressed. I have far fewer than 50k events in each batch, so the comment above is most likely hitting the max_bytes limit first.

I observe the same behavior: a 50 MB batch.max_bytes results in ~4 MB raw log files (encoding.codec = raw_message). With the default setting (10 MB), log files are around 400 KB.

S3 is by nature designed for large objects, so archives this small overload the metadata layer of on-premise S3 deployments and increase costs for cloud S3. Waiting for file-based batching.

fpytloun commented 8 months ago

@fcoelho @vladimirfx Unfortunately, the only workaround is still to use the file sink with my s3sync wrapper: https://gist.github.com/fpytloun/155a975e1491d39a2b71647ca923a11a

We have been using it in production and it has worked well for more than a year.

vladimirfx commented 8 months ago

> @fcoelho @vladimirfx Unfortunately, the only workaround is still to use the file sink with my s3sync wrapper: https://gist.github.com/fpytloun/155a975e1491d39a2b71647ca923a11a
>
> We have been using it in production and it has worked well for more than a year.

Thank you! We had tried to avoid a workaround, but it seems we should implement your recipe.

fpytloun commented 8 months ago

> > @fcoelho @vladimirfx Unfortunately, the only workaround is still to use the file sink with my s3sync wrapper: https://gist.github.com/fpytloun/155a975e1491d39a2b71647ca923a11a We have been using it in production and it has worked well for more than a year.
>
> Thank you! We had tried to avoid a workaround, but it seems we should implement your recipe.

I also tried to come up with a better solution, but without success. I am wondering if there might be some other option, like streaming logs into AWS SQS or another service (Kinesis) and then processing them into S3, maybe with Lambda or natively if the service supports it. If you are willing to investigate this option, I am very interested in your findings.

vladimirfx commented 8 months ago

> > > @fcoelho @vladimirfx Unfortunately, the only workaround is still to use the file sink with my s3sync wrapper: https://gist.github.com/fpytloun/155a975e1491d39a2b71647ca923a11a We have been using it in production and it has worked well for more than a year.
> >
> > Thank you! We had tried to avoid a workaround, but it seems we should implement your recipe.
>
> I also tried to come up with a better solution, but without success. I am wondering if there might be some other option, like streaming logs into AWS SQS or another service (Kinesis) and then processing them into S3, maybe with Lambda or natively if the service supports it. If you are willing to investigate this option, I am very interested in your findings.

Unfortunately, S3 is a far more common protocol than any kind of messaging. We use GCP, Backblaze, and on-premise S3 storage for log archives, so there is no general solution (maybe Kafka, but I prefer the workaround instead).

mozhi-bateman commented 3 weeks ago

+1, need multipart upload support for S3 and GCS.