quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Janitor for OTEL traces on GCS throws "Cannot create buckets using a POST" error #3258

Closed Tyrion85 closed 1 year ago

Tyrion85 commented 1 year ago

Describe the bug: When using Quickwit as a Jaeger backend (OTEL traces) with Google Cloud Storage, as described here and here, the setup itself works fine. However, the Janitor throws errors:

quickwit 2023-05-03T14:34:44.633Z ERROR delete_splits_marked_for_deletion{index_id="otel-trace-v0" updated_before_timestamp=1683124364}: quickwit_janitor::garbage_collection:
Failed to delete ["01GZGZP9VM2FJ5Z4MP4PHE67EE.split", "01GZH092M1EWFTX12VG2G24F3H.split", "01GZH08KQSYASA3N9RCW3ZY4BN.split", "01GZH0CA2J2M05E3VKDE5NC7PR.split", "01GZH00KFR4XB8H61SZQRZ3NKY.split"] and 190 other splits. error=Some(StorageError { kind: InternalError, source: Request ID: None Body: <?xml version='1.0' encoding='UTF-8'?><Error><Code>InvalidArgument</Code><Message>Invalid argument.</Message><Details>Cannot create buckets using a POST.</Details></Error> 

This is bad, as no cleaning up occurs and old data piles up.

Steps to reproduce (if applicable):

Bog-standard Quickwit 0.5.0 deployment on Kubernetes (GKE), with Quickwit Helm Chart version 0.3.4.

Values.yaml:

config:
  s3:
    endpoint: https://storage.googleapis.com
    access_key: "[redacted]"
    secret_key: "[redacted]"
  default_index_root_uri: s3://[redacted]/indexes

The service account that the access/secret keys belong to has the role roles/storage.objectAdmin assigned.

Expected behavior: The Janitor cleans up old data.

Configuration: Provided above in steps to reproduce section.

  1. Output of quickwit --version: Quickwit v0.5.0 (d4be690 2023-03-17T08:50:28Z)

Are there any workarounds for this? As far as I can see, no data is being deleted.

Again, this is just a vanilla helm chart installation. Is there anything that can be configured, that I missed, which would help with this issue?

trinity-1686a commented 1 year ago

According to the GCS documentation, what Amazon calls DeleteObjects and GCS calls "Multiple object delete" is not supported by GCS. I doubt there is a way to fix this issue without modifying Quickwit. It shouldn't be too hard to fix from inside Quickwit (add a "GCS workaround" env var which loops over DeleteObject instead of using DeleteObjects). Such a change would need #3168 to land first to avoid large conflicts.
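For illustration, the proposed workaround (loop over single-object deletes instead of issuing one bulk request) could look roughly like the sketch below. This is not Quickwit's code: `FakeGcsBucket`, `delete_splits`, and the `disable_multi_object_delete` flag are hypothetical names, and the bucket is an in-memory stand-in that simply rejects the bulk call the way the log above shows.

```python
from typing import Dict, Iterable, List


class FakeGcsBucket:
    """Hypothetical in-memory bucket that rejects multi-object delete."""

    def __init__(self, keys: Iterable[str]) -> None:
        self.objects = set(keys)

    def delete_objects(self, keys: List[str]) -> None:
        # Mimics the error reported in this issue: the bulk request is rejected.
        raise RuntimeError("InvalidArgument: Cannot create buckets using a POST.")

    def delete_object(self, key: str) -> None:
        # In this sketch, deleting a missing key is a no-op.
        self.objects.discard(key)


def delete_splits(
    bucket: FakeGcsBucket,
    keys: Iterable[str],
    disable_multi_object_delete: bool = False,
) -> Dict[str, Exception]:
    """Delete keys and return a map of key -> error for deletes that failed."""
    key_list = list(keys)
    if not disable_multi_object_delete:
        # Default path: one bulk request (fails on backends without support).
        bucket.delete_objects(key_list)
        return {}
    # Workaround path: one request per object, collecting individual failures.
    failures: Dict[str, Exception] = {}
    for key in key_list:
        try:
            bucket.delete_object(key)
        except Exception as err:
            failures[key] = err
    return failures
```

With the flag left off, the call raises and nothing is deleted, which matches the symptom described above; with the flag on, each split is removed individually at the cost of one request per object.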

fmassot commented 1 year ago

@Tyrion85 thanks for the report. We are going to solve the issue for the next release planned for this month. Does this work for you?

Tyrion85 commented 1 year ago

@fmassot of course! Thank you very much!

In the interim, for curiosity's sake, would utilising GCS's lifecycle policy, to automatically delete "old" objects from a bucket work, or would it cause some unexpected issues in Quickwit? Specifically in case of OTEL traces (otel-trace-v0 index)

fmassot commented 1 year ago

> In the interim, for curiosity's sake, would utilising GCS's lifecycle policy, to automatically delete "old" objects from a bucket work, or would it cause some unexpected issues in Quickwit? Specifically in case of OTEL traces (otel-trace-v0 index)

Well, it can definitely lead to unexpected issues :). Let me explain what will happen:

  1. First, the GCS lifecycle policy deletes a split file older than X days. The split information will still be in the metastore.

  2. You make a search query on all traces... then two things could happen:

    • the split deleted from GCS is in the state MARKED_FOR_DELETION in the metastore. Quickwit sets this state when the retention policy kicks in or when the split has been merged. In this case, Quickwit will return a normal search response, as it will not query this "ghost" split.
    • the split deleted from GCS is in the state PUBLISHED... Quickwit will query this split and will return an error saying that it did not find the split.

If you make a query on only recent traces, you won't see this error. But your metastore will not be in a sane state and you will potentially run queries that may return errors.
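The two cases above can be sketched as a toy model. It assumes a metastore that maps split IDs to a state and a storage layer that is just a set of existing split IDs; `SplitState` and `search` are hypothetical names for illustration, not Quickwit's API.

```python
from enum import Enum
from typing import Dict, List, Set


class SplitState(Enum):
    PUBLISHED = "published"
    MARKED_FOR_DELETION = "marked_for_deletion"


def search(metastore: Dict[str, SplitState], storage: Set[str]) -> List[str]:
    """Return the IDs of the splits queried; fail if a published split is gone."""
    queried: List[str] = []
    for split_id, state in metastore.items():
        if state is not SplitState.PUBLISHED:
            # Splits marked for deletion are skipped, so a missing file is harmless.
            continue
        if split_id not in storage:
            # The metastore still lists the split, but the file was removed externally.
            raise FileNotFoundError(f"split {split_id} not found in storage")
        queried.append(split_id)
    return queried
```

In this model, removing a MARKED_FOR_DELETION split from `storage` leaves `search` unaffected, while removing a PUBLISHED one makes any query that touches it fail.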

Tyrion85 commented 1 year ago

makes perfect sense @fmassot thank you for the detailed explanation! 🙏🏼 was suspecting as much, but better to ask just in case 😄

fulmicoton commented 1 year ago

I have a preference for NOT using an environment variable.

Maybe we can

Granted everything sucks in its own way

guilload commented 1 year ago

Closed via #3446 and #3467.

guilload commented 1 year ago

@Tyrion85, in Quickwit 0.6, you'll be able to disable the use of the multi-object delete requests, which are not supported by GCS, by adding the following storage configuration to your node config:

storage:
  s3:
    disable_multi_object_delete_requests: true