vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Add support to Azure blob sink to configure `Content-Encoding` and `Content-Type` #21795

Open heshanperera-alert opened 3 days ago

heshanperera-alert commented 3 days ago

Problem

When using the azure_blob sink to upload log files in gzip format, Vector adds a `Content-Encoding` header. When we try to download and extract the gzip file, we run into a file-corrupted error. However, when we manually remove the `Content-Encoding` header from the blob and then download the file, everything works as expected. There doesn't seem to be a way to remove this header through the configuration. What should we do? The file properties shown in the Azure portal are as follows.

[Image: blob properties in the Azure portal]

Configuration

No response

Version

0.37.1

Debug Output

vector  | 2024-11-13T21:13:00.214520Z DEBUG sink{component_kind="sink" component_id=azstorage_out component_type=azure_blob}:request{request_id=121}:request: azure_core::policies::transport: the following request will be passed to the transport policy: Request {
vector  |     url: Url {
vector  |         scheme: "https",
vector  |         cannot_be_a_base: false,
vector  |         username: "",
vector  |         password: None,
vector  |         host: Some(
vector  |             Domain(
vector  |                 "xxxx.blob.core.windows.net",
vector  |             ),
vector  |         ),
vector  |         port: None,
vector  |         path: "xxxx/2024/11/12/rcs-2024-11-12-21-29-21.json-22dc3472-6d30-4719-a867-23678e88b43a.log.gz",
vector  |         query: None,
vector  |         fragment: None,
vector  |     },
vector  |     method: Put,
vector  |     headers: Headers(
vector  |         {
vector  |             HeaderName(
vector  |                 "user-agent",
vector  |             ): HeaderValue(
vector  |                 "azsdk-rust-storage/0.17.0 (1.77.0; linux; aarch64)",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-blob-type",
vector  |             ): HeaderValue(
vector  |                 "BlockBlob",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-version",
vector  |             ): HeaderValue(
vector  |                 "2020-10-02",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-date",
vector  |             ): HeaderValue(
vector  |                 "Wed, 13 Nov 2024 21:13:00 GMT",
vector  |             ),
vector  |             HeaderName(
vector  |                 "content-length",
vector  |             ): HeaderValue(
vector  |                 "26712",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-blob-content-encoding",
vector  |             ): HeaderValue(
vector  |                 "gzip",
vector  |             ),
vector  |             HeaderName(
vector  |                 "authorization",
vector  |             ): HeaderValue(
vector  |                 "SharedKey xxxxs=",
vector  |             ),
vector  |             HeaderName(
vector  |                 "x-ms-blob-content-type",
vector  |             ): HeaderValue(
vector  |                 "text/plain",
vector  |             ),
vector  |         },
vector  |     ),
vector  |     body: Bytes(

Example Data

No response

Additional Context

No response

References

No response

pront commented 3 days ago

Hi @heshanperera-alert, thanks for creating this issue.

> When we try to download and extract the gzip file, we run into a file-corrupted error.

Can you help me understand the following: is the Vector request accepted or rejected? If the Vector request is successful, does the Azure portal return an error when you attempt to download the file?

heshanperera-alert commented 3 days ago

Hello @pront

No, the request does not fail; it's successful. Azure doesn't return an error when downloading either. The download succeeds, but when I try to extract the file, I get this error:

[Image: extraction error]

When I use gunzip:

:~/Downloads/ > gunzip rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
gunzip: rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz: not in gzip format

pront commented 3 days ago

Is this a valid gzip-compressed file, or is your blob raw bytes? For the latter, you have to set compression (https://vector.dev/docs/reference/configuration/sinks/azure_blob/#compression) to none.

heshanperera-alert commented 3 days ago

@pront I believe it's a valid gzip. If I remove the gzip header from the Azure portal by editing the blob, everything works fine: I can decompress the file after downloading it. I don't want to set compression to none, since we are going to send TBs of data and would like to keep the size to a minimum with compression.

pront commented 3 days ago

I see, thank you for sharing these details. Internally we set the BlobContentEncoding which ultimately determines the value of the "x-ms-blob-content-encoding" header.

Unfortunately, I don't have an Azure environment that I can use to test this myself but I am not convinced that removing the content encoding header is the right thing to do. I wonder if it's an issue with Azure or with the crate version we are using.

heshanperera-alert commented 3 days ago

@pront do you have any workaround in mind to get around this in the short run? It's unrealistic to remove the header from each and every file on Azure Blob Storage, as we have millions of files out there, and there's no command to do that from the Azure CLI either.

jszwedko commented 3 days ago

The note on the compression option on the AWS S3 sink may be relevant here:

> Some cloud storage API clients and browsers handle decompression transparently, so depending on how they are accessed, files may not always appear to be compressed.

https://vector.dev/docs/reference/configuration/sinks/aws_s3/#compression

The same thing may apply to Azure Blob Storage. That is, if you download via the browser or some SDKs, the file will be transparently decompressed during the download.
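The effect described above can be reproduced without any cloud service. The following is a minimal sketch, not Vector or Azure code: `store_blob` and `download` are hypothetical stand-ins, and the sample payload is made up. It only illustrates why a client that honors `Content-Encoding: gzip` ends up with plain bytes in a file that still carries a `.gz` name.

```python
import gzip


def store_blob(payload: bytes) -> dict:
    """Hypothetical stand-in for the sink: compress and tag the blob."""
    return {"body": gzip.compress(payload), "content_encoding": "gzip"}


def download(blob: dict, honor_content_encoding: bool) -> bytes:
    """Hypothetical stand-in for a client download.

    A browser-like client honors Content-Encoding and decodes the body;
    a raw client returns the stored bytes untouched.
    """
    if honor_content_encoding and blob["content_encoding"] == "gzip":
        return gzip.decompress(blob["body"])
    return blob["body"]


payload = b'{"message": "hello"}\n'  # made-up sample event
blob = store_blob(payload)

browser_bytes = download(blob, honor_content_encoding=True)
raw_bytes = download(blob, honor_content_encoding=False)

print(browser_bytes == payload)        # True: already decompressed
print(raw_bytes[:2] == b"\x1f\x8b")    # True: still a gzip stream

# Running gunzip on the already-decoded bytes fails, as in the report.
gunzip_failed = False
try:
    gzip.decompress(browser_bytes)
except OSError as err:  # gzip.BadGzipFile is a subclass of OSError
    gunzip_failed = True
    print("decompression failed:", err)
```

If this is what happens here, the stored blob is intact and only the download path changes what lands on disk.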

heshanperera-alert commented 3 days ago

@jszwedko interesting. The good thing about the S3 sink is that it can override the Content-Encoding header; the Azure blob sink doesn't have that capability.

pront commented 3 days ago

Based on a quick internet search (see this), Jesse is right. Azure decompresses automatically.

Did you inspect the contents of the downloaded file on your host? Let us know; if so, we can close this issue.

(Note that gzip files start with the magic bytes [0x1f, 0x8b])
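One way to run that check on a downloaded file is sketched below. It assumes nothing beyond the two magic bytes mentioned above; the helper name `starts_with_gzip_magic`, the temporary file names, and the sample payload are all made up for the demonstration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def starts_with_gzip_magic(path: str) -> bool:
    """Return True if the file at `path` begins with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


sample = b'{"message": "hello"}\n'  # made-up sample event

with tempfile.TemporaryDirectory() as tmp:
    compressed_path = os.path.join(tmp, "compressed.log.gz")
    misnamed_path = os.path.join(tmp, "misnamed.log.gz")

    # A real gzip stream.
    with open(compressed_path, "wb") as f:
        f.write(gzip.compress(sample))

    # Plain bytes saved under a .gz name, like an already-decoded download.
    with open(misnamed_path, "wb") as f:
        f.write(sample)

    compressed_ok = starts_with_gzip_magic(compressed_path)
    misnamed_ok = starts_with_gzip_magic(misnamed_path)

print(compressed_ok)  # True
print(misnamed_ok)    # False
```

To check a real download, call `starts_with_gzip_magic` with the path of the downloaded file.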

heshanperera-alert commented 3 days ago

@pront shouldn't the blob sink have the same capability as the S3 sink, so that we could override the header?

pront commented 2 days ago

Are you referring to these?

We can add these to the Azure Blob Storage sink as well. Not opposed to that 👍


What I am trying to understand is whether we have a Vector bug or not. If I am reading the above correctly, the downloaded blob is already decompressed but has a gz extension. This should be easy to verify on your side. This comment explains in detail how to get raw data without decompressing using Python APIs.
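As a rough illustration of fetching raw bytes, the sketch below uses Python's standard `urllib.request`, which does not decode `Content-Encoding` on its own. It is not the approach from the linked comment and does not talk to Azure: a throwaway local server (`FakeBlobHandler`, a hypothetical stand-in) serves a gzip body with a `Content-Encoding: gzip` header, and the URL path and payload are made up. Against a real blob you would need an authenticated URL, which is outside this sketch.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b'{"message": "hello"}\n'  # made-up sample event
STORED = gzip.compress(PAYLOAD)      # what the sink would have uploaded


class FakeBlobHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for a blob endpoint."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(STORED)))
        self.end_headers()
        self.wfile.write(STORED)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_raw(url: str):
    """Fetch a URL and return (Content-Encoding, undecoded body bytes)."""
    # Empty ProxyHandler: ignore any system proxy for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as resp:
        return resp.headers.get("Content-Encoding"), resp.read()


server = HTTPServer(("127.0.0.1", 0), FakeBlobHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = f"http://127.0.0.1:{server.server_port}/container/example.log.gz"
    encoding, raw = fetch_raw(url)
finally:
    server.shutdown()
    server.server_close()

print(encoding)                         # gzip
print(raw[:2].hex())                    # 1f8b
print(gzip.decompress(raw) == PAYLOAD)  # True
```

If the raw bytes begin with `1f8b` while a browser download of the same object does not, the stored object is compressed and the difference comes from the client.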

heshanperera-alert commented 2 days ago

Oh yeah, sorry, I haven't answered your question regarding the magic bytes. The file doesn't start with 0x1f, 0x8b.

~/Documents/git/vector-eventhub-poc/ > hexdump -C -n 16 ~/Downloads/rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
00000000  7b 22 54 69 6d 65 73 74  61 6d 70 22 3a 22 32 30  |{"Timestamp":"20|
00000010
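The sixteen bytes in that dump can be decoded directly. A small sketch, using only the hex values shown in the hexdump output:

```python
# The 16 bytes reported by hexdump for the downloaded file.
dumped = bytes.fromhex("7b 22 54 69 6d 65 73 74 61 6d 70 22 3a 22 32 30")

text = dumped.decode("ascii")
is_gzip = dumped[:2] == b"\x1f\x8b"

print(text)     # {"Timestamp":"20
print(is_gzip)  # False
```

The file begins with plain JSON text rather than a gzip header, which is consistent with the content having been decompressed before it was written to disk.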

heshanperera-alert commented 2 days ago

@pront do you know when the feature to overwrite the headers can be added to blob sink?

pront commented 2 days ago

> @pront do you know when the feature to overwrite the headers can be added to blob sink?

Unfortunately this is not on our radar, there's on open feature request for this. If you are motivated, you are welcome to submit a PR and we will review it.

pront commented 2 days ago

> Oh yeah, sorry, I haven't answered your question regarding the magic bytes. The file doesn't start with 0x1f, 0x8b.
>
> ~/Documents/git/vector-eventhub-poc/ > hexdump -C -n 16 ~/Downloads/rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
> 00000000  7b 22 54 69 6d 65 73 74  61 6d 70 22 3a 22 32 30  |{"Timestamp":"20|
> 00000010

Thank you for confirming. You can also inspect the contents to see whether they match what you published, as one more verification step.