heshanperera-alert opened this issue 3 days ago
Hi @heshanperera-alert, thanks for creating this issue.
> When we try to download and extract the gzip file, we run into a "file corrupted" error.
Can you help me understand the following: is the Vector request accepted or rejected? And if the Vector request is successful, does the Azure portal return an error when you attempt to download the file?
Hello @pront
No, the request does not fail; it's successful. Azure doesn't return an error when downloading either. The file downloads successfully, but when I try to extract it with gunzip, it just gives me this error:

~/Downloads/ > gunzip rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
gunzip: rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz: not in gzip format
Is this a valid gzip-compressed file, or is your blob raw bytes? For the latter, you have to set https://vector.dev/docs/reference/configuration/sinks/azure_blob/#compression to none.
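For reference, a minimal sketch of what that setting looks like in a Vector TOML config. This is illustrative only: the sink name, inputs, and connection values are placeholders, and the option names should be confirmed against the linked docs for your Vector version.

```toml
# Illustrative sketch: azure_blob sink with compression disabled.
# Sink name, inputs, and connection values are placeholders.
[sinks.azure_logs]
type              = "azure_blob"
inputs            = ["my_source"]
connection_string = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
container_name    = "logs"
compression       = "none"  # avoids gzip, and with it the content-encoding metadata

[sinks.azure_logs.encoding]
codec = "json"
```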
@pront I believe it's a valid gzip. If I remove the content-encoding header from the Azure portal by editing the blob, everything works fine: I can decompress the file after downloading it. I don't want to set the compression to none, since we are going to send TBs of data and would like to keep the volume to a minimum with compression.
I see, thank you for sharing these details. Internally we set the BlobContentEncoding, which ultimately determines the value of the "x-ms-blob-content-encoding" header.
Unfortunately, I don't have an Azure environment that I can use to test this myself, but I am not convinced that removing the content-encoding header is the right thing to do. I wonder if it's an issue with Azure or with the crate version we are using.
@pront do you have any workaround in mind to get around this in the short run? It's unrealistic to remove the header from each and every file on Azure Blob, as we have millions of files out there, and there's no command to do that from the Azure CLI either.
The note on the compression option on the AWS S3 sink may be relevant here:

> Some cloud storage API clients and browsers handle decompression transparently, so depending on how they are accessed, files may not always appear to be compressed.

https://vector.dev/docs/reference/configuration/sinks/aws_s3/#compression
The same thing may apply to Azure Blob Storage. That is: if you download via the browser or some SDKs, the file will be transparently decompressed when downloading.
@jszwedko interesting. The good thing about the S3 sink is that it has the ability to override the content-encoding header; the Azure Blob sink doesn't have that capability.
Based on a quick internet search (see this), Jesse is right: Azure decompresses automatically.
Did you inspect the contents of the downloaded file on your host? Let us know; if so, we can close this issue.
(Note that gzip files start with the magic bytes [0x1f, 0x8b].)
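If you want to check this programmatically rather than with hexdump, here is a small Python sketch (the file path is a placeholder):

```python
# Check whether a downloaded blob starts with the gzip magic bytes 0x1f 0x8b.
path = "downloaded-blob.log.gz"  # placeholder path

with open(path, "rb") as f:
    magic = f.read(2)

print("looks like gzip" if magic == b"\x1f\x8b" else f"not gzip; starts with {magic!r}")
```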
@pront shouldn't the Blob sink have the same capability as the S3 sink, so that we could override the header?
Are you referring to these?
We can add these to the Azure Blob Storage sink as well. Not opposed to that 👍
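For context, a sketch of how those overrides look on the aws_s3 sink, assuming the options being referenced are its content_type and content_encoding settings; the sink name, bucket, region, and the content_encoding value here are placeholders, so check the S3 sink docs for accepted values. A comparable pair of options on azure_blob is what this issue is asking for.

```toml
# Sketch: aws_s3 sink with explicit overrides for the stored object's headers.
# Bucket, region, and header values are placeholders.
[sinks.s3_logs]
type             = "aws_s3"
inputs           = ["my_source"]
bucket           = "my-bucket"
region           = "us-east-1"
compression      = "gzip"
content_type     = "application/gzip"
content_encoding = "identity"  # hypothetical value; see the docs for accepted values

[sinks.s3_logs.encoding]
codec = "json"
```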
What I am trying to understand is whether we have a Vector bug or not. If I am reading the above correctly, the downloaded blob is already decompressed but has a .gz extension. This should be easy to verify on your side. This comment explains in detail how to get the raw data without decompressing using the Python APIs.
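The linked comment isn't reproduced here, but as an illustration of the idea, here is a sketch using the requests library and a placeholder SAS URL: requests transparently decodes a Content-Encoding: gzip response when you read resp.content, while resp.raw.read() returns the bytes as stored server-side.

```python
import requests

# Placeholder SAS URL for the blob; substitute your own.
sas_url = "https://<account>.blob.core.windows.net/<container>/<blob>?<sas-token>"

resp = requests.get(sas_url, stream=True)
resp.raise_for_status()

# resp.raw bypasses requests' transparent Content-Encoding decoding,
# so this is the payload exactly as stored in the blob.
stored = resp.raw.read()
print("stored bytes start with:", stored[:2].hex())  # "1f8b" would mean gzip

# resp.content, by contrast, would have been gzip-decoded automatically
# whenever the response carries Content-Encoding: gzip.
```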
Oh yeah, sorry, I haven't answered your question regarding the magic bytes. It doesn't start with 0x1f, 0x8b.
~/Documents/git/vector-eventhub-poc/ > hexdump -C -n 16 ~/Downloads/rcs-2024-11-13-19-14-1-2f3ffb38-f2d7-40f4-b722-1871c45142ba.log.gz
00000000  7b 22 54 69 6d 65 73 74  61 6d 70 22 3a 22 32 30  |{"Timestamp":"20|
00000010
@pront do you know when the feature to override the headers can be added to the Blob sink?
Unfortunately this is not on our radar; there's an open feature request for this. If you are motivated, you are welcome to submit a PR and we will review it.
Thank you for confirming. As one more verification step, you can also inspect the contents to see if they match what you published.
A note for the community
Problem
When using the Azure Blob sink to upload log files in gzip format, Vector adds a 'content-encoding' header. When we try to download and extract the gzip file, we run into a file corrupted error. However, when we manually remove the content-encoding header from the file and then download it, everything works as expected. There doesn't seem to be a way to remove this header from the configuration. What should we do? Following are the file properties on the Azure portal (screenshot not reproduced here).
Configuration
No response
Version
0.37.1
Debug Output
Example Data
No response
Additional Context
No response
References
No response