janmg / logstash-input-azure_blob_storage

This is a plugin for Logstash to fetch files from Azure Storage Accounts

Reading Gzip file on azure blob containing json #25

Open ashwinmuni opened 2 years ago

ashwinmuni commented 2 years ago

Is it possible to read .gz files stored on Azure blob storage? The gz files contain JSON. I have used the following config; the error is below.

input {
    azure_blob_storage {
        storageaccount => "ashwin"
        access_key => "12WB3f+exT2wImZgX+N7KgJw=="
        container => "india"
        codec => "json"
    }
}

output {
      elasticsearch {
        user => "elastic"
        password => "F##@AbwOzN"
        ssl => true
        ssl_certificate_verification => false
        hosts => [ "https://127.0.0.1:9200/" ]
        index => "assam-blob-%{+YYYY.MM.dd}"
        cacert => "/etc/logstash/http_ca_1.crt"
      }
}

I also tried a different codec, gzip_lines, but it didn't work.

[2022-04-21T20:10:28,589][INFO ][logstash.inputs.azureblobstorage][main][2b92afafbd9b3b3a837d391ec4215c55812dc93f150871c0c89b7bcf205559ed] learn json one of the attempts failed BlobArchived (409): This operation is not permitted on an archived blob.

janmg commented 2 years ago

This one is a bit more complicated, as ideally logstash-input-azure_blob_storage is only an input plugin, while this is a codec/filtering issue, so I need to think about how the flow would work best. Files can grow, and the plugin tries to deal with partial reads; for JSON there is a prefix and a postfix that have to be taken into account. The learning can be skipped by configuring those manually, but I'm not sure how well Logstash understands gzipped files.
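For reference, skipping the learning phase would look roughly like this (a sketch assuming the plugin's skip_learning, file_head and file_tail options; the values must match the actual head and tail of your JSON files):

    input {
        azure_blob_storage {
            storageaccount => "ashwin"
            access_key => "..."
            container => "india"
            codec => "json"
            skip_learning => true
            file_head => '{"records":['
            file_tail => ']}'
        }
    }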

There is a codec which handles gzipped JSON files, but I don't think anybody has actually published it yet. http://speakmy.name/2014/01/13/gzipped-json-files-and-logstash/

I don't have much time in the near future, but I can try to troubleshoot something in a couple of months.

ashwinmuni commented 2 years ago

Thanks @janmg, let me also try to modify the bits and see if it's successful; I will keep you posted.

dimuskin commented 1 year ago

Any news with this feature?

janmg commented 1 year ago

Can you describe what is contained in those gzip files, how they are created and whether they can grow? Maybe I can create an experimental gzip decoder and let the rest be dealt with by the JSON codec. Ideally I only load files from azureblobstorage, but my plugin has already become a bit more than an input plugin, so maybe cramming in a gzip decoder isn't going to be such a sin... I would appreciate a bit of feedback, though, on what the use case is.
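Such a decoder would essentially just inflate the blob's bytes before handing them to the codec. A minimal Ruby sketch (the helper name decode_gzip_blob is hypothetical, not part of the plugin):

```ruby
require 'zlib'
require 'json'

# Hypothetical helper: inflate a gzipped blob in memory so the plain
# JSON text can be handed to the regular json codec afterwards.
def decode_gzip_blob(bytes)
  Zlib.gunzip(bytes)
end

# Round-trip demonstration with a small payload.
payload = JSON.generate("records" => [{ "msg" => "hello" }])
blob    = Zlib.gzip(payload)               # what would sit in the container
decoded = decode_gzip_blob(blob)           # what the codec would receive
events  = JSON.parse(decoded)["records"]   # each entry would become an event
```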

dimuskin commented 1 year ago

@janmg thank you for your response. I meant the functionality that is present in filebeat. https://www.elastic.co/guide/en/beats/filebeat/master/filebeat-input-azure-blob-storage.html

If we have a gzip file, it must be unpacked before processing, and the content should then be processed as a regular text/JSON file.

One file per archive would be correct; otherwise the situation only becomes more complicated.

As for growing: gzip is not a streaming protocol, so the files can't grow.

janmg commented 1 year ago

I found two codecs named json_gz that can read gzipped JSON files. Both versions of the codec work with the azure_blob_storage input plugin, but I haven't tested exactly how they process the JSON file. I assume it's processed as a single Logstash event, which means that you have to split events in the Logstash filter stage. If that codec doesn't work because your input is, for instance, json_lines, it's still easier to modify the codec than the input plugin.

https://github.com/dterziev/logstash-codec-json_gz/ https://github.com/ador-mg/logstash-codec-json_gz

The dterziev version is on rubygems as version 1.0.1 and can be installed with: sudo -u logstash /usr/share/logstash/bin/logstash-plugin install logstash-codec-json_gz

Both codecs are configured the same way, because they share the same name: input { azure_blob_storage { codec => "json_gz" } }
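If the whole gzipped file does come through as one event, splitting it in the filter stage could look like this (a sketch; the field name "records" is an assumption based on the {"records":[...]} structure discussed earlier in this thread):

    input {
        azure_blob_storage {
            storageaccount => "ashwin"
            access_key => "..."
            container => "india"
            codec => "json_gz"
        }
    }
    filter {
        split { field => "records" }
    }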

For future reference, elastic themselves are working on this in filebeat. It's in beta and in x-pack. I like that it's written in Go rather than Ruby, but I don't know its state or whether it can replace my plugin. https://github.com/elastic/beats/blob/main/x-pack/filebeat/input/azureblobstorage/input.go