In principle the codec should take care of JSON decoding; there is also a json_lines codec that reads the input and parses the JSON line by line. Your case looks like that format, because the } is not followed by a comma before the next { ... The json codec is picky and will probably complain that it's not well formatted.
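For reference, a minimal input block with that codec could look roughly like this (option names from memory, check the plugin README; the credentials are placeholders):

```
input {
    azure_blob_storage {
        storageaccount => "yourstorageaccount"        # placeholder
        access_key     => "base64accesskey=="         # placeholder
        container      => "insights-logs-gatewaylogs"
        codec          => "json_lines"                # parse one JSON event per line
    }
}
```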
Doesn't API Management provide an eventhub? Events pushed as they come in are, in my opinion, a nicer interface than blob storage, which has to poll the storage regularly and track file states. I wrote the plugin because eventhub isn't always supported.
Regarding the question of why we use this plugin: we want to store the raw data from APIM and other services in a data lake and then pull it out for further processing or reprocessing.
There is also another behavior I've observed: I think something strange causes line 208 to get executed. Look at this log; it shows that when processing the same file on subsequent intervals, the offset is sometimes reset back to 0, causing all events to be re-read.
These are 3 intervals; look how the starting offset is reset. Why would this be happening?
Line 511: [2022-09-16T19:15:52,989][INFO ][logstash.inputs.azureblobstorage][main][1f2ff93d2dbcf378b4bfefb7953b078b81a7bed85ee9f28dbd53566d09de50ac] 3: processing resourceId=/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002/y=2022/m=09/d=16/h=23/m=00/PT1H.json from 0 to 2052
Line 1240: [2022-09-16T19:21:55,001][INFO ][logstash.inputs.azureblobstorage][main][1f2ff93d2dbcf378b4bfefb7953b078b81a7bed85ee9f28dbd53566d09de50ac] 3: processing resourceId=/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002/y=2022/m=09/d=16/h=23/m=00/PT1H.json from 2052 to 3078
Line 1370: [2022-09-16T19:22:58,345][INFO ][logstash.inputs.azureblobstorage][main][1f2ff93d2dbcf378b4bfefb7953b078b81a7bed85ee9f28dbd53566d09de50ac] 3: processing resourceId=/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002/y=2022/m=09/d=16/h=23/m=00/PT1H.json from 0 to 3078
Thanks for the explanation on the retention from eventhub to blobstorage. It's useful for me to understand the use case. https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview
I originally added line 208 because this plugin relies on the Microsoft Ruby library azure-storage-blob, which in turn relies on faraday for the HTTP connection to the blob store. Older versions would disconnect unexpectedly after thousands of entries had worked fine, so reprocessing made sense; otherwise some events would not get processed.
But this happens without considering which exception occurred, and this is the only place where I didn't add a log.error statement. I pushed a version that prints the error message and recreated your situation with the two log entries. The exception I'm seeing now is "caught: undefined method `[]'" ... the warning from json_lines is much more useful.
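To make the failure mode concrete, the spot behaves roughly like this (a sketch, not the literal plugin source; partial_read is a hypothetical helper):

```ruby
# Sketch of the old behavior around line 208: any exception silently
# reset the offset, forcing a full re-read of the file.
begin
  chunk = partial_read(blobname, offset)            # hypothetical helper
  @codec.decode(chunk) { |event| queue << event }   # standard Logstash codec API
rescue => e
  @logger.error("caught: #{e.message}")             # the newly added logging
  offset = 0                                        # old behavior: reprocess from the start
end
```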
jsonlines - JSON parse error, original data now in message field {:message=>"Unexpected character ('t' (code 116)): was expecting comma to separate Object entries
When I check the validity of the JSON, the linter complains about this bit: "message":"{ "test": "jsonformat message"}"
Embedding a JSON message inside a JSON string without escaping the double quotes is a problem for the JSON parser: it sees the message value as "{ " and then wonders why there is a t from test where it expected a comma before the next JSON field.
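Side by side, this is what the parser gets versus what it would accept (assuming the surrounding object is otherwise valid):

```
invalid: "message":"{ "test": "jsonformat message"}"      <- inner quotes close the string early
valid:   "message":"{ \"test\": \"jsonformat message\"}"  <- inner quotes escaped
```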
This parsing happens in the json_lines codec and not in my plugin. I will add better logging for why the exception happened.
For a different project I set up a pipeline that first fixes the JSON mistake with a Ruby string search-and-replace before parsing it into JSON fields.
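A minimal sketch of that kind of repair in plain Ruby (the regex is deliberately naive and assumes the embedded object contains no nested braces; in Logstash it would run inside a ruby filter before the JSON is parsed):

```ruby
# Escape the double quotes of the JSON object embedded in the "message"
# string, so the outer JSON becomes parseable again.
line  = '"message":"{ "test": "jsonformat message"}","severity":"Information"'
fixed = line.gsub(/("message":")(\{.*?\})(")/) do
  pre, obj, post = $1, $2, $3   # save the groups before the inner gsub clobbers $~
  pre + obj.gsub('"', '\"') + post
end
# fixed now contains: "message":"{ \"test\": \"jsonformat message\"}",...
```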
Classic Microsoft =)
I tried using the "line" codec and it's the same; in other words, the same lines are processed again because the offset is reset back to 0. As I understand it, if the line codec is used then JSON parsing is disabled.
Anyhow, I inserted that line into my copy of the gem and ran the test again; here's the error at 208. But now I realize I don't even have the same version as your master branch, because line 208 isn't the same.
According to the directory structure, I have 0.12.3.
Anyhow, I updated to use the code in your master branch. I found 1 instance of the "caught" message during the first interval.
Then in subsequent intervals I saw 1 correct offset addition, and after that it went back to 0 again, with NO caught message, so something else in the logic is resetting the offset to 0.
Line 261: [2022-09-18T15:10:55,551][INFO ][logstash.inputs.azureblobstorage][main][1f2ff93d2dbcf378b4bfefb7953b078b81a7bed85ee9f28dbd53566d09de50ac] 3: processing resourceId=/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002/y=2022/m=09/d=18/h=19/m=00/PT1H.json from 0 to 1026
Line 1127: [2022-09-18T15:17:58,714][INFO ][logstash.inputs.azureblobstorage][main][1f2ff93d2dbcf378b4bfefb7953b078b81a7bed85ee9f28dbd53566d09de50ac] 3: processing resourceId=/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002/y=2022/m=09/d=18/h=19/m=00/PT1H.json from 1026 to 3078
Line 1270: [2022-09-18T15:19:01,119][INFO ][logstash.inputs.azureblobstorage][main][1f2ff93d2dbcf378b4bfefb7953b078b81a7bed85ee9f28dbd53566d09de50ac] 3: processing resourceId=/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002/y=2022/m=09/d=18/h=19/m=00/PT1H.json from 0 to 3078
My guess is that the problem is with the try_list_blobs function. Could you help me understand what it's trying to do?
Based on your calling logic, I would think you're trying to set the offset to the previously read number of bytes and then just updating the new length, but it seems you set the offset to 0 on line 481.
Hi, I think I got it: your code on the partial file read does not update the "size" variable, and thus 0 gets stored into the registry on every partial read. I think you need to add code to support multiple partial reads, since those log files in the blob will grow over the hour; it would be a severe limitation if I could only set the interval to greater than 1 hour. Please confirm whether what I am saying is correct.
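Something like this is what I'd expect the bookkeeping to look like after every partial read (variable names are my guesses, not the actual code):

```ruby
# After reading a chunk, the stored offset must advance by the bytes
# consumed, so the next interval continues where this one stopped
# instead of restarting at 0.
size = offset + chunk.bytesize
@registry.store(blobname, { :offset => size, :length => blob_length })
```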
I pushed 0.12.4 to rubygems just now; it fixes line 208 so it no longer sets the offset back to 0, and it contains improvements from @nttoshev on partial reads of files that need the json_lines codec.
Your example above is in json_lines format: every line is a new event. However, your event is not valid JSON, and this triggered an exception. That spot originally suffered from connection issues, where setting the offset to 0 would ensure reprocessing of the file, but with newer faraday this is not needed.
The exception I get with your samples is "caught: undefined method `[]' for nil:NilClass" and the warning JSON parse error, original data now in message field {:message=>"Unexpected character ('t' ... this is because your JSON message contains an embedded JSON message, which is not escaped.
At the end of the line there is "message":"{"test"} ... but the double quote that was supposed to start the word test actually closed the { as a string ... this is not valid JSON and the json_lines codec doesn't handle it.
Partial reads should go fine if they are complete lines. The idea of this plugin is that it detects that a file has grown since the last scan and processes only the new bytes, but due to the exception and the previous behavior it restarted processing from 0. If the thing writing to the blob store only commits complete block blobs, the lines should be read correctly.
The interval determines the rhythm at which file growth is checked. You can use an interval of one second and the file listing will happen pretty much continuously; the processing of the files would also not pause until the next interval. You would just be pounding the system, so I wouldn't go any lower than 60-second intervals. There is no need to wait for an hour: the plugin can deal with partial reads. It has to, because you can't predict whether a file will grow or not.
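Roughly, the per-interval check works like this (a sketch with made-up names, not the literal code):

```ruby
# Each interval: compare the registered offset with the current blob
# length and read only the newly appended bytes.
if blob_length > registered_offset
  chunk = read_range(blobname, registered_offset, blob_length - registered_offset)
  @codec.decode(chunk) { |event| queue << event }
end
```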
Did you check version 0.12.4? It should not set offsets back to 0, but I still think the JSON inside JSON is invalid and crashes the json_lines codec.
I need to know if this plugin supports reading the diagnostic logs that are forwarded to blob storage. I get JSON parse errors.
The APIM logs look like this; I have 2 log lines in this blob. What is the corresponding log type?
{ "DeploymentVersion": "0.30.1742.0", "Level": 4, "isRequestSuccess": true, "time": "2022-09-16T15:22:06.5944852Z", "operationName": "Microsoft.ApiManagement/GatewayLogs", "category": "GatewayLogs", "durationMs": 17, "callerIpAddress": "76.75.136.14", "correlationId": "c355dc72-0dbc-4e0d-b34c-9e02e1f96e9a", "location": "Canada Central", "properties": {"method":"GET","url":"https://apim-oagelk-dev-002.azure-api.net/api/fhir/test","backendResponseCode":200,"responseCode":200,"responseSize":486,"cache":"none","backendTime":16,"requestSize":660,"apiId":"providertemplate","operationId":"get-test","clientProtocol":"HTTP/1.1","backendProtocol":"HTTP/1.1","apiRevision":"1","clientTlsVersion":"1.2","backendMethod":"GET","backendUrl":"https://simulatedbackend.azurewebsites.net/api/HttpTrigger2/test","requestHeaders":{"Authorization":"Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiIiLCJpc3QiOiIiLCJpYXQiOjE2NjMzNDE3MjYsImV4cCI6MTY2MzM0MTc1NiwianRpIjoiand0X25vbmNlIiwidXNlcm5hbWUiOiJ0ZXN0dXNlcjMifQ.FvMLBnwnVP2XKKmLGvoQUtDgaQE-gtcCYYaYq-gJ4DU"},"traceRecords":[{"message":"testuser3","severity":"Information","timestamp":"2022-09-16T15:22:06.5944852Z"},{"message":"{ \"test\": \"jsonformat message\"}","severity":"Information","timestamp":"2022-09-16T15:22:06.5944852Z"}]}, "resourceId": "/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002"} { "DeploymentVersion": "0.30.1742.0", "Level": 4, "isRequestSuccess": true, "time": "2022-09-16T15:39:20.9565494Z", "operationName": "Microsoft.ApiManagement/GatewayLogs", "category": "GatewayLogs", "durationMs": 72, "callerIpAddress": "76.75.136.14", "correlationId": "371bbe9c-e663-49a7-904a-30e3fcc93700", "location": "Canada Central", "properties": {"method":"GET","url":"https://apim-oagelk-dev-002.azure-api.net/api/fhir/sign","responseCode":200,"responseSize":93,"cache":"none","apiId":"providertemplate","operationId":"get-sign","clientProtocol":"HTTP/1.1","apiRevision":"1","clientTlsVersion":"1.2","requestHeaders":{"Authorization":"Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiIiLCJpc3QiOiIiLCJpYXQiOjE2NjMzNDI3NjAsImV4cCI6MTY2MzM0Mjc5MCwianRpIjoiand0X25vbmNlIiwidXNlcm5hbWUiOiJ0ZXN0dXNlcjMifQ.3yEz3toqK4yEAfec0Xa7A2UMYg4g0FOk4usNh-nFI"}}, "resourceId": "/SUBSCRIPTIONS/ABDE4308-7B2A-43F0-8557-3ACFFB6BEE0B/RESOURCEGROUPS/RG-OAGELK-DEV-001/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/APIM-OAGELK-DEV-002"}