Closed michalpristas closed 2 weeks ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@michalpristas I suspect this is doable via regular expression, and any converter would be doing regex internally. Can this be accomplished with ExtractPatterns
?
it can, the reason i'm bringing this separately is that in our pipelines this is a processor very commonly used and having to play with regular expressions may be overwhelming a bit.
also we extract a bit more informations out of user agent string which do not have SemConv counterparts just yet. but i believe it will be on par over time. information like user_agent.device.name
or user_agent.os.name
I don't think it would be feasible to do user agent string extraction just with the ExtractPatterns
converter. Elasticsearch's user agent processor maintains thousands of lines of regexes that sometimes (but rarely) need updates. The upstream of that file is in https://github.com/ua-parser/uap-core. Also, caching is essential to ensure decent performance. It avoids having to parse the same user agent string over and over again for different events by caching 1000 user agent strings in an LRU cache, by default.
There's a go library that we could potentially re-use for this: https://github.com/ua-parser/uap-go.
As user agent parsing yields multiple values, I'm not sure whether OTTL or a separate processor is the right place for user agent string parsing. I think it would be neat if it's possible to build a log parsing pipeline purely in OTTL, including UA parsing and other things that may yield multiple values but I'm not sure what the guideline and the scope of OTTL is.
To me, user agent string parsing feels an essential building block that should be available to users out of the box, one way or another.
Hello, I am interested in working on this, if you are looking for a volunteer (it would be my first contribution to OTel :smile: )
+1 to adding this function to OTTL.
My use case: I was building an Otel collector configuration to parse the logs from an NGINX Ingress Controller (log format) using the filelog
receiver + the transform
processor. As the log has a standard format, the transform processor uses the ExtractPatterns function to get them (http.request.status_code, remote.address, etc.). The issue is that the UserAgent field comes as unstructured plain text, and due to its vast amount of formats (e.g. operating system names) it is very complex to correctly extract every possible subfield using regular expressions.
Assigned the issue to you @pchila :+1:
Thanks for volunteering to work on this @pchila. I'd suggest waiting until the discussion needed
label is removed before continuing work on this to prevent unnecessary work from any design changes that come out of additional discussion.
I'm okay with adding a function like this given we have an implementation that parses a standard format user agent strings (I think RFC 9910 is the official source right now) into a standard map-like structure, such as the attributes provided by semconv. I think following something close to what we have for the URL
Converter that was recently added makes sense.
@felixbarny That's a good note that caching would be helpful here to improve performance. I think we should save that in a follow-up after this function is implemented to keep the each PR small and to ensure we get caching right.
Thank you @evan-bradley , I will have a look at what is done for URL
Converter as for the output structure...
As a first PR I would target having a simple implementation without caching and maybe using https://github.com/ua-parser/uap-go as mentioned by @felixbarny (we can always switch the implementation in the future if we want) along with unit tests and some simple go benchmarks (so we can evaluate better performance impacts of caching and future changes in the implementation).
I will comment here sketching out the input/output of the function before starting implementation.
So, I had a look at what's been implemented for the URL
Converter and a User-Agent parser is very similar, so I would implement the first iteration using https://github.com/ua-parser/uap-go (to keep the PR small while still providing some functionality) and then map the output to the relevant semconv attributes .
I think this should be enough as a first PR introducing the function, we can iterate over that with caching/swapping out internals/mapping additional attributes in follow-up PRs...
@TylerHelmuth would such first implementation would be ok in your opinion or we still need more details/clarification?
When we first started discussing this function we didn't have https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/ottlfuncs/README.md#adding-new-editorsconverters in place. I think this function does meet the acceptance guidelines since it is a significant user experience improvement and potentially a performance improvement.
I am worried about the resulting semantic convention attributes we'd be producing not being stable. When they stabilize the function could break. We'll want to clearly document the current semantic convention version it is following. Maybe the semantic convention version to generate should be an optional param.
Hi @pchila @TylerHelmuth, are we good to close this issue now that https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34172 has been merged or is there some more work to be done?
Are there any follow up enhancements that we're planning? For example, adding caching or supporting more attributes, like the ones the Elasticsearch user_agent
supports?
Component(s)
pkg/ottl
Is your feature request related to a problem? Please describe.
The intended converter extracts details from the user agent string a browser sends with its web requests into User Agent SemConv attributes.
Describe the solution you'd like
Example:
Input:
Result
Describe alternatives you've considered
No response
Additional context
No response