[pkg/ottl] Support for extracting UserAgent string

michalpristas commented 5 months ago

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

The intended converter extracts details from the user agent string a browser sends with its web requests into User Agent SemConv attributes.

Describe the solution you'd like

Example:

Input:

{
  "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

Result

{
  "user_agent": {
      "name": "Chrome",
      "original": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
      "version": "51.0.2704.103",
  }
}

Describe alternatives you've considered

No response

Additional context

No response

github-actions[bot] commented 5 months ago

Pinging code owners:

pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth commented 4 months ago

@michalpristas I suspect this is doable via regular expression, and any converter would be doing regex internally. Can this be accomplished with ExtractPatterns?

michalpristas commented 4 months ago

it can, the reason i'm bringing this separately is that in our pipelines this is a processor very commonly used and having to play with regular expressions may be overwhelming a bit. also we extract a bit more informations out of user agent string which do not have SemConv counterparts just yet. but i believe it will be on par over time. information like user_agent.device.name or user_agent.os.name

felixbarny commented 4 months ago

I don't think it would be feasible to do user agent string extraction just with the ExtractPatterns converter. Elasticsearch's user agent processor maintains thousands of lines of regexes that sometimes (but rarely) need updates. The upstream of that file is in https://github.com/ua-parser/uap-core. Also, caching is essential to ensure decent performance. It avoids having to parse the same user agent string over and over again for different events by caching 1000 user agent strings in an LRU cache, by default.

There's a go library that we could potentially re-use for this: https://github.com/ua-parser/uap-go.

As user agent parsing yields multiple values, I'm not sure whether OTTL or a separate processor is the right place for user agent string parsing. I think it would be neat if it's possible to build a log parsing pipeline purely in OTTL, including UA parsing and other things that may yield multiple values but I'm not sure what the guideline and the scope of OTTL is.

To me, user agent string parsing feels an essential building block that should be available to users out of the box, one way or another.

pchila commented 2 months ago

Hello, I am interested in working on this, if you are looking for a volunteer (it would be my first contribution to OTel :smile: )

rogercoll commented 2 months ago

+1 to adding this function to OTTL. My use case: I was building an Otel collector configuration to parse the logs from an NGINX Ingress Controller (log format) using the filelog receiver + the transform processor. As the log has a standard format, the transform processor uses the ExtractPatterns function to get them (http.request.status_code, remote.address, etc.). The issue is that the UserAgent field comes as unstructured plain text, and due to its vast amount of formats (e.g. operating system names) it is very complex to correctly extract every possible subfield using regular expressions.

andrzej-stencel commented 2 months ago

Assigned the issue to you @pchila :+1:

evan-bradley commented 2 months ago

Thanks for volunteering to work on this @pchila. I'd suggest waiting until the discussion needed label is removed before continuing work on this to prevent unnecessary work from any design changes that come out of additional discussion.

I'm okay with adding a function like this given we have an implementation that parses a standard format user agent strings (I think RFC 9910 is the official source right now) into a standard map-like structure, such as the attributes provided by semconv. I think following something close to what we have for the URL Converter that was recently added makes sense.

@felixbarny That's a good note that caching would be helpful here to improve performance. I think we should save that in a follow-up after this function is implemented to keep the each PR small and to ensure we get caching right.

pchila commented 2 months ago

Thank you @evan-bradley , I will have a look at what is done for URL Converter as for the output structure... As a first PR I would target having a simple implementation without caching and maybe using https://github.com/ua-parser/uap-go as mentioned by @felixbarny (we can always switch the implementation in the future if we want) along with unit tests and some simple go benchmarks (so we can evaluate better performance impacts of caching and future changes in the implementation).

I will comment here sketching out the input/output of the function before starting implementation.

pchila commented 2 months ago

So, I had a look at what's been implemented for the URL Converter and a User-Agent parser is very similar, so I would implement the first iteration using https://github.com/ua-parser/uap-go (to keep the PR small while still providing some functionality) and then map the output to the relevant semconv attributes . I think this should be enough as a first PR introducing the function, we can iterate over that with caching/swapping out internals/mapping additional attributes in follow-up PRs...

@TylerHelmuth would such first implementation would be ok in your opinion or we still need more details/clarification?

TylerHelmuth commented 1 month ago

When we first started discussing this function we didn't have https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/ottlfuncs/README.md#adding-new-editorsconverters in place. I think this function does meet the acceptance guidelines since it is a significant user experience improvement and potentially a performance improvement.

I am worried about the resulting semantic convention attributes we'd be producing not being stable. When they stabilize the function could break. We'll want to clearly document the current semantic convention version it is following. Maybe the semantic convention version to generate should be an optional param.

ycombinator commented 2 weeks ago

Hi @pchila @TylerHelmuth, are we good to close this issue now that https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34172 has been merged or is there some more work to be done?

felixbarny commented 2 weeks ago

Are there any follow up enhancements that we're planning? For example, adding caching or supporting more attributes, like the ones the Elasticsearch user_agent supports?

open-telemetry / opentelemetry-collector-contrib