snowplow / enrich

Snowplow Enrichment jobs and library
https://snowplowanalytics.com

Cookie Extractor misses cookies behind JSON stringified cookie values #904

Open rowolff opened 1 week ago

rowolff commented 1 week ago

Project: Stream Enrich

Version: 5.0.0

Expected behavior:

Actual behavior:

The extraction only works if there's no stringified JSON in front of the wanted cookie:

Steps to reproduce:

  1. Configure cookie extractor with at least one cookie name
  2. Create a request where the wanted cookie is behind a stringified JSON cookie value (example above)

Example: I reproduced this with Snowplow Micro in this repository: https://github.com/rowolff/snowplow-micro-debugging/

Additional info:

We noticed the bug while upgrading our components. We were running with Collector 2.9.1/Enrich 5.0.0 for a while and then jumped to Collector 3.2.0/Enrich 5.0.0 when we suddenly noticed the issue. Hope this helps.

miike commented 1 week ago

This is most likely caused by the move from akka-http to http4s between collector 2.x and 3.x (http4s is far stricter than akka-http, which is relatively lax most of the time).

Unfortunately the behaviour is undesired, but I think it is likely correct (although it is unusual that it works when the cookie ordering changes): the backslash in your cookie value is forbidden by RFC 6265, which requires each cookie octet to be

 cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                       ; US-ASCII characters excluding CTLs,
                       ; whitespace DQUOTE, comma, semicolon,
                       ; and backslash
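
To make the grammar concrete, here is a minimal Python sketch that checks a value against the cookie-octet rule quoted above. It is not taken from the collector or from http4s; the function names and the sample cookie value are made up for illustration.

```python
import re

# cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E  (RFC 6265)
_OCTET = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]"
_COOKIE_OCTETS = re.compile(_OCTET + "*")
_SINGLE_OCTET = re.compile(_OCTET)


def is_valid_cookie_value(value: str) -> bool:
    """Check value against: *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )."""
    # The RFC allows the whole value to be wrapped in one pair of double quotes.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return _COOKIE_OCTETS.fullmatch(value) is not None


def disallowed_chars(value: str) -> list:
    """Return the distinct characters in value that are not cookie-octets."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    return [c for c in dict.fromkeys(value) if not _SINGLE_OCTET.fullmatch(c)]


# Hypothetical stringified-JSON cookie value: {\"a\":1}
sample = '{\\"a\\":1}'
print(is_valid_cookie_value(sample))   # False
print(disallowed_chars(sample))        # ['\\', '"']
print(is_valid_cookie_value("abc123")) # True
```

A strict parser that follows this grammar has no valid way to read such a value, which would be consistent with the difference seen between the two collector versions.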

As a result, the recommendation (for cross-browser compatibility) is to base64 encode any value where you expect disallowed characters to occur.

   To maximize compatibility with user agents, servers that wish to
   store arbitrary data in a cookie-value SHOULD encode that data, for
   example, using Base64 [RFC4648].
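
A minimal sketch of that recommendation in Python, assuming the side that sets the cookie and the side that reads it can agree on the encoding. The helper names and the sample payload are hypothetical.

```python
import base64
import json


def encode_cookie_value(data: dict) -> str:
    """Serialise data to compact JSON and Base64-encode it for use as a cookie value."""
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    # Standard Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
    # all of which fall inside the cookie-octet ranges of RFC 6265.
    return base64.b64encode(raw).decode("ascii")


def decode_cookie_value(value: str) -> dict:
    """Reverse encode_cookie_value: Base64-decode, then parse the JSON."""
    return json.loads(base64.b64decode(value.encode("ascii")).decode("utf-8"))


# Hypothetical payload that would otherwise contain quotes and commas.
payload = {"consent": ["ads", "stats"], "v": 2}
cookie_value = encode_cookie_value(payload)
print(cookie_value)
print(decode_cookie_value(cookie_value) == payload)  # True
```

Whatever reads the cookie has to decode it again, so this only helps for cookies whose format you control.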

rowolff commented 1 week ago

Hi @miike - thank you for the quick response and the awesome detective work. I'll check with my team whether there is anything we can do about it. Some JSON strings come from third-party tools and we're not in control of how they are formatted, so it might take some time to resolve that.

miike commented 1 week ago

No worries. I can see how third-party cookies could definitely be problematic and difficult to modify (or get encoded correctly).

There may be some good news: this is by no means the first time folks have run into this issue with http4s, and as a result there is a PR that adds a "RelaxedCookies" mode, and its tests seem to include some JSON. I haven't tested this, as I'm assuming it's an issue with the collector rather than enrich, but that seems a reasonable bet if the same version of enrich demonstrates different behaviour between collector 2.9.1 and 3.2.0.

I've raised this with the engineering team to have a closer look and see what we might be able to do - thank you for flagging this one!