tremor-rs / tremor-runtime

Main Tremor Project Rust Codebase
https://www.tremor.rs
Apache License 2.0
852 stars 125 forks source link

Extra empty path segments in `EventOriginUri`s #183

Closed m0n0chr0m3 closed 4 years ago

m0n0chr0m3 commented 4 years ago

Problem The path method on EventOriginUris behaves different from what I would have expected, and yields different results from what the uri crate's path_segments yields: invoking path() on paths of the form "scheme://host.name/" yields a vector of two empty strings, where I'd expect only one.

Steps Consider the behavior at https://github.com/m0n0chr0m3/tremor-runtime/commit/3798d4383a7f8488bbf16042e927ce74ae703dc9#diff-bdd33b2308718bd57f53842f6df9fce6R166. The assertions as written there succeed in the current implementation of tremor-script. I'd expect the version in the comment below it. Similarly for the rest of that test case.

For reference, the behavior of the url crate's path_segments method on those inputs is as follows:

"protocol://the.host.name": None
"protocol://the.host.name:8080": None
"protocol://the.host.name/": Some([""])
"protocol://the.host.name/some/path/segments": Some(["some", "path", "segments"])
"protocol://the.host.name/some/path/segments/": Some(["some", "path", "segments", ""])
"protocol://host.names.are.%F0%9F%94%A5": None

(as demonstrated on the playground)

As far as I see, none of the unit tests currently check any of the origin namespace Tremor functions. The Tremor test harness for unit tests runs all scripts/pipelines/etc. in a context where the origin is not set, so all functions would return null anyway. I hence cannot gauge the impact of 'fixing' the path method.

Possible Solution(s)

  1. If the behavior is correct as-is, close this bug report.
  2. If the behavior is wrong, and the path segments for"scheme://host/" should be vec![""], and the path segments for "scheme://host" should be vec![""] too, tell me and I'll open a PR.
  3. If the behavior is wrong, and the path segments for"scheme://host/" should be vec![""], and the path segments for "scheme://host" should be vec![], tell me and I'll open a PR.
  4. If the behavior is wrong, and the path segments for"scheme://host/" should be Some(vec![""]), and the path segments for "scheme://host" should be None, tell me and I'll open a PR.
  5. If the behavior is wrong, but none of the above, tell me what's expected and I'll look into it. ;)

Notes

Output of tremor-server --version:

tremor version: 0.7.3
rd_kafka version: 0x000000ff, 1.3.0
tremor-runtime 0.7.3
anupdhml commented 4 years ago

Great catch! The behavior you are seeing is definitely unexpected.

I see that the tests in your branch start off from EventOriginUri::parse method, and that is actually not in use anywhere else right now (unless some non-tremor project picked it up lately πŸ˜ƒ ). From tremor onramps, EventOriginUris are initialized directly (example) and in that context, the path method will reflect whatever the onramp sets.

We should fix the parse functionality though. I like the option 3 you proposed -- compared to 2, it gives users the ability to distinguish between "scheme://host/" and "scheme://host" for path, in case they need it. 4 would also achieve the same but 3 is more consistent with how path is being set from onramps and vec![] sufficiently handles the None path case without needing to make it an optional type (4 would also open up Some(vec![]) as a potential value for path which is one too many option for what we need here).

Assuming we go with option 3, could you also change the Display implementation of EventOriginUri to be consistent with it? Right now, it outputs "scheme://host/" for both path cases (vec![] and vec![""]): https://github.com/wayfair-tremor/tremor-runtime/blob/v0.7.3/tremor-script/src/ctx.rs#L86

Thanks for working on adding the tests -- much appreciated!

m0n0chr0m3 commented 4 years ago

Thanks for the feedback.

Option 4 is indeed a bit weird, but it's what the url crate does.

I'll prepare a version based on option 3, with the Display implementation fixed and verified by the tests.

m0n0chr0m3 commented 4 years ago

Follow-up question: should EventOriginUri::default() result in a URI "tremor-script://localhost/" or a "tremor-script://localhost" (i.e., with an empty path)?

Licenser commented 4 years ago

It might be worth considering the RFC for URI's when looking at this.

It states:

A path is always defined for a URI, though the defined path may be empty (zero length).

Which rules out the use of Option the way I read it.

Looking further at the grammar:

      path          = path-abempty    ; begins with "/" or is empty
                    / path-absolute   ; begins with "/" but not "//"
                    / path-noscheme   ; begins with a non-colon segment
                    / path-rootless   ; begins with a segment
                    / path-empty      ; zero characters

      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      path-noscheme = segment-nz-nc *( "/" segment )
      path-rootless = segment-nz *( "/" segment )
      path-empty    = 0<pchar>
      segment       = *pchar
      segment-nz    = 1*pchar
      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) ; non-zero-length segment without any colon ":"
      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

The important bits are

      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      segment       = *pchar

In my opinion as it answers the question if an empty path is allowed, the differentiation between segment and segment-nz indicates to me that yes they en empty path element as the last element of a path si allowed and distinctly different from no / at the end so

"protocol://the.host.name": None
"protocol://the.host.name/": Some([""])

Are different URI's.

Taking those into account I would say that 3 is the RFC-conform implementation.

anupdhml commented 4 years ago

Follow-up question: should EventOriginUri::default() result in a URI "tremor-script://localhost/" or a "tremor-script://localhost" (i.e., with an empty path)?

Since there's no info to convey by setting path here, I would prefer to leave it empty.

@Licenser good call on the rfc!

m0n0chr0m3 commented 4 years ago

Indeed, url's behavior is seemingly at odds with the (proposed) specification and RFC. I didn't feel like second-guessing a dependency without input from the core devs.