spring-projects / spring-hateoas

Spring HATEOAS - Library to support implementing representations for hyper-text driven REST web services.
https://spring.io/projects/spring-hateoas
Apache License 2.0
1.04k stars 478 forks source link

Support advanced Link header expressions #2099

Closed viliam-durina closed 3 months ago

viliam-durina commented 7 months ago

We've decided to use your library to parse Link headers, hoping that you'll correctly implement the specification intricacies, because doing parsing correctly is tricky. However, it's implemented completely naively, working only in the basic cases. It uses regex for parsing, even though it can't be used for non-context-free grammars.

Here are a bunch of examples:

// should have failed
Link.valueOf("foo bar <url>;rel=\"next\"");

// greedy capture until the last `>`
System.out.println(Link.valueOf("<url>;customparam=foo>;rel=\"next\"").getHref());

// multiple rels not supported, should have been parsed to a collection
System.out.println(Link.valueOf("<url>;rel=\"next last\"").getRel().value());

// incorrect unquoting of values - missing end quote must be ignored, but currently "nex" is returned (missing last "t")
System.out.println(Link.valueOf("<url>;rel=\"next").getRel().value());

// param value not unescaped
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo\\\"bar\"").getTitle());

// incorrect semicolon handling - a semicolon within a quoted value is not treated literally
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo;bar\"").getRel().value());

This is the ABNF for the Link header:

  Link       = #link-value
  link-value = "<" URI-Reference ">" *( OWS ";" OWS link-param )
  link-param = token BWS [ "=" BWS ( token / quoted-string ) ]

And here for the content of the rel param:

  relation-type *( 1*SP relation-type )
where:

  relation-type  = reg-rel-type / ext-rel-type
  reg-rel-type   = LOALPHA *( LOALPHA / DIGIT / "." / "-" )
  ext-rel-type   = URI ; Section 3 of [RFC3986]

See https://httpwg.org/specs/rfc8288.html

So there can be multiple values in the rel param, even URIs and quoted URIs.

Some of the errors can't be fixed without breaking b-w compatibility, e.g. the decoding of href, unescaping of params and multiple rel values. Also, I'm not sure why custom parameters aren't supported - I didn't read the whole spec, but I don't think it prohibits custom params (we don't use them, just wondering).

viliam-durina commented 7 months ago

Looking at the RFC more deeply, looks that I'm missing a lot. For example, missing > or missing final quote " MUST be ignored. There's nothing about URL-decoding the href, at least I can't find it. Seems that the particular service I'm working with now got this incorrectly since they are double-escaping their URL, contrary to the standard. There also can be multiple links within a single header, separated by ,, if I understand it correctly.

But many of the points above still apply. The RFC contains precise algorithm for parsing, I would expect a high quality library to strictly follow it, including the edge cases.

odrotbohm commented 7 months ago

Would you mind listing the cases that really do break and are valid link headers according to the specification?

viliam-durina commented 7 months ago

@odrotbohm I edited the original post of this issue. The list isn't complete, to implement correctly, one should follow the specification very closely. There's exact algorithm for the parsing an Appendix B, from which handling of all (most?) edge cases follows (e.g. the ignoring of the missing final quote, if I understand it correctly).

odrotbohm commented 7 months ago

Would you mind turning them into test cases and submitting a PR? I'll see what I can do then. If you fancy taking a stab at a better parsing implementation, feel free to submit that, too.

odrotbohm commented 3 months ago

This is fixed by merging and polishing your PR. Also back-ported to 2.3.x and 2.2.x.

viliam-durina commented 3 months ago

How you renamed the issue made my day: so it wasn't a bug, but a new, advanced feature?? I renamed it back, unfortunatelly, I can't rename the commit...

We needed to parse a Link. My junior colleague wanted to do a quick-and-dirty parser. I told to him that there's an RFC, rather look for a lib because you won't make it correctly. We found this lib, but realized that it had exactly the same issues I worried a junior would make. And now you renamed it to "support for advanced expressions". Is a semicolon in a String literal advanced expression? Just tell me, would you really say so?