projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

Do not allow special characters in base table locations #8524

Open dimas-b opened 1 month ago

dimas-b commented 1 month ago

Issue description

Coming from discussions on #8516.

Related to:

adutra commented 1 month ago

Do we have a definition of "special characters"? Would it be enough to apply e.g. percent encoding to all base locations?
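For illustration, java.net.URI's multi-argument constructors apply exactly this kind of percent-encoding to characters that are illegal in a path; a minimal sketch (the `s3` scheme and `bucket` authority are placeholders, not a real warehouse location):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PercentEncodeDemo {
    // Let java.net.URI quote characters that are illegal in a path component.
    static String encodePath(String rawPath) {
        try {
            return new URI("s3", "bucket", rawPath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/warehouse/tbl with space")); // /warehouse/tbl%20with%20space
        System.out.println(encodePath("/warehouse/tbl#part"));       // /warehouse/tbl%23part
    }
}
```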

snazy commented 1 month ago

I suspect this can become pretty nasty later - some locations (provided "externally") are "properly escaped" and some are not. Ideally, we/Nessie should store "properly escaped" locations - the big question is probably: how do we know when a location that's provided "to us/Nessie" is already escaped (so we don't escape it twice or more often).
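The "escaped twice" hazard is easy to reproduce: java.net.URI's multi-argument constructors always quote a literal `%`, so re-escaping an already-escaped path corrupts it. A small sketch of the problem:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEscapeDemo {
    // The multi-argument URI constructors always quote '%', so escaping an
    // already-escaped path escapes it a second time.
    static String escape(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String once  = escape("/tbl name"); // "/tbl%20name"
        String twice = escape(once);        // "/tbl%2520name" -- '%' got re-escaped
        System.out.println(once + " -> " + twice);
    }
}
```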

dimas-b commented 1 month ago

I do not think URL encoding is interoperable with URI, Iceberg's S3URI and AWS S3Utilities... sadly... cf. https://github.com/apache/iceberg/pull/10329#discussion_r1599215769

dimas-b commented 1 month ago

I think we should permit only unreserved (per RFC 3986) chars in base locations.
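A check for that rule is straightforward; this is a hypothetical validator (the method names are illustrative, not Nessie API) over the RFC 3986 unreserved set:

```java
public class UnreservedCheck {
    // RFC 3986 "unreserved" = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // A base-location segment would be accepted only if every char is unreserved.
    static boolean isSafeSegment(String s) {
        return !s.isEmpty() && s.chars().allMatch(UnreservedCheck::isUnreserved);
    }

    public static void main(String[] args) {
        System.out.println(isSafeSegment("my_table-v2")); // true
        System.out.println(isSafeSegment("tbl#1"));       // false
    }
}
```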

snazy commented 1 month ago

Ah, you're right. Then we can only implement "our own" "safe escaping" for namespace/content-key elements. Forbidding would then mean that you cannot create tables or views (just because of such a character). I suspect we have to work on some rules for this - something like:

Something like that?

dimas-b commented 1 month ago

Unicode is tricky too... but something that works transparently with java.net.URI is probably ok.

dimas-b commented 1 month ago

I think the transformation does not have to be reversible.

adutra commented 1 month ago

Are we ok with a destructive encoding function? I.e. if both foo# and foo? become foo (or foo_), the encoded result becomes ambiguous.
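To make the ambiguity concrete, here is a hypothetical destructive sanitizer (not anything Nessie ships) that maps every non-unreserved character to `_`:

```java
public class DestructiveEncodeDemo {
    // Hypothetical destructive sanitizer: every non-unreserved char becomes '_'.
    static String sanitize(String s) {
        return s.replaceAll("[^A-Za-z0-9._~-]", "_");
    }

    public static void main(String[] args) {
        // "foo#" and "foo?" collide after sanitizing -- the mapping is not injective.
        System.out.println(sanitize("foo#")); // foo_
        System.out.println(sanitize("foo?")); // foo_
    }
}
```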

adutra commented 1 month ago

https://en.wikipedia.org/wiki/Punycode ?

dimas-b commented 1 month ago

Exactly what I was thinking :) Is there a good OSS impl.?

snazy commented 1 month ago

destructive encoding function / encoded result becomes ambiguous

Ugh - true. However, entities have their Iceberg-UUID in the name - so it should™ not be ambiguous? But your point's still valid.

I suspect we have to rigorously forbid special-chars in object-store locations (as in org.projectnessie.catalog.service.config.WarehouseConfig.location()) but map/escape/destructively encode content-key elements?

snazy commented 1 month ago

But no matter which encoding we use - we have to think about existing locations (which we must/should not change) and new locations.

snazy commented 1 month ago

Legacy system issues - not nice

dimas-b commented 1 month ago

Existing locations are covered by StorageUri (hopefully). I think if it used to work in Iceberg/Spark, it will keep working with Nessie.

adutra commented 1 month ago

Exactly what I was thinking :) Is there a good OSS impl.?

It seems the JDK has one: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/IDN.html
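A quick sketch of java.net.IDN in action - note it is designed for hostname labels (it prepends the IDNA "xn--" prefix), so using it for path segments would be a repurposing:

```java
import java.net.IDN;

public class IdnDemo {
    static String toAscii(String s)   { return IDN.toASCII(s); }
    static String toUnicode(String s) { return IDN.toUnicode(s); }

    public static void main(String[] args) {
        String ascii = toAscii("\u00e4bc"); // "xn--bc-uia" -- Punycode plus the "xn--" prefix
        String back  = toUnicode(ascii);    // "äbc" -- the encoding round-trips
        System.out.println(ascii + " <-> " + back);
    }
}
```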

dimas-b commented 1 month ago

From my POC, the main concern with new locations is that paths derived from the Nessie ContentKey for new tables may still contain # and %, which will then break something on the Iceberg side.
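For example, a `#` that leaks from a key element into a location is parsed as the start of a URI fragment, silently truncating the object-store path (bucket and path below are placeholders):

```java
import java.net.URI;

public class FragmentSurpriseDemo {
    static String pathOf(String s)     { return URI.create(s).getPath(); }
    static String fragmentOf(String s) { return URI.create(s).getFragment(); }

    public static void main(String[] args) {
        // Everything after '#' becomes the fragment, not part of the path.
        String loc = "s3://bucket/warehouse/tbl#1/data/file.parquet";
        System.out.println(pathOf(loc));     // /warehouse/tbl
        System.out.println(fragmentOf(loc)); // 1/data/file.parquet
    }
}
```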

dimas-b commented 1 month ago

With Unicode, URI.toString() does not percent-encode non-reserved Unicode chars (and is able to parse them back), but URI.toASCIIString() does encode them. The latter will then hit S3 interop problems, I'm afraid.
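The difference in one sketch (again, `s3`/`bucket` are placeholders):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class AsciiVsUnicodeDemo {
    static URI mk(String path) {
        try {
            return new URI("s3", "bucket", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI u = mk("/p\u00e4th");
        System.out.println(u.toString());      // s3://bucket/päth      -- Unicode kept as-is
        System.out.println(u.toASCIIString()); // s3://bucket/p%C3%A4th -- percent-encoded UTF-8
    }
}
```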

dimas-b commented 1 month ago

On the other hand, Punycode will make Unicode path elements unreadable to humans in storage paths, which defeats the whole idea of using ContentKey for base locations, WDYT?

adutra commented 1 month ago

Yes, and I'm also concerned by the fact that it encodes all the ASCII characters first, then all the rest after, thus altering the natural sort order of original names. E.g.

äbc -> bc-uia
žbc -> bc-1va
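The sort-order inversion from these two examples can be reproduced with java.net.IDN:

```java
import java.net.IDN;

public class SortOrderDemo {
    public static void main(String[] args) {
        String a = "\u00e4bc"; // äbc
        String z = "\u017ebc"; // žbc -- sorts after äbc in code-point order
        String ea = IDN.toASCII(a); // "xn--bc-uia"
        String ez = IDN.toASCII(z); // "xn--bc-1va"
        // '1' < 'u', so the encoded names sort in the opposite order:
        System.out.println(a.compareTo(z) < 0);   // true
        System.out.println(ea.compareTo(ez) > 0); // true
    }
}
```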

snazy commented 1 month ago

Maybe collations et al?