dimas-b opened 1 month ago
Do we have a definition of "special characters"? Would it be enough to apply e.g. percent encoding to all base locations?
I suspect this can become pretty nasty later - some locations (provided "externally") are "properly escaped" and some are not. Ideally, we/Nessie should store "properly escaped" locations - the big question is probably: how do we know when a location that's provided "to us/Nessie" is already escaped (so we don't escape it twice or more often).
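To illustrate why "is it already escaped?" is hard to detect: percent-encoding is not idempotent, so encoding an already-encoded location corrupts it. A minimal sketch using the JDK's `URLEncoder` (the example string is made up):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        // Looks "already escaped" - but there is no way to know for sure.
        String maybeEscaped = "a%20b";
        String encodedAgain = URLEncoder.encode(maybeEscaped, StandardCharsets.UTF_8);
        // The '%' itself gets re-encoded to "%25":
        System.out.println(encodedAgain); // a%2520b
    }
}
```

Decoding `a%2520b` yields `a%20b`, not `a b` — so one stray extra encoding pass silently changes the location.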
I do not think URL encoding is interoperable with `java.net.URI`, Iceberg's `S3URI` and AWS `S3Utilities`... sadly... cf. https://github.com/apache/iceberg/pull/10329#discussion_r1599215769
I think we should permit only unreserved (per RFC 3986) chars in base locations.
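A strict check along those lines — RFC 3986 "unreserved" is `ALPHA / DIGIT / "-" / "." / "_" / "~"` — could look like this sketch (the method names are illustrative, not existing Nessie API):

```java
public class UnreservedCheck {
    // RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Hypothetical validation for a base-location path element.
    static boolean isSafeBaseLocationElement(String s) {
        return !s.isEmpty() && s.chars().allMatch(UnreservedCheck::isUnreserved);
    }

    public static void main(String[] args) {
        System.out.println(isSafeBaseLocationElement("my-table_v1")); // true
        System.out.println(isSafeBaseLocationElement("foo#bar"));     // false
    }
}
```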
Ah, you're right. Then we can only implement "our own" "safe escaping" for namespace/content-key elements. Forbidding would then mean that you cannot create tables or views (just because of such a character). I suspect we have to work on some rules for this - something like: replace disallowed characters with `_`. Something like that?

Unicode is tricky, too... but something that works transparently with `java.net.URI` is probably ok.
I think the transformation does not have to be reversible.
Are we ok with a destructive encoding function? I.e. if both `foo#` and `foo?` become `foo` (or `foo_`), the encoded result becomes ambiguous.
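A sketch of such a destructive encoder — mapping every non-unreserved character to `_` is my reading of the proposal above, not an existing implementation — which makes the collision concrete:

```java
public class DestructiveEncode {
    // Hypothetical destructive encoder: any char outside the RFC 3986
    // "unreserved" set is replaced by '_', so information is lost.
    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.chars().forEach(c -> sb.append(
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '-' || c == '.'
                || c == '_' || c == '~' ? (char) c : '_'));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("foo#")); // foo_
        System.out.println(encode("foo?")); // foo_  -- same result: ambiguous
    }
}
```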
Exactly what I was thinking :) Is there a good OSS impl.?
> destructive encoding function / encoded result becomes ambiguous
Ugh - true. However, entities have their Iceberg UUID in the name - so it should™ not be ambiguous? But your point's still valid.
I suspect we have to rigorously forbid special chars in object-store locations (as in `org.projectnessie.catalog.service.config.WarehouseConfig.location()`) but map/escape/destructive-encode content-key elements?
But no matter which encoding we use - we have to think about existing locations (which we must/should not change) and new locations.
Legacy system issues - not nice
Existing locations are covered by `StorageUri` (hopefully). I think if it used to work in Iceberg/Spark, it will keep working with Nessie.
> Exactly what I was thinking :) Is there a good OSS impl.?
It seems the JDK has one: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/IDN.html
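For reference, `java.net.IDN` round-trips through Punycode/IDNA like this (quick sketch; the input string is just an example):

```java
import java.net.IDN;

public class IdnSketch {
    public static void main(String[] args) {
        // toASCII produces Punycode with the IDNA "xn--" prefix:
        String ascii = IDN.toASCII("äbc");
        System.out.println(ascii);                // xn--bc-uia
        // Unlike a destructive '_' mapping, this is fully reversible:
        System.out.println(IDN.toUnicode(ascii)); // äbc
    }
}
```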
From my POC, the main concern with new locations is that stuff derived from Nessie `ContentKey` for new tables may still contain `#` and `%`, which will then break something on the Iceberg side.
With Unicode, `URI.toString()` does not percent-encode non-reserved Unicode chars (and is able to parse them back), but `URI.toASCIIString()` does encode them. The latter will then hit S3 interop problems, I'm afraid.
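The difference can be seen with the multi-argument `URI` constructor (bucket and path names here are made up):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriUnicode {
    public static void main(String[] args) throws URISyntaxException {
        // Multi-arg constructor; "tbl-ä" is an illustrative path element.
        URI u = new URI("s3", "bucket", "/wh/tbl-ä", null);
        System.out.println(u.toString());      // s3://bucket/wh/tbl-ä   (Unicode kept)
        System.out.println(u.toASCIIString()); // s3://bucket/wh/tbl-%C3%A4
    }
}
```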
On the other hand, Punycode will make Unicode path elements unreadable to humans in storage paths, which defeats the whole idea of using `ContentKey` for base locations, WDYT?
Yes, and I'm also concerned by the fact that it encodes all the ASCII characters first, then all the rest after, thus altering the natural sort order of the original names. E.g.

```
äbc -> bc-uia
žbc -> bc-1va
```
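This can be reproduced with `java.net.IDN` (which adds the `xn--` prefix on top of the raw Punycode shown above):

```java
import java.net.IDN;
import java.util.Arrays;

public class PunycodeOrder {
    public static void main(String[] args) {
        // Natural order: ä (U+00E4) sorts before ž (U+017E).
        String[] names = { "äbc", "žbc" };
        String[] encoded = Arrays.stream(names).map(IDN::toASCII).toArray(String[]::new);
        System.out.println(Arrays.toString(encoded)); // [xn--bc-uia, xn--bc-1va]
        Arrays.sort(encoded);
        // ASCII sort puts '1' before 'u', so žbc now sorts before äbc:
        System.out.println(Arrays.toString(encoded)); // [xn--bc-1va, xn--bc-uia]
    }
}
```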
Maybe collations et al?
Issue description
Coming from discussions on #8516.
Related to: