Closed kcolton closed 1 year ago
Hi, Well you are 100% right, it was definitely a bad decision. The only excuse we have is that it was a looong time ago and at that time the encoding used was not supposed to ever become public.
The bad news is that we cannot change this "legacy" encoding anymore because it's widely used by so many tools.
The good news is that we are currently thinking about extending the encoding by adding a single character header in front of the URL.
So here is our proposal about the new format:
If you intend to use Deflate compression:
1) Encoded in UTF-8
2) Compressed using Deflate algorithm
3) Reencoded in ASCII using Base 64 Encoding with URL and Filename Safe Alphabet ( https://tools.ietf.org/html/rfc3548 ) without padding character
4) Add an extra "0" header character
If you intend to use Brotli compression:
1) Encoded in UTF-8
2) Compressed using Brotli algorithm
3) Reencoded in ASCII using Base 64 Encoding with URL and Filename Safe Alphabet ( https://tools.ietf.org/html/rfc3548 ) without padding character
4) Add an extra "1" header character
This way, the decoder could use the initial header character to safely tell the "legacy" encoding (which never starts with "0" or "1") apart from the regular Base64 encoding. It also allows future extensions by using other header characters.
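For readers following along, here is a minimal Python sketch of the Deflate variant of this proposal. It is only an illustration under stated assumptions: it assumes "Deflate" means a raw Deflate stream without the zlib wrapper (the proposal does not spell this out), and the function names `encode_proposed` and `decode_proposed` are invented for this example. Brotli is left out because it is not part of the Python standard library.

```python
import base64
import zlib


def encode_proposed(text: str, header: str = "0") -> str:
    """UTF-8 -> raw Deflate -> URL-safe Base64 without padding -> header prefix."""
    raw = text.encode("utf-8")
    # wbits=-15 requests a raw Deflate stream (assumption, see lead-in).
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    compressed = compressor.compress(raw) + compressor.flush()
    body = base64.urlsafe_b64encode(compressed).decode("ascii").rstrip("=")
    return header + body


def decode_proposed(encoded: str, header: str = "0") -> str:
    """Reverse of encode_proposed; rejects input without the expected header."""
    if not encoded.startswith(header):
        raise ValueError("missing header character, possibly a legacy string")
    body = encoded[len(header):]
    body += "=" * (-len(body) % 4)  # restore the padding that was stripped
    compressed = base64.urlsafe_b64decode(body)
    return zlib.decompress(compressed, -15).decode("utf-8")
```

Everything here comes from the standard library, which is the point of the original request.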
Note that this is only a proposal and currently not implemented. It also slightly differs from what is explained on http://plantuml.com/text-encoding
What do you think about it ?
@arnaudroques Thanks for the reply. That sounds like all too familiar of a situation 😅. Definitely understand.
Having the "permalinks" is a great feature and breaking backwards compat would definitely be bad with so much already sitting out there and tools generating that format already.
I see that's already been added since I last pulled the repo :)
🚨 Doesn't the new logic break existing Brotli links though?
Doesn't it change leading 0
from: Brotli + Original Custom URL Safe Encoding
to: Zlib + new base 64 decode
Just based off a quick compare of what I have checked out from last week vs the new logic. Did the Brotli option never actually go live so no existing links? (I only started to look at the code last week)
Apologies for not being able to look more deeply at release history, the new encoder logic or any tests. At the office atm. :P
Doesn't the new logic break existing Brotli links though?
Yes it does. Fortunately, Brotli encoding has never been officially released or documented. So there are no "permalinks" yet. We were just about to release it, so your suggestion of using Base64 arrives at just the right time :-)
Yes, we have already committed the leading "0" / leading "1" option, but we realized that we did it too quickly: we figured out that some "legacy" Deflate encodings do start with "1".
So we are going to release yet another version, where the new headers will be different:
0A for Deflate + Base64
0B for Brotli + Base64
Does it sound good to you ?
Yes it does. Fortunately, Brotli encoding has never been officially released or documented.
Phew haha.
My instinct would be to change something else in the URI, provided there are no tricky gotchas or other constraints with it. If the only constraint was maintaining backwards compatibility, and the objective was to create a URI schema that is:
With all of that, this would be the scheme I would propose.
/render/{base64DiagramData}.{fileFormat}[?compression={optInToCompression}]
?compression= query parameter, with the default being no compression. I think it's very valuable to have the compression explicitly marked in the URL (whether by prefix, query param, whatever) to be extensible and not have to worry about changing the default later. No compression is actually the version I would use for most cases.
Examples:
/png/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
/svg/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
/txt/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
/render/QHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA==.svg
/render/eNpzKC5JLCopzc3h5XLKT1LQtVNwzMlMTlWwUshIzcnJ5+VySM1LAUoDAAfODNo=.png?compression=brotli
/render/eNpzKC5JLCopzc3h5XLKT1LQtVNwzMlMTlWwUshIzcnJ5+VySM1LAUoDAAfODNo=.svg?compression=deflate
/render/eNpzKC5JLCopzc3h5XLKT1LQtVNwzMlMTlWwUshIzcnJ5+VySM1LAUoDAAfODNo=.txt?compression=futurezip
If for some reason only the code generation could be changed, I would do something like:
$deflate$xpY2UgOiBoZWxY2UgOiBoZWxsbw0KQGVuZHVtbA==
$brotli$DQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA==
$none$QHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA==
Thoughts? @arnaudroques
If this is interesting, even as a longer term thing, I could start a PR over the weekend to see its feasibility.
And yes - I took some inspiration from the bcrypt prefixes :) https://en.wikipedia.org/wiki/Bcrypt
Well, there are two different things, although highly related:
Text encoding is just a way of storing information. "PlantUML text encoding" describes how to store a textual diagram description in a String that is easy to transmit. The text encoding does not say anything about what you are going to do with this diagram description. Are you going to print it? To parse it to get a PNG image? Or an SVG image? At the encoding level, we do not care. Of course, this text encoding has been designed to be used mainly in URLs/URIs.
So your suggestions about a new URI schema are good, but they should be moved to https://github.com/plantuml/plantuml-server/issues because they are related to the HTTP server, not to the core library. The core library knows nothing about URIs/URLs. We have simply integrated the "PlantUML Text Encoding" into the core library itself because it helps external tools interoperate with each other.
So I agree with your objective to create a Text Encoding that is (I've added some) :
Inspiration from bcrypt prefixes is OK, but the $ character is not transfer safe. We could turn it into '-' for example, so we would have:
-deflate-xpY2UgOiBoZWxY2UgOiBoZWxsbw0KQGVuZHVtbA
-brotli-DQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA
-none-QHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA
But this is not very compact. I prefer the following ones:
0AxpY2UgOiBoZWxY2UgOiBoZWxsbw0KQGVuZHVtbA
0BDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA
0CQHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA
And what about adding simple hex encoding ? This is even simpler than Base64 to implement.
In version 1.2018.5 we have implemented some stuff, (see https://github.com/plantuml/plantuml/blob/master/src/net/sourceforge/plantuml/code/TranscoderSmart.java ) so that you can test and tell us if you like it.
So back to URL (despite what we have written :-), the following examples are now working:
http://www.plantuml.com/plantuml/png/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
http://www.plantuml.com/plantuml/uml/-base64-Qm9iIC0-IEFsaWNlIDogaGVsbG8
http://www.plantuml.com/plantuml/uml/0CQm9iIC0-IEFsaWNlIDogaGVsbG8
http://www.plantuml.com/plantuml/uml/-hex-426F62202D3E20416C696365203A2068656C6C6F
http://www.plantuml.com/plantuml/uml/0D426F62202D3E20416C696365203A2068656C6C6F
http://www.plantuml.com/plantuml/uml/-zlib-c8pPUtC1U3DMyUxOVbBSyEjNyckH (here, we are going to change -zlib- to -deflate-, which makes more sense)
http://www.plantuml.com/plantuml/uml/0Ac8pPUtC1U3DMyUxOVbBSyEjNyckH
Last examples are not (yet) permanent: the discussion is still in progress :-)
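To make the header idea concrete, here is a rough Python sketch that dispatches on the draft headers used in the example URLs of this comment. It is not the PlantUML implementation: the name `decode_with_header` is invented, the Deflate branches assume a raw Deflate stream, and the headers themselves were still under discussion at this point in the thread.

```python
import base64
import zlib


def _b64url(body: str) -> bytes:
    # URL-safe Base64 with the stripped "=" padding restored.
    return base64.urlsafe_b64decode(body + "=" * (-len(body) % 4))


def _inflate(data: bytes) -> bytes:
    # Raw Deflate (no zlib wrapper); this is an assumption of the sketch.
    return zlib.decompressobj(wbits=-15).decompress(data)


# Draft headers as they appear in the example URLs; none were final.
DECODERS = {
    "-deflate-": lambda body: _inflate(_b64url(body)),
    "-zlib-": lambda body: _inflate(_b64url(body)),
    "0A": lambda body: _inflate(_b64url(body)),
    "-base64-": _b64url,
    "0C": _b64url,
    "-hex-": bytes.fromhex,
    "0D": bytes.fromhex,
}


def decode_with_header(encoded: str) -> str:
    """Pick a decoder from the leading header, then decode the rest as UTF-8."""
    for header, decode in DECODERS.items():
        if encoded.startswith(header):
            return decode(encoded[len(header):]).decode("utf-8")
    raise ValueError("no known header; this may be a legacy-encoded string")
```

With this sketch, both `-base64-Qm9iIC0-IEFsaWNlIDogaGVsbG8` and `-hex-426F62202D3E20416C696365203A2068656C6C6F` decode to `Bob -> Alice : hello`.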
I've got an example of how other systems can impact the URLs.
Quip will mangle pasted URLs where there is a matching pair of '_'. The url will now include ''. I've filed a bug on their side, but since I'm hosting my own rendering, I catch and convert the invalid URL.
One thing missing in the encoding issue and the plantuml documentation (making this a much bigger issue) is how the plantuml encoding differs from base64. The difference is actually not that big. Where in base64 the mapping array for values 0-63 is:
{ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 + /}
for plantuml the array is:
{0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z - _}
Going from one to the other is thus a single string mapping.
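As a small illustration of that single string mapping, here is a Python sketch using `str.translate`. The helper names are invented for this example, and it only covers the alphabet swap described above; padding and the compression step are not handled.

```python
# Standard Base64 alphabet (values 0-63) and the PlantUML alphabet from above.
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
PLANTUML_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "-_"
)

_TO_BASE64 = str.maketrans(PLANTUML_ALPHABET, BASE64_ALPHABET)
_TO_PLANTUML = str.maketrans(BASE64_ALPHABET, PLANTUML_ALPHABET)


def plantuml_to_base64(encoded: str) -> str:
    """Swap each PlantUML character for the Base64 character with the same value."""
    return encoded.translate(_TO_BASE64)


def base64_to_plantuml(encoded: str) -> str:
    """Inverse mapping: standard Base64 characters to PlantUML characters."""
    return encoded.translate(_TO_PLANTUML)
```

For example, `plantuml_to_base64("SyfFKj2rKt3CoKnELR1Io4ZDoSa70000")` gives `c8pPUtC1U3DMyUxOVbBSyEjNyckHAAAA`, which can then be passed to any standard Base64 decoder.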
It has taken a very long time, but we have updated the documentation :-) https://plantuml.com/en/text-encoding
Finally, we have chosen ~ as the header for formats other than Deflate.
Currently, only the hex format is officially supported, using ~h as header
(For example http://www.plantuml.com/plantuml/uml/~h407374617274756d6c0a416c6963652d3e426f62203a204920616d207573696e67206865780a40656e64756d6c )
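The hex format is simple enough to sketch in a few lines of Python. This is an illustration, not PlantUML's code: the function names are invented, and it assumes the text is UTF-8 encoded before being written as lowercase hex, which matches the example URL above.

```python
def encode_hex(text: str) -> str:
    """Encode diagram text as '~h' followed by the lowercase hex of its UTF-8 bytes."""
    return "~h" + text.encode("utf-8").hex()


def decode_hex(encoded: str) -> str:
    """Reverse of encode_hex; rejects strings without the '~h' header."""
    if not encoded.startswith("~h"):
        raise ValueError("not a ~h (hex) encoded string")
    return bytes.fromhex(encoded[2:]).decode("utf-8")


diagram = "@startuml\nAlice->Bob : I am using hex\n@enduml"
print(encode_hex(diagram))
```

Run on the diagram above, this reproduces the `~h4073...756d6c` string from the example URL.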
We are working on brotli compression right now (the header will probably be ~b). For brotli, we are not sure whether we should use standard Base64 or the PlantUML transformation.
We could also use ~d for deflate with standard Base64 instead of the current PlantUML transformation.
Does it sound good to you?
Hello!
I'm the creator of kroki.io, a service that provides a unified API on top of popular diagrams libraries including PlantUML.
The Kroki API is using deflate + base64 but I also support the "legacy" encoding using the following code:
String text = URLDecoder.decode(source, "UTF-8");
try {
  Transcoder transcoder = TranscoderUtil.getDefaultTranscoder();
  text = transcoder.decode(text);
} catch (ArrayIndexOutOfBoundsException | IOException e) {
  // Unable to decode with the PlantUML decoder, try the default decoder
  text = DiagramSource.decode(text);
}
return text;
The above code is still working but System.err.println statements were added in the latest versions and now the output is really verbose:
java.io.IOException: java.util.zip.DataFormatException: invalid stored block lengths
at net.sourceforge.plantuml.code.CompressionZlib.tryDecompress(CompressionZlib.java:130)
at net.sourceforge.plantuml.code.CompressionZlib.decompress(CompressionZlib.java:92)
at net.sourceforge.plantuml.code.TranscoderImpl.decode(TranscoderImpl.java:83)
at net.sourceforge.plantuml.code.TranscoderSmart.decode(TranscoderSmart.java:60)
at io.kroki.server.decode.DiagramSource.unsafePlantumlDecode(DiagramSource.java:48)
at io.kroki.server.decode.DiagramSource.plantumlDecode(DiagramSource.java:38)
at io.kroki.server.service.Plantuml$1.decode(Plantuml.java:100)
// stacktrace continues...
Cannot decode string
Not Huffman
Cannot decode string
As far as I know, PlantUML does not use a logging library or slf4j so I cannot suppress the errors/warnings. What is the best approach to try to decode the text using the legacy format without all this noise? Since we are trying to guess the encoding I'm not sure that it's a good idea to print the exception to stderr because it's expected that the decode method will throw an exception (if we guessed wrong).
I guess another solution would be to change the order in my code:
String text = URLDecoder.decode(source, "UTF-8");
try {
  // Try the default Kroki decoder
  text = DiagramSource.decode(text);
} catch (ArrayIndexOutOfBoundsException | IOException e) {
  // Unable to decode with the Kroki decoder, try the PlantUML decoder
  Transcoder transcoder = TranscoderUtil.getDefaultTranscoder();
  text = transcoder.decode(text);
}
return text;
Thanks for your help!
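For clients written in other languages, the same guess-and-fall-back pattern can be sketched in Python as follows. This is a generic illustration with invented names (`decode_first_match`, `hex_decoder`, `base64url_decoder`), not the Kroki or PlantUML code. It collects the expected failures quietly instead of printing them, and the order of the decoders matters because a lenient decoder may accept input meant for another one.

```python
import base64
from typing import Callable, Sequence


def decode_first_match(source: str, decoders: Sequence[Callable[[str], str]]) -> str:
    """Try each decoder in order and return the first successful result."""
    errors = []
    for decode in decoders:
        try:
            return decode(source)
        except ValueError as exc:
            # A wrong guess is expected here, so record it instead of printing.
            # binascii.Error and UnicodeDecodeError are ValueError subclasses.
            errors.append(exc)
    raise ValueError(f"no decoder accepted the input: {errors}")


def hex_decoder(source: str) -> str:
    return bytes.fromhex(source).decode("utf-8")


def base64url_decoder(source: str) -> str:
    padded = source + "=" * (-len(source) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# The strict decoder goes first; the more lenient one is the fallback.
print(decode_first_match("Qm9iIC0-IEFsaWNlIDogaGVsbG8", [hex_decoder, base64url_decoder]))
```

The caller only sees an error when every decoder has rejected the input.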
We have taken the easiest solution: we have removed the System.err.println in the latest beta http://beta.plantuml.net/plantuml.jar
Does it sound better to you ?
We have taken the easiest solution: we have removed the System.err.println in last beta http://beta.plantuml.net/plantuml.jar Does it sound better to you ?
Sure, it sounds good, thanks Arnaud :+1:
Note: I am replying here because this ticket is linked in the docs in relation to brotli...
Is support for brotli compression actually implemented and working?
The documentation at https://plantuml.com/text-encoding states that one should be able to use it.
But the brotli link that is provided in this section, where the diagram is compressed to a
[428-char string length using Deflate](http://www.plantuml.com/plantuml/uml/ZP4zRy8m48Pt_ueJdHawLMMe53iWLK9LLLHrFk8hM04xd9rI-kiR0u8a1CAG8SdpVja-DxP0nZNCCSiNx4ghbLivXeVnU2nJ9Vo9MABLMpOXa8N09OdQFq-Racn62RFR7WnIecAMx-Ig8Zl0B3YMZZNnFVZKV5UFfRfYteEs1oNFgKed7Oftv60oKw0DzpUgYrf9gTCBudxTnDdmXck2rtM1MUY7P-QFuF6f7-nRV3ZzLctSb7YDFPlUSQstgvwG_VG42_eL0kD7-FJ4eZXFWS74i0-WLkZz0D13RMVI96UKEQkxKTb4ftZDKmaHEy3mfP4qggxqot4UQveV3DJi8UflBQqSWMAAae-utuTE3zdma2qFTJDVrQKAXXVvKPawIq9NyUnshS7Du8lbnzh75ReowH-Gx7tYISRc--JEa_i7)
[393-char string length using Brotli](http://www.plantuml.com/plantuml/uml/06tq404I5ENsC6cXKT6xxgxaBDG3o_tzkTRbkDJuRa4mYLIoIEFVZsapwhAr5NDHB0jrZfWK5MOp8y53KKy_J2adzUr-HCAJ8bVfEA7x6qMwXhNtcUJYCT4gMZV_c2gzJk9gimqo81bOfXLN-tkYpiaWi3aabF_wrItuxPLX5NINL6FKhAboWjmXbI8jiBfIRnXs0h40re09D-HpekC83iDO8GEXFHTCMoPlXmKCx05mLK_fTdsZCJY1geDzQhs6aar6qIIfU1V5QYHQ5wvIFj8v6xZE0zWeZksK6S5mgnjDR9L1ao9uZg7-gFLepB1EeGVeUILgETgbkEtuAelaPmmGV7kEDjEioz1d6D_0WvvYzGbScBwP8CnA9ZosEQ2vGlBk_7W00)
does not seem to be correct:
I was experimenting with getting brotli compression working, but unsuccessful so far. Is there a working example somewhere?
Hi there,
I was wondering if there was a particular reason that AsciiEncoder does not just use standard base 64 encoding. Are there important benefits over standard b64 encoding?
The use case where this is problematic is in trying to create a new client (particularly one that is not Java) that constructs properly formed URLs to send to a PlantUML server.
It means having to essentially copy, paste and translate the Java code (or the PHP or JS version available in the docs) into Python.
Isn't it just much easier to say the encoded format is:
That transformation can be implemented in almost every language using off the shelf components and not having to potentially re-implement encoding, add the entire jar, or make a subprocess call, just to generate the encoded string.