Why use something "close" to instead of just base 64 encoding diagram strings?

plantuml / plantuml

Generate diagrams from textual description

https://plantuml.com


Why use something "close" to instead of just base 64 encoding diagram strings? #117

Closed kcolton closed 1 year ago

kcolton commented 6 years ago

Hi there,

I was wondering if there was a particular reason that AsciiEncoder does not just use standard base 64 encoding. Are there important benefits over standard b64 encoding?

The use case where this is problematic is in trying to create a new client (particularly one that is not Java) that constructs properly formed URLs to send to a PlantUML server.

It means essentially copying, pasting and translating the Java code (or the PHP or JS version available in the docs) into Python.

Isn't it just much easier to say the encoded format is:

utf-8
compressed with deflate or brotli
and standard base 64 encoded

That transformation can be implemented in almost every language using off the shelf components and not having to potentially re-implement encoding, add the entire jar, or make a subprocess call, just to generate the encoded string.
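As a rough illustration of the three steps listed above, here is a minimal Python sketch using only the standard library. The function names are made up for this example, and it uses zlib-wrapped deflate (what `zlib.compress` produces); neither is something PlantUML itself defines.

```python
import base64
import zlib


def encode_diagram(text: str) -> str:
    """UTF-8 -> deflate (zlib wrapper) -> standard Base64."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return base64.b64encode(compressed).decode("ascii")


def decode_diagram(encoded: str) -> str:
    """Reverse the three steps: Base64 -> inflate -> UTF-8."""
    return zlib.decompress(base64.b64decode(encoded)).decode("utf-8")


source = "@startuml\nBob -> Alice : hello\n@enduml"
token = encode_diagram(source)
print(token)
print(decode_diagram(token) == source)
```

Note that the standard Base64 alphabet contains `+`, `/` and `=`, so a token produced this way would still need percent-encoding before being placed in a URL.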

arnaudroques commented 6 years ago

Hi, Well you are 100% right, it was definitely a bad decision. The only excuse we have is that it was a looong time ago and at that time the encoding used was not supposed to ever become public.

The bad news is that we cannot change this "legacy" encoding anymore because it's widely used by so many tools.

The good news is that we are currently thinking about extending the encoding by adding a single character header in front of the URL.

So here is our proposal about the new format:

If you intend to use Deflate compression:

1) Encoded in UTF-8
2) Compressed using Deflate algorithm
3) Reencoded in ASCII using Base 64 Encoding with URL and Filename Safe Alphabet ( https://tools.ietf.org/html/rfc3548 ) without padding character
4) Add an extra "0" header character

If you intend to use Brotli compression:

1) Encoded in UTF-8
2) Compressed using Brotli algorithm
3) Reencoded in ASCII using Base 64 Encoding with URL and Filename Safe Alphabet ( https://tools.ietf.org/html/rfc3548 ) without padding character
4) Add an extra "1" header character

This way, the decoder could safely distinguish "legacy" encoding (because "legacy" never starts with "0" or "1") from regular Base64 encoding by looking at the initial header character. It also allows future extensions by using other header characters.

Note that this is only a proposal and currently not implemented. It also slightly differs from what is explained on http://plantuml.com/text-encoding

What do you think about it ?
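The Deflate variant of the proposal in this comment could be sketched in Python roughly as follows. This is only an illustration of the four listed steps, not the PlantUML implementation: the function names are hypothetical, and whether "Deflate" means a raw stream or a zlib-wrapped one is not stated, so the sketch assumes the zlib wrapper.

```python
import base64
import zlib

DEFLATE_HEADER = "0"  # header character taken from the proposal in this comment


def encode_with_header(text: str) -> str:
    """UTF-8 -> deflate -> URL-safe Base64 without padding -> '0' header."""
    compressed = zlib.compress(text.encode("utf-8"))  # assumption: zlib-wrapped deflate
    b64 = base64.urlsafe_b64encode(compressed).decode("ascii")
    return DEFLATE_HEADER + b64.rstrip("=")


def decode_with_header(encoded: str) -> str:
    """Strip the header, restore the padding, then reverse the other steps."""
    if not encoded.startswith(DEFLATE_HEADER):
        raise ValueError("no '0' header: legacy string or another scheme")
    body = encoded[1:]
    body += "=" * (-len(body) % 4)  # Base64 input length must be a multiple of 4
    return zlib.decompress(base64.urlsafe_b64decode(body)).decode("utf-8")


token = encode_with_header("Bob -> Alice : hello")
print(token)
print(decode_with_header(token))
```

The URL-safe alphabet replaces `+` and `/` with `-` and `_`, and dropping the `=` padding keeps the whole string within letters, digits, `-` and `_`.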

kcolton commented 6 years ago

@arnaudroques Thanks for the reply. That sounds like all too familiar of a situation 😅. Definitely understand.

Having the "permalinks" is a great feature and breaking backwards compat would definitely be bad with so much already sitting out there and tools generating that format already.

I see that's already been added since I last pulled the repo :)

🚨 Doesn't the new logic break existing Brotli links though?

Doesn't it change leading 0
from: Brotli + Original Custom URL Safe Encoding
to: Zlib + new base 64 decode

Just based off a quick compare of what I have checked out from last week vs the new logic. Did the Brotli option never actually go live so no existing links? (I only started to look at the code last week)

https://github.com/plantuml/plantuml/blob/master/src/net/sourceforge/plantuml/code/TranscoderSmart.java#L48


Apologies for not being able to look more deeply at release history, the new encoder logic or any tests. At the office atm. :P

arnaudroques commented 6 years ago

Doesn't the new logic break existing Brotli links though?

Yes it does. Fortunately, Brotli encoding has never been officially released or documented. So there are no "permalinks" yet. We were just about to release it, so your suggestion of using Base64 arrives at just the right time :-)

Yes, we have already committed the leading "0" / leading "1" option, but we realized that we did it too quickly: we figured out that some "legacy" Deflate encodings do start with "1".

So we are going to release yet another version, where the new headers will be different:

0A for Deflate + Base64
0B for Brotli + Base64

Does it sound good to you ?

kcolton commented 6 years ago

Yes it does. Fortunately, Brotli encoding has never been officially released or documented.

Phew haha.

My instinct would be to change something else in the URI, provided there are no tricky gotchas or other constraints. Suppose the only constraint is maintaining backwards compatibility, and the objective is to create a URI schema that is:

Backwards compatible - Permanent means permanent
Extensible - easy to accept new formats
Maintainable - easy to read code and docs - avoid future regrets
Intuitive - needs less docs to use as a client / user of server. (this is the area I think has the biggest impact / high ROI on additional technical effort. base 64 a nice step in that direction)

With all of that, this would be the scheme I would propose.

/render/{base64DiagramData}.{fileFormat}[?compression={optInToCompression}]

The first piece of the URL distinguishes "new" formatted URLs.
The type goes where it would be expected making saving a bit easier and helping tools that guess type by extension, not response Content-type.
Optional ?compression= query parameter, with the default being no compression. I think it's very valuable to have the compression explicitly marked in the URL (whether by prefix, query param, whatever) to be extensible and not have to worry about changing default later. No compression is actually the version I would use for most cases.

Examples:

/png/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
/svg/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
/txt/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
/render/QHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA==.svg
/render/eNpzKC5JLCopzc3h5XLKT1LQtVNwzMlMTlWwUshIzcnJ5+VySM1LAUoDAAfODNo=.png?compression=brotli
/render/eNpzKC5JLCopzc3h5XLKT1LQtVNwzMlMTlWwUshIzcnJ5+VySM1LAUoDAAfODNo=.svg?compression=deflate
/render/eNpzKC5JLCopzc3h5XLKT1LQtVNwzMlMTlWwUshIzcnJ5+VySM1LAUoDAAfODNo=.txt?compression=futurezip
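To make the proposed /render/ scheme concrete, here is a hypothetical Python helper that builds such paths. Nothing here exists on a PlantUML server; the function name, the set of accepted compression values, and the choice of zlib-wrapped deflate at level 9 are all assumptions made for illustration.

```python
import base64
import zlib
from typing import Optional


def build_render_path(text: str, file_format: str, compression: Optional[str] = None) -> str:
    """Build a /render/{base64DiagramData}.{fileFormat}[?compression=...] path."""
    raw = text.encode("utf-8")
    if compression is None:
        payload, query = raw, ""  # default in the proposal: no compression
    elif compression == "deflate":
        payload, query = zlib.compress(raw, 9), "?compression=deflate"
    else:
        raise ValueError(f"unsupported compression: {compression}")
    # Standard Base64, as in the examples above. Its alphabet includes '+' and
    # '/', so a real scheme would need the URL-safe alphabet or percent-encoding.
    data = base64.b64encode(payload).decode("ascii")
    return f"/render/{data}.{file_format}{query}"


diagram = "@startuml\r\nBob -> Alice : hello\r\n@enduml"
print(build_render_path(diagram, "svg"))
print(build_render_path(diagram, "png", compression="deflate"))
```

With no compression, the first call reproduces the uncompressed /render/ example above, whose payload is the Base64 of a three-line diagram with CRLF line endings.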

If for some reason only the code generation could be changed, I would do like:

$deflate$xpY2UgOiBoZWxY2UgOiBoZWxsbw0KQGVuZHVtbA==
$brotli$DQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA==
$none$QHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA==

Thoughts? @arnaudroques

If this is interesting - even if as a longer term thing, I could start a PR over the weekend to see its feasibility.

And yes - I took some inspiration from the bcrypt prefixes :) https://en.wikipedia.org/wiki/Bcrypt

arnaudroques commented 6 years ago

Well, there are two different things, although highly related:

URI schema
Text encoding

Text encoding is just a way of storing information. "PlantUML text encoding" describes how to store some textual diagram description into a String that is easy to transmit. The text encoding does not say anything about what you are going to do with this diagram description. Are you going to print it? To parse it to get a PNG image? Or an SVG image? At encoding level, we do not care. Of course, this text encoding has been designed to be mainly used in URL/URI.

So your suggestions about a new URI schema are good, but they should be moved to https://github.com/plantuml/plantuml-server/issues because they are related to the HTTP server, not to the core library. The core library knows nothing about URI/URL. We have simply integrated the "PlantUML Text Encoding" into the core library itself because it will help external tools to interoperate with each other.

So I agree with your objective to create a Text Encoding that is (I've added some) :

Backwards compatible - Permanent means permanent
Extensible - easy to accept new formats
Maintainable - easy to read code and docs - avoid future regrets
(not sure about Intuitive - this encoding may look like garbage for human readers)
Compact - Try to be short
Transfer safe - Use only Letter, Digit - and _ character

Inspiration from bcrypt prefixes is ok but the $ character is not transfer safe. We could turn it into '-' for example so we may have :

-deflate-xpY2UgOiBoZWxY2UgOiBoZWxsbw0KQGVuZHVtbA
-brotli-DQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA
-none-QHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA

But this is not very compact. I prefer the following ones:

0AxpY2UgOiBoZWxY2UgOiBoZWxsbw0KQGVuZHVtbA
0BDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA
0CQHN0YXJ0dW1sDQpCb2IgLT4gQWxpY2UgOiBoZWxsbw0KQGVuZHVtbA

And what about adding simple hex encoding ? This is even simpler than Base64 to implement.

In version 1.2018.5 we have implemented some stuff, (see https://github.com/plantuml/plantuml/blob/master/src/net/sourceforge/plantuml/code/TranscoderSmart.java ) so that you can test and tell us if you like it.

So back to URL (despite what we have written :-), the following examples are now working:

http://www.plantuml.com/plantuml/png/SyfFKj2rKt3CoKnELR1Io4ZDoSa70000
http://www.plantuml.com/plantuml/uml/-base64-Qm9iIC0-IEFsaWNlIDogaGVsbG8
http://www.plantuml.com/plantuml/uml/0CQm9iIC0-IEFsaWNlIDogaGVsbG8
http://www.plantuml.com/plantuml/uml/-hex-426F62202D3E20416C696365203A2068656C6C6F
http://www.plantuml.com/plantuml/uml/0D426F62202D3E20416C696365203A2068656C6C6F
http://www.plantuml.com/plantuml/uml/-zlib-c8pPUtC1U3DMyUxOVbBSyEjNyckH (here, we are going to change -zlib- to -deflate- that makes more sense)
http://www.plantuml.com/plantuml/uml/0Ac8pPUtC1U3DMyUxOVbBSyEjNyckH

Last examples are not (yet) permanent: the discussion is still in progress :-)
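For readers who want to check what the Base64 and hex examples above contain, a small Python sketch can dispatch on the header and decode the rest. The header table only mirrors the experimental headers quoted in this comment (which were explicitly not permanent), the function name is made up, and the deflate variants are left out.

```python
import base64


def _from_b64url(body: str) -> bytes:
    # The examples carry no '=' padding, so restore it before decoding.
    return base64.urlsafe_b64decode(body + "=" * (-len(body) % 4))


# Experimental headers quoted in this comment; not a stable specification.
HEADERS = [
    ("-base64-", _from_b64url),
    ("0C", _from_b64url),
    ("-hex-", bytes.fromhex),
    ("0D", bytes.fromhex),
]


def decode_prefixed(encoded: str) -> str:
    """Pick a decoder based on the header, then decode the remainder as UTF-8."""
    for header, decoder in HEADERS:
        if encoded.startswith(header):
            return decoder(encoded[len(header):]).decode("utf-8")
    raise ValueError("no known header: possibly a legacy-encoded string")


print(decode_prefixed("-base64-Qm9iIC0-IEFsaWNlIDogaGVsbG8"))
print(decode_prefixed("0D426F62202D3E20416C696365203A2068656C6C6F"))
```

Both calls print `Bob -> Alice : hello`, the diagram text used throughout the example URLs.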

trothwell commented 5 years ago

I've got an example of how other systems can impact the URLs.

Quip will mangle pasted URLs where there is a matching pair of '_'. The url will now include ''. I've filed a bug on their side, but since I'm hosting my own rendering, I catch and convert the invalid URL.

mpcjanssen commented 4 years ago

One thing missing in the encoding issue and the plantuml documentation (making this a much bigger issue) is how the plantuml encoding differs from base64. The difference is actually not that big. Where in base64 the mapping array for values 0-63 is:

{ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z  a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 + /}

for plantuml the array is:

 {0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z - _}

Going from one to the other is thus a single string mapping.
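The single string mapping described above can be sketched in Python with `str.translate`. The function names are hypothetical. The sketch assumes the legacy format is raw deflate (no zlib header), which is what the example string `SyfFKj2rKt3CoKnELR1Io4ZDoSa70000` appears to contain, and it zero-pads the compressed bytes to a multiple of three so that Base64 never emits `=`.

```python
import base64
import zlib

BASE64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
PLANTUML_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"

TO_PLANTUML = str.maketrans(BASE64_ALPHABET, PLANTUML_ALPHABET)
TO_BASE64 = str.maketrans(PLANTUML_ALPHABET, BASE64_ALPHABET)


def plantuml_encode(text: str) -> str:
    """UTF-8 -> raw deflate -> Base64 -> swap to the PlantUML alphabet."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate (assumption)
    data = compressor.compress(text.encode("utf-8")) + compressor.flush()
    data += b"\x00" * (-len(data) % 3)  # zero-pad so no '=' padding is produced
    return base64.b64encode(data).decode("ascii").translate(TO_PLANTUML)


def plantuml_decode(encoded: str) -> str:
    """Swap back to the Base64 alphabet, decode, then inflate."""
    data = base64.b64decode(encoded.translate(TO_BASE64))
    # decompressobj leaves any bytes after the end of the stream in unused_data
    return zlib.decompressobj(-zlib.MAX_WBITS).decompress(data).decode("utf-8")


print(plantuml_decode("SyfFKj2rKt3CoKnELR1Io4ZDoSa70000"))
print(plantuml_decode(plantuml_encode("Alice -> Bob : hi")))
```

Translating `SyfFKj2rKt3CoKnELR1Io4ZDoSa70000` to the Base64 alphabet gives `c8pPUtC1U3DMyUxOVbBSyEjNyckHAAAA`, which is the payload of the `-zlib-` example URL earlier in the thread plus the zero padding.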

arnaudroques commented 4 years ago

It has taken a very long time, but we have updated the documentation :-) https://plantuml.com/en/text-encoding

Finally, we have chosen ~ as the header for formats other than Deflate. Currently, only hex format is officially supported, using ~h as header (For example http://www.plantuml.com/plantuml/uml/~h407374617274756d6c0a416c6963652d3e426f62203a204920616d207573696e67206865780a40656e64756d6c )
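The ~h format is simple enough to produce from almost any language. A minimal Python sketch follows; the function names are invented for this example, and only the `~h` header and the hex payload come from the comment above.

```python
def to_hex_segment(text: str) -> str:
    """UTF-8 bytes written as lowercase hex, prefixed with the ~h header."""
    return "~h" + text.encode("utf-8").hex()


def from_hex_segment(segment: str) -> str:
    """Inverse of to_hex_segment."""
    if not segment.startswith("~h"):
        raise ValueError("missing ~h header")
    return bytes.fromhex(segment[2:]).decode("utf-8")


diagram = "@startuml\nAlice->Bob : I am using hex\n@enduml"
segment = to_hex_segment(diagram)
print("http://www.plantuml.com/plantuml/uml/" + segment)
print(from_hex_segment(segment) == diagram)
```

For this diagram text the generated segment matches the one in the example URL above. Hex doubles the size of the input, so it trades compactness for ease of implementation.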

We are working on brotli compression right now (the header will probably be ~b ). For brotli, we are not sure whether we should use standard Base64 or PlantUML transformation. We could also use ~d for deflate with standard Base64 instead of current PlantUML transformation.

Does it sound good to you?

ggrossetie commented 3 years ago

Hello!

I'm the creator of kroki.io, a service that provides a unified API on top of popular diagrams libraries including PlantUML.

The Kroki API is using deflate + base64 but I also support the "legacy" encoding using the following code:

String text = URLDecoder.decode(source, "UTF-8");
try {
  Transcoder transcoder = TranscoderUtil.getDefaultTranscoder();
  text = transcoder.decode(text);
} catch (ArrayIndexOutOfBoundsException | IOException e) {
  // Unable to decode with the PlantUML decoder, try the default decoder
  text = DiagramSource.decode(text);
}
return text;

The above code is still working but System.err.println statements were added in the latest versions and now the output is really verbose:

java.io.IOException: java.util.zip.DataFormatException: invalid stored block lengths
    at net.sourceforge.plantuml.code.CompressionZlib.tryDecompress(CompressionZlib.java:130)
    at net.sourceforge.plantuml.code.CompressionZlib.decompress(CompressionZlib.java:92)
    at net.sourceforge.plantuml.code.TranscoderImpl.decode(TranscoderImpl.java:83)
    at net.sourceforge.plantuml.code.TranscoderSmart.decode(TranscoderSmart.java:60)
    at io.kroki.server.decode.DiagramSource.unsafePlantumlDecode(DiagramSource.java:48)
    at io.kroki.server.decode.DiagramSource.plantumlDecode(DiagramSource.java:38)
    at io.kroki.server.service.Plantuml$1.decode(Plantuml.java:100)
    // stacktrace continues...
Cannot decode string
Not Huffman
Cannot decode string

As far as I know, PlantUML does not use a logging library or slf4j so I cannot suppress the errors/warnings. What is the best approach to try to decode the text using the legacy format without all this noise? Since we are trying to guess the encoding I'm not sure that it's a good idea to print the exception to stderr because it's expected that the decode method will throw an exception (if we guessed wrong).

I guess another solution would be to change the order in my code:

String text = URLDecoder.decode(source, "UTF-8");
try {
  // Try the default Kroki decoder
  text = DiagramSource.decode(text);
} catch (ArrayIndexOutOfBoundsException | IOException e) {
  // Unable to decode with the Kroki decoder, try the PlantUML decoder
  Transcoder transcoder = TranscoderUtil.getDefaultTranscoder();
  text = transcoder.decode(text);
}
return text;

Thanks for your help!

arnaudroques commented 3 years ago

We have taken the easiest solution: we have removed the System.err.println in last beta http://beta.plantuml.net/plantuml.jar Does it sound better to you ?

ggrossetie commented 3 years ago

We have taken the easiest solution: we have removed the System.err.println in last beta http://beta.plantuml.net/plantuml.jar Does it sound better to you ?

Sure, it sounds good, thanks Arnaud :+1:

da-kami commented 2 years ago

Note: I am replying here because this ticket is linked in the docs in relation to brotli...

Is support for brotli compression actually implemented and working? The documentation at https://plantuml.com/text-encoding states that one should be able to use it.

But the brotli link that is provided in this section

is compressed to a
    [428-char string length using Deflate](http://www.plantuml.com/plantuml/uml/ZP4zRy8m48Pt_ueJdHawLMMe53iWLK9LLLHrFk8hM04xd9rI-kiR0u8a1CAG8SdpVja-DxP0nZNCCSiNx4ghbLivXeVnU2nJ9Vo9MABLMpOXa8N09OdQFq-Racn62RFR7WnIecAMx-Ig8Zl0B3YMZZNnFVZKV5UFfRfYteEs1oNFgKed7Oftv60oKw0DzpUgYrf9gTCBudxTnDdmXck2rtM1MUY7P-QFuF6f7-nRV3ZzLctSb7YDFPlUSQstgvwG_VG42_eL0kD7-FJ4eZXFWS74i0-WLkZz0D13RMVI96UKEQkxKTb4ftZDKmaHEy3mfP4qggxqot4UQveV3DJi8UflBQqSWMAAae-utuTE3zdma2qFTJDVrQKAXXVvKPawIq9NyUnshS7Du8lbnzh75ReowH-Gx7tYISRc--JEa_i7)
    [393-char string length using Brotli](http://www.plantuml.com/plantuml/uml/06tq404I5ENsC6cXKT6xxgxaBDG3o_tzkTRbkDJuRa4mYLIoIEFVZsapwhAr5NDHB0jrZfWK5MOp8y53KKy_J2adzUr-HCAJ8bVfEA7x6qMwXhNtcUJYCT4gMZV_c2gzJk9gimqo81bOfXLN-tkYpiaWi3aabF_wrItuxPLX5NINL6FKhAboWjmXbI8jiBfIRnXs0h40re09D-HpekC83iDO8GEXFHTCMoPlXmKCx05mLK_fTdsZCJY1geDzQhs6aar6qIIfU1V5QYHQ5wvIFj8v6xZE0zWeZksK6S5mgnjDR9L1ao9uZg7-gFLepB1EeGVeUILgETgbkEtuAelaPmmGV7kEDjEioz1d6D_0WvvYzGbScBwP8CnA9ZosEQ2vGlBk_7W00)

does not seem to be correct.

I was experimenting with getting brotli compression working, but unsuccessful so far. Is there a working example somewhere?

maksugr commented 1 year ago

@da-kami I met the same issue and have found this thread. tldr; brotli is turned off.