orbitalquark / textadept

Textadept is a fast, minimalist, and remarkably extensible cross-platform text editor for programmers.
https://orbitalquark.github.io/textadept
MIT License

buffer:set_encoding('UTF-8') causes marking buffer as changed #484

Open oOosys opened 8 months ago

oOosys commented 8 months ago

To ensure the right encoding, I added buffer:set_encoding('UTF-8') to my onBufferChange() handler, which is triggered by

events.connect(events.LEXER_LOADED, onBufferChange)
events.connect(events.BUFFER_AFTER_SWITCH, onBufferChange)

The side effect was that the buffer was marked as if its content had changed: to close the buffer I had to either confirm closing without saving or save it, even though its content was actually unchanged.

Is it a bug or a feature?

orbitalquark commented 8 months ago

Encodings are tricky. The entire buffer is replaced when there is an encoding change. I'm not positive that a byte-for-byte comparison can determine equality after an encoding change. Therefore, I'm not sure if this is a bug or not.
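If the spurious change marker is a problem in practice, a guarded call should avoid it. This is only a sketch and untested; it assumes the documented `buffer.encoding` string, the read-only Scintilla `buffer.modify` flag, and Scintilla's `buffer:set_save_point()`:

```lua
-- Hypothetical guarded handler: only convert when the encoding actually
-- differs, and clear the "modified" flag if the buffer was clean before.
local function onBufferChange()
  if buffer.encoding ~= 'UTF-8' then
    local was_clean = not buffer.modify  -- read-only Scintilla property
    buffer:set_encoding('UTF-8')
    if was_clean then buffer:set_save_point() end  -- unmark as changed
  end
end
```

The early `buffer.encoding` check alone would already skip the no-op conversion in the common case.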

oOosys commented 8 months ago

@orbitalquark : If I understand correctly, changing the encoding does not change the text data the buffer represents, so there is no need for any before/after comparison. Encoding only changes how the buffer is rendered to the display, so there is no need to save the unchanged buffer text after an encoding change, right? And the change marker is there to indicate changes to the buffer's text data, not changes to how the buffer is rendered to the screen, right? In that case I can't see any reason to mark the buffer as changed, requiring the changes to be saved, after changing the encoding. Textadept does not add a BOM in front of UTF-8 data, does it? And the buffer's encoding is known, so if the requested encoding is the same as the current one, there is no change at all. The encoding is there to affect how the underlying text is shown on screen and how the buffer text is changed on user input. So deciding the encoding on buffer load has no impact on the underlying buffer text data, right?

orbitalquark commented 8 months ago

Sometimes what you say is true, but not always. Changing the encoding can change how the file is written on disk (in fact, sometimes you want to change a file's encoding on disk). Take UTF-16BE vs. UTF-16LE. The BOM (byte order mark) written at the beginning of the file dictates how the bytes are written to disk, and how they should be interpreted when read back.
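For example, the same one-character text "A" (U+0041) produces different bytes on disk depending on the byte order, BOM included. A rough illustration in Lua, where the decimal escapes spell out the raw bytes:

```lua
-- "A" (U+0041) with a BOM, in each UTF-16 byte order:
local utf16le = "\255\254\65\0"  -- FF FE (LE BOM), then 41 00
local utf16be = "\254\255\0\65"  -- FE FF (BE BOM), then 00 41
assert(utf16le ~= utf16be)  -- same text, different bytes on disk
```

So converting a buffer between these encodings genuinely changes the data that will be written to the file, not just its on-screen rendering.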

Like I said, encodings are tricky. It's probably an error to assume one thing or another.

oOosys commented 8 months ago

OK ... sorry for my wrong assumptions. I see that my naive understanding of encoding was completely wrong (not just slightly): changing the encoding actually changes the byte content of the buffer, which is then written to the file. Thanks for that. No idea how my brain came up with what I stated, since I generally knew all of this before.

What probably remains true is that the buffer stores its own encoding, so it should be possible to detect that the requested "new" encoding is exactly the same as the current one, in which case nothing needs to be done to the buffer's data content to adjust it. Knowing this, it should be possible for me to find where in the code a requested encoding is not checked against the current one before being applied.

Anyway, how can I generally ensure that loaded files are interpreted as UTF-8? I don't use any encoding other than UTF-8 and ASCII (only the lower range up to 7F, which is then equivalent to UTF-8), so choosing UTF-8 covers all my needs.

Maybe it is a bad idea to handle encoding conversion inside the text editor instead of delegating it to an external converter that provides the editor with its preferred encoding. Covering all possible encodings alone could extremely inflate the amount of required code. The text editor's purpose is text editing, isn't it? Not converting between all the possible encodings out there. I see in the menu that only CP1252, ASCII, UTF-8, and UTF-16 are offered. Why not go for UTF-8 only and provide converters as add-on sugar? It would reduce the complexity of the application and avoid running into problems like the one I just bumped into.

orbitalquark commented 8 months ago

No problem. As I mentioned before, encodings are hard...

Textadept assumes UTF-8 by default, but tries to detect other common encodings like CP1252 and UTF-16 via io.encodings (https://orbitalquark.github.io/textadept/api.html#io.encodings). If Textadept cannot detect an encoding, it will inelegantly notify the user with a "Conversion failed" error. So if your file opened without error, it's highly likely Textadept correctly identified its encoding (most likely UTF-8). If all you use is UTF-8, you shouldn't have to worry about anything.
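If you truly only ever deal with UTF-8 (and ASCII, which is a subset of it), you could restrict detection in your init.lua by trimming io.encodings. A sketch, assuming io.encodings is a plain list of encoding names tried in order, as the linked API documentation describes:

```lua
-- In ~/.textadept/init.lua: only ever try UTF-8 when detecting a file's
-- encoding. Files in other encodings would then fail with "Conversion
-- failed" instead of being silently opened as CP1252 or UTF-16.
io.encodings = {'UTF-8'}
```

With that in place there is no need to call buffer:set_encoding() from an event handler at all.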

Hopefully this answers any questions you had.