Encoding problems - Githubissues

gregopet commented 1 year ago

Hi, first of all thanks for the plugin :)

I have a problem with piping non-ASCII text into it: the non-ASCII characters come out as the Unicode replacement character (the question mark that stands in for unknown characters).

For example, if I select the text Čufti, Čežana in Štrudl and pipe them through cat, out comes ��ufti, ��e��ana in ��trudl

profbear commented 1 year ago

Oh wow my first issue! I'll look into that.

Seems like you're trying to pass stdin into cat, and after the response echoes UTF-something, my plugin isn't handling UTF characters correctly.

I'll dig into it. Thanks!

profbear commented 1 year ago

It works on my machine, but that's a terrible excuse.

Can you tell me more about your environment? Are you using a JetBrains product on Windows, Linux, or even passing strings through a WSL version of bash?

You are using bash, right? I'm asking because I guess you could also have downloaded the source code and changed the executable that Pipeprofen is running to execute a shell command.

I'm using Ubuntu 22.04, and bash --version reports: GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

gregopet commented 1 year ago

Hey, sure: I'm using IntelliJ on Linux (Arch), my login shell is zsh (zsh 5.9 (x86_64-pc-linux-gnu)), though the plugin says it's using bash, which I have at version 5.1.16(1)-release (x86_64-pc-linux-gnu)

I can't remember what exactly I was doing to trigger the bug in the first place, but to reproduce I can just:

1. open one of my Java / Kotlin source files
2. type a word containing a non-ASCII character, say tür, enseñar or hiša, and select it
3. invoke pipeprofen on the selection and use cat as the command
4. my text will get replaced with t��r, ense��ar and hi��a
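
One pattern that produces exactly two replacement characters per accented letter is UTF-8 bytes being decoded as US-ASCII: ü, ñ and š each take two bytes in UTF-8, and Java's ASCII decoder replaces every byte above 0x7F with U+FFFD. The sketch below only illustrates that mechanism. GarbleDemo and roundTrip are hypothetical names, and nothing here is taken from pipeprofen's source.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbleDemo {
    // Encode the text as UTF-8 (what a UTF-8 shell would emit),
    // then decode those bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this file independent of the compiler's source encoding.
        String[] words = {
            "t\u00fcr",      // tür
            "ense\u00f1ar",  // enseñar
            "hi\u0161a"      // hiša
        };
        for (String word : words) {
            String asAscii = roundTrip(word, StandardCharsets.US_ASCII);
            String asUtf8 = roundTrip(word, StandardCharsets.UTF_8);
            System.out.println(asAscii + " (decoded as US-ASCII) vs " + asUtf8 + " (decoded as UTF-8)");
        }
    }
}
```

If this is what happens in the plugin, one thing worth checking is whether the IDE's JVM ends up with a non-UTF-8 default charset (for example when it is launched without a UTF-8 locale). That is a guess to verify, not a diagnosis.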

IntelliJ says my file encoding is UTF-8 (at the bottom right), and if I open a YAML file that should handle such characters, the result is the same.

I did a further test: I created a file on my disk with a non-ASCII character in its name and ran ls through pipeprofen, and the output had the invalid characters, so the problem doesn't seem to be with the input interpretation?

If I go into my shell, those characters work just fine (e.g. echo mañana | cat will produce mañana) and in general pretty much all programs handle these characters ok (one notable exception is LibreOffice which loses them on copy/paste but only on my work machine :roll_eyes: ).

profbear commented 1 year ago

I like that test you ran to help us narrow down which side of the plugin has the bug.

Definitely feels like the output side, somewhere between where 1. stdout is read back into the plugin's memory, and 2. the characters are fed back into the editor's window display buffer thing that renders character output.
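
A minimal sketch of step 1, assuming the plugin launches the command as a child process and reads its stdout as raw bytes. PipeSketch and pipeThrough are hypothetical names, not pipeprofen's actual code. The point it shows is that the bytes are decoded with an explicitly named charset (UTF-8 here, itself an assumption about what the shell emits) instead of whatever the JVM's default happens to be.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PipeSketch {
    // Feed `input` to the command's stdin and return its stdout as a String.
    // Writing everything before reading is fine for short selections; large
    // inputs would need separate reader/writer threads to avoid a pipe deadlock.
    static String pipeThrough(String input, String... command) {
        try {
            Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input.getBytes(StandardCharsets.UTF_8));
            }
            byte[] raw;
            try (InputStream stdout = process.getInputStream()) {
                raw = stdout.readAllBytes(); // Java 9+
            }
            process.waitFor();
            // Decode with an explicit charset rather than the platform default.
            return new String(raw, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }

    public static void main(String[] args) {
        // Assumes a `cat` executable is on the PATH, as in the reports above.
        System.out.println(pipeThrough("ma\u00f1ana", "cat"));
    }
}
```

Step 2, handing the resulting String to the editor, works on already-decoded characters, so if the String is correct at this point the damage would have to happen later.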

Lots of magic I need to dispel. Gotta shake the cobwebs off the dev env, too.

Curious!

profbear commented 1 year ago

Note to self: it would be nice if there were a call, after getting character output from stdout, to detect whether the stream is 8-bit chars or 16-bit wide characters. And even if the characters come back as a stream of bytes, UTF-8 dictates that any byte with the high bit set is part of a multi-byte sequence encoding a non-ASCII character.
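
One way to approximate such a check in Java is a strict UTF-8 decode that reports malformed input instead of silently substituting U+FFFD. This is a sketch, not a general encoding detector: it can only say whether the bytes are well-formed UTF-8, and Utf8Check / isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form well-formed UTF-8 (pure ASCII also passes).
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of replacing
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "hi\u0161a".getBytes(StandardCharsets.UTF_8);      // hiša as UTF-8
        byte[] latin1 = "t\u00fcr".getBytes(StandardCharsets.ISO_8859_1); // tür as Latin-1
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

A failed check would point at the command's output not being UTF-8 at all; a passing check combined with garbled text in the editor would point at the decoding step on the plugin side.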

Good gravy I have homework to do.

profbear commented 5 months ago

Unfortunately, I'm not able to reproduce this yet. I'm sorry! I'm trying!

profbear / pipeprofen

Encoding problems #4