pulsar-edit / pulsar

A Community-led Hyper-Hackable Text Editor
https://pulsar-edit.dev
Other
3.24k stars 137 forks source link

Rework command parsing to support `--package` better #1054

Open savetheclocktower opened 2 months ago

savetheclocktower commented 2 months ago

This is a concept for how we'd rewrite argument parsing to make --package work better and fix the issues detailed in #1053.

I haven't tried any of this on Windows yet — nor have I written new specs or tried to run existing specs — but this passes some sanity tests on my macOS machine:

If I take this out of draft, it means that I've confirmed that these techniques fix our Windows issues, ensured existing specs pass, and written new specs for --package behavior.

savetheclocktower commented 2 months ago

I spent 45 minutes getting this working on Windows and it's driving me nuts.

First: stdio: 'inherit' doesn't work any better on Windows when we call spawn instead of spawnSync, so instead I've switched it to stdio: 'pipe' and hooked up the pipes. This works great! We get output streaming and everything.

Now there's one more headache: character encoding.

With this PR checked out and ATOM_DEV_RESOURCE_PATH set locally…

$env:ATOM_DEV_RESOURCE_PATH = "C:\Path\To\Pulsar"

I can run pulsar --dev --package list in Windows Terminal and see all the pretty tree-like output of that command get mangled. Intuitively, I understand why this happens: all Windows terminals and shells inexplicably (well, probably for backwards compatibility) use Code page 437. Anyone can change their terminal's encoding to UTF-8 via chcp 65001, but we can't make our users do that.

The headaches only start when you realize that ppm list works just fine! All of our weird line-drawing characters are written to stdout as intended. It's only when we add the layer of indirection via child_process.spawn — or spawnSync, or exec, or cross-spawn — I just tried all of them! — that things go sideways.

The one thing I've done that's actually solved this is something that I think is just a bit too clever: alter pulsar.cmd so that it remembers the user's current character set, calls chcp 65001 at the top of the script, then changes it back to the original value just before we exit. This would be a great solution if we could guarantee that the cleanup code would run no matter what — but we can't, because batch files don't really have an equivalent for trapping SIGINT like shell scripts can.

So here are some options, in no particular order:

  1. Literally figure out what's going wrong with encodings somewhere in the Node standard library and arrive at the magic configuration that will avoid this issue.
  2. Reverse-engineer the mojibake to find a way to undo the bad conversion. (Might be tricky because we'd have to know the ambient character encoding of the shell, I'd imagine.)
  3. Chase down one of the search results I didn't feel like investigating to see if there's a safe way to run something in a subshell in Windows such that you can temporarily change the encoding and then change it back.
  4. Rewrite the output of ppm list to use only the characters that are truly portable across common terminal character encodings.

Other suggestions are welcome.

savetheclocktower commented 2 months ago

I did some experiments that clarified my thinking here.

My initial theory was that both pulsar and ppm were able to output UTF-8 properly into a CP437 console, but that the combination of the two — shelling to ppm from inside pulsar — was screwing things up.

I tested this with a small Node project with various scripts that log the text ├── to stdout:

I could not get the encoding bug to manifest in any of these scenarios.

Running out of ideas, I started calling console.log("├──") or echo ├── at pretty much every step in the process: .cmd files, JS files, and so on. My new theory is that Node handles this correctly (magically ensuring that its UTF-8 output is rendered properly in a CP437 console), but Electron does not. If I insert console.log("├──") somewhere in start.js, the encoding is wrong in the output, even if I never shell out to ppm.

That said, I'm failing in my vague attempts to determine why this is true. If there were some explicit trick that Node were using, it would likely be visible in the Node source code, but I'm not seeing it. Maybe we'll get lucky and find out that this has been fixed in a future version of Electron, but for now I can think of only two ways to work around the problem:

  1. Figure out a way of temporarily changing the encoding of the terminal to UTF-8 for the duration of pulsar.cmd — but a way that won't leak side-effects no matter what.
  2. Change the symbols emitted by ppm to symbols that are safe to rely on even if we don't know the encoding of the terminal.

Option 2 is easy enough (if unsatisfying), but I think we can manage option 1.

It's true (as I said above) that there's no easy way to trap a Windows shell's equivalent of SIGINT. If pulsar.cmd calls chcp 65001 before it calls Pulsar.exe, then we need a way to run some code after that's done, no matter what happens with that call — even when it errors or the user cancels it with Ctrl+C.

But there are theoretical ways around this, depending on how ridiculous you're willing to get! I'll try two of the approaches described in this Stack Overflow answer and let you know how it goes.

savetheclocktower commented 2 months ago

OK, it's even weirder.

Let me start over and state a hypothesis:

Node for Windows (at least as of v16, which ppm uses) has some magic way of ensuring that either (a) what it logs to stdout (which is generally assumed to be UTF-8 in all cases) has its encoding converted to match the default CP437 encoding of a Windows terminal; or (b) the terminal itself is temporarily changed to a UTF-8 code page, then reliably switched back after a script exits.

Electron does not have this magic — either because it's lacking altogether, or because there's an extra layer of abstraction that gets in the way of the magic behavior.

Let's take the character as an example. If my terminal is expecting UTF-8, I can call console.log("├") and everything works. But if it's expecting CP437, I instead see Γö£. That's what you'd expect if a terminal were interpreting UTF-8 text as CP437: instead of one character, you get three.

Why three? Because the UTF-8 code point for is 0x251C. That becomes the raw bytes e2, 94, and 9C via an algorithm I don't understand. But if those three bytes are interpreted as separate characters (as they would be if no conversion were taking place), they'd be the characters Γö£.

Again, all this is exactly what you'd expect in this situation. Node's behavior is the unusual part! Either it's doing behind-the-scenes conversion of characters or it's telling the terminal to use UTF-8 (but only for the duration of a given command).

When we run pulsar and ask it to write to STDOUT, what we're getting from it is basically main-process logging. On a hunch, I started a new Electron boilerplate project and did console.log("├──"); in the entry point script; it output Γö£ΓöÇΓöÇ into my terminal.

I had thought that our last-ditch option here would be to make sure we were emitting only characters that were mentioned in this guide, but this guide isn't what we need, because it assumes that encoding conversions are being done correctly. The box-drawing symbols we're using for ppm list do exist in CP437, but at different code points. (Obvious in retrospect, since the code points used for the UTF-8 versions are multibyte and thus can't exist in CP437.)

Now I think our last-ditch option is to set an environment variable that indicates a Windows user's code page. If it's not 65001, ppm list could detect that and respond with plaintext.


An aside on the batch file exception thing: as expected, it was more complicated than it seemed. The best I could do was to copy this approach, but the annoying "Terminate batch job? (Y/N)" prompt was my undoing. This approach means that the prompt is shown twice in a row, and the cleanup code only happens if you answer "Y" to the first prompt and "N" to the second.

savetheclocktower commented 2 months ago

Anyway, a couple more thoughts: