mattgodbolt / owlet-editor

A modern BBC BASIC editor inspired by the BBC Micro Bot (https://bbcmicrobot.com)
https://bbcmic.ro

display hex placeholders for 0x80-0xFF #42

Closed bazzargh closed 3 years ago

bazzargh commented 4 years ago

cosmetic enhancement: when you shrink code, you see a mess of accented characters and blanks in the editor. A bunch of the byte tokens for keywords fall into the C1 control character block, so they just show up as spaces. It makes it hard to tell what's going on (and I know, I'm reading the compressed code - I shouldn't really care).

Making the spans that contain the tokens have title attributes with the expanded keyword would be ideal, since then mouseover would reveal what is going on. However, this doesn't seem possible with monaco. Another solution does work, though:

@font-face {
  font-family: 'Unprintable';
  src: local('Unicode BMP Fallback SIL');
  unicode-range: U+80-9F, U+A1-FF;
}

This declares a fallback font, 'Unprintable', for the high-byte characters. The SIL BMP fallback font displays a codepoint's hexadecimal value instead of its normal glyph: https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=unicodebmpfallbackfont ... the full font covers the whole basic multilingual plane; subsetting it to just the codepoints used would be ideal. Here I'm using a local font, but the font would have to be imported for most users. A0 is excluded, since I see the editor uses non-breaking spaces to display normal spaces in the code. Then just use Unprintable as the first font, so it replaces that character range:

.monaco-editor {
    font-family: Unprintable,-apple-system,BlinkMacSystemFont,Segoe WPC,Segoe UI,HelveticaNeue-Light,system-ui,Ubuntu,Droid Sans,sans-serif;
    --monaco-monospace-font: Unprintable,"SF Mono",Monaco,Menlo,Consolas,"Ubuntu Mono","Liberation Mono","DejaVu Sans Mono","Courier New",monospace;
}
ojwb commented 4 years ago

For the unprintables this makes sense.

This would also be helpful for bytes 0-9, 11-31 and 127, which you'll find in some programs. Byte 10 is a newline to both the bot and the editor. Byte 13 is problematic as it's the end-of-line marker in tokenised BASIC, but all the others worked in my experiments.

I'm not so sure about U+A1-FF - it seems a bit arbitrary that they would show as hex but U+100 and up wouldn't (the bot chops off the higher bits, so e.g. U+1A1 -> byte &A1). Twitter seems to strip at least some of U+80-9F, so in non-base2048 tweets you'll generally see those as U+180-U+19F, or as higher codepoints with the desired bottom 8 bits. Base2048 can only encode 0-255, so there you'll see U+80-U+9F (but they're protected from Twitter by the encoding).

bazzargh commented 4 years ago

I'm a bit unclear on how you are getting U+1A1... I thought the programs were just byte strings (with only ASCII printable, since that's all the BBC supported) and that for the purpose of display they were being treated as Latin-1 for the high bytes (so just Unicode U+00-FF, UTF-8 00-C3BF). I haven't looked at the code, though, so... I'm probably wrong :)

My reasoning for only doing U+80-FF was that I'd read http://www.benryves.com/bin/bbcbasic/manual/Appendix_Tokeniser.htm#:~:text=BBC%20BASIC%20programs%20are%20internally,the%20single%20byte%20value%20%26E1. and my understanding was that tokenisation would only result in bytes 0x21-0xFF. 0x21-0x7F are already printable, and I was assuming users didn't type other control characters into programs before tokenisation. I was also trying to represent bytes rather than tokens - that BMP font can't treat 4-byte tokens as single glyphs anyway.

ojwb commented 4 years ago

As I said above - the bot ignores any bits above the bottom 8, so if the tweet contains U+01A1 then the bot feeds byte &A1 to the emulator (likewise for U+02A1, U+03A1, etc.). The editor can load input from a tweet, so it's helpful if it handles the tricks used in those tweets well.

The bot actually works from UTF-16 data, so at some point surrogate pairs enter the picture, but again the bot just strips away the upper bits from the code units it gets as it feeds them in; you just end up with two bytes from a Unicode character that needs a surrogate pair in UTF-16. (Twitter counts higher codepoints double when enforcing its length limit, so this sadly doesn't offer a way to cram in any extra bytes of code...)
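To make the behaviour described above concrete, here's a hedged sketch (the function name is mine, not the bot's code) of masking each UTF-16 code unit to its bottom 8 bits. U+01A1 becomes byte 0xA1, and a character that needs a surrogate pair contributes two bytes:

```javascript
// Illustrative sketch of the bot's byte extraction as described above:
// walk the string's UTF-16 code units and keep only the bottom 8 bits
// of each. Not the bot's actual code.
function tweetToBytes(text) {
    const bytes = [];
    for (let i = 0; i < text.length; i++) {
        bytes.push(text.charCodeAt(i) & 0xff); // strip bits above the bottom 8
    }
    return bytes;
}
```

So `tweetToBytes("\u01A1")` gives `[0xA1]`, while U+1F600 (surrogate pair D83D DE00) gives `[0x3D, 0x00]` - two bytes from one character, as described.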

> I was assuming users didn't type other control characters into programs before tokenization

Certain users do. You can use it to cram a load of VDU, PLOT, GCOL, COLOUR, etc. with constant arguments into a single, much more compact PRINT statement. Another trick which results in the lower control bytes in tweeted programs is <CALL token><PAGE token>+9:<binary machine code>.

ojwb commented 3 years ago

Here's a concrete example which stuffs binary data into a REM at the start of the program and also shows using codepoints above U+FF for tokens such as MOD: https://twitter.com/rheolism/status/1334151830066974721

mattgodbolt commented 3 years ago

@ojwb we're starting to realise there could be a difference between what we show in the editor and what we send to twitter. Lots of the random 0x1XX characters we send to twitter show up as non-constant-width characters in the editor, which is a little odd. We're experimenting with trying to show something consistent for tokens, like @bazzargh suggests. Then we can have a separate encode/decode from twitter to minimise the number of bytes. Still sketching ideas though!

mattgodbolt commented 3 years ago

Annoyingly, some of the token values in the range 0x80-0xA0-ish are interpreted as special by monaco, so even with the fallback they don't display. We can definitely do something cool here though; I'm looking at hacking a custom TTF for now. As a starting point, showing 0x1YY for token YY might be a stopgap.

mattgodbolt commented 3 years ago

I have something mostly working, but it needs hackery. Unfortunately the twitter encoding as-is confuses monaco, as it uses some codes that are whitespace in monaco's eyes (e.g. \u2000-\u200d etc.). So I think we need to "decode" twitter pastes etc. into their raw bytes, then re-encode for monaco (0x1YY or similar; maybe something more specific in the private Unicode space). THEN when we tweet we of course always encode using @ojwb's cunning rules to map each byte to its smallest tweetable value.
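One way the "decode then re-encode" step could look, purely as an illustration (these names, and the choice of the private use area starting at U+E000 rather than 0x1YY, are my assumptions, not the editor's code): bytes monaco would mishandle get shifted into the PUA, and printable ASCII stays as-is.

```javascript
// Hypothetical sketch: map raw program bytes to codepoints monaco
// renders safely, and back. Printable ASCII and newline pass through;
// everything else (tokens, C0/C1 controls) goes to U+E000 + byte.
const PUA_BASE = 0xe000;

function byteToEditorChar(byte) {
    const printable = byte >= 0x20 && byte < 0x7f;
    if (printable || byte === 0x0a) return String.fromCharCode(byte);
    return String.fromCharCode(PUA_BASE + byte); // e.g. 0x8D -> U+E08D
}

function editorCharToByte(ch) {
    const code = ch.charCodeAt(0);
    if (code >= PUA_BASE && code < PUA_BASE + 0x100) return code - PUA_BASE;
    return code & 0xff;
}
```

The round trip is lossless for any byte, and a custom font (or the SIL fallback idea above) can then give the PUA range distinctive glyphs without monaco treating anything as whitespace.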

mattgodbolt commented 3 years ago

Work in progress: [screenshot attached: "Screenshot from 2020-12-05 10-37-30"]

ojwb commented 3 years ago

That's definitely better than invisible characters.

I've been vaguely wondering if the editor might actually do better to use tokenised BASIC as its internal form. Tokenisation is mostly line-based (the only exception I can think of is whether we're in assembler), so only the current line would need reprocessing on each update, and line length is limited to 251 bytes. This would sidestep problems such as a tokenised line being too long once expanded, and the expanded form not retokenising without inserting spaces in some cases (e.g. A=BORK works without spaces if OR is already a token, but expanding and retokenising it would treat BORK as a variable).
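The round-trip hazard above can be shown with a toy tokeniser (hypothetical and far simpler than the real BBC BASIC one, with just the OR token 0x84): detokenising A=B&lt;OR&gt;K gives "A=BORK", but retokenising that text treats BORK as one variable name, so the round trip fails.

```javascript
// Toy sketch of the detokenise/retokenise hazard. Only handles the OR
// token (0x84) and matches keywords only at identifier boundaries, the
// way retokenising plain text would.
const KEYWORDS = { OR: 0x84 };
const TOKENS = { 0x84: "OR" };

function detokenise(bytes) {
    return bytes.map(b => TOKENS[b] ?? String.fromCharCode(b)).join("");
}

function tokenise(line) {
    const bytes = [];
    let i = 0;
    while (i < line.length) {
        const prev = i > 0 ? line[i - 1] : "";
        const atBoundary = !/[A-Za-z0-9]/.test(prev); // not mid-identifier
        const kw = atBoundary &&
            Object.keys(KEYWORDS).find(k => line.startsWith(k, i));
        if (kw) { bytes.push(KEYWORDS[kw]); i += kw.length; }
        else { bytes.push(line.charCodeAt(i)); i++; }
    }
    return bytes;
}
```

Here `detokenise([0x41, 0x3d, 0x42, 0x84, 0x4b])` gives `"A=BORK"`, but `tokenise("A=BORK")` contains no 0x84 byte: the OR is swallowed by the variable name, which is exactly why keeping the tokenised form internally sidesteps the problem.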

Using the tokenised form internally doesn't solve all problems (e.g. TOP vs TO+P is still an issue in the tokenised state), and it doesn't tell you if there are syntax errors that only get detected and reported at runtime.

(I guess the code could be executed in the background and the emulator memory monitored to detect errors, though that means a rather slow reporting of errors if there's a REP.U.TI.>3000 before them. Inherently this sort of thing is harder for an interpreted language compared to a compiled one.)