mattgodbolt / owlet-editor

A modern BBC BASIC editor inspired by the BBC Micro Bot (https://bbcmicrobot.com)
https://bbcmic.ro

Implement "Rheolism" method to pass tokenized code through Twitter #23

Closed (8bitkick closed this 3 years ago)

8bitkick commented 3 years ago

I haven't checked whether the 'official' BBC BASIC byte tokens we generate fail when parsed by Twitter, but Rheolism did a fair amount of testing and discovered that certain byte tokens need ANDing (doh) ORing with a high byte to make it through unscathed.

As shown in https://github.com/8bitkick/BBCMicroBot/blob/7bf7435d70cde7892d69c12bf5b934b0ee068ee2/tools/bbcbasictokenise#L17 by @ojwb

This may also fix issues with some bytes not being rendered as characters in the editor as he also reported

ojwb commented 3 years ago

(ORing not ANDing...)

Yes - from my tests some of the control characters U+0080 to U+009F get eaten by Twitter (and even if any work, they likely won't be visible, which makes reliable cut and paste a pain). I don't have a record of exactly which are problematic, as I concluded this sort of thing may change over time and it was simpler just to avoid them all.

The exact token values I use in that script aren't important - each one just needs some higher bit or bits set so that it's a "normal" Unicode character (not a control code or some sort of modifier, and ideally not an RTL character, as that makes cut and paste more of a pain) but has the right value when the top bits are stripped. I initially tried to choose "pleasing" characters for some common tokens (e.g. Έ for STEP because it looks a bit like some steps), but these are almost all pretty strained as you only get a small number of options. Having tried that, I'd just OR each value with 0x100 (| 0x100) and that'll do the job.
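As a rough illustration of the "| 0x100" idea, here is a minimal Python sketch. The function names `to_tweet_safe` and `from_tweet_safe` are hypothetical and are not taken from the bbcbasictokenise script or from owlet-editor. It assumes that printable ASCII is passed through unchanged and that the decoder simply keeps the low 8 bits of each code point.

```python
def to_tweet_safe(data: bytes) -> str:
    """Keep printable ASCII as-is; map every other byte b to chr(b | 0x100)."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else chr(b | 0x100)
        for b in data
    )


def from_tweet_safe(text: str) -> bytes:
    """Recover the bytes by stripping everything above the low 8 bits."""
    return bytes(ord(ch) & 0xFF for ch in text)


# STEP is token &88. ORing with 0x100 gives U+0188, while the hand-picked
# Greek letter U+0388 has different top bits but the same low byte.
step = bytes([0x88])
encoded = to_tweet_safe(step)
print(f"U+{ord(encoded):04X}")            # U+0188
print(from_tweet_safe(encoded) == step)   # True
print(from_tweet_safe("\u0388") == step)  # True
```

Because decoding only looks at the low byte, any character in the U+01xx, U+02xx, U+03xx ranges and so on with the same low byte decodes to the same token, which is why the particular choice of character is cosmetic.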

> This may also fix issues with some bytes not being rendered as characters in the editor as he also reported

Essentially the same issue, but for U+0000 to U+001F - the test case is:

MO.4:F.R=2TOLEN"ÿ@WƗƇÃÃááðàðàðƀ<lfºƙªfĀB":?8=R?7180:?9=R:F.X=1TO9:MOVE(X*2OR1A.R)*40,R*40+?8MOD2*@%:P."ąęƙĀĀĜĀ":!6=!6/2:N.,:V.1

Paste this in and click "expand" and the Ā in the first string (which is U+0100 for &00) turns into a <?>, and the last string turns into 3 <?> (that string is the VDU codes for V.5:PL.153,0,28, so bytes 5,25,153,0,0,28,0). The program with the <?> still works in the emulator though, so the correct bytes must still be in there.
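The byte values quoted for the last string can be checked with a few lines of Python. This is only a worked example: it assumes the usual BBC Micro encoding of PLOT k,x,y as VDU 25,k followed by x and y as 16-bit little-endian values, and it writes the string with escapes so the code points are unambiguous.

```python
import struct

# VDU 5, then PLOT 153,0,28, which is VDU 25,153 followed by x and y
# as 16-bit little-endian values.
vdu_bytes = bytes([5, 25, 153]) + struct.pack("<HH", 0, 28)

# The last string from the test case, written with escapes: ą ę ƙ Ā Ā Ĝ Ā
tweet_string = "\u0105\u0119\u0199\u0100\u0100\u011c\u0100"

# Keeping only the low 8 bits of each code point gives the VDU bytes back.
decoded = bytes(ord(ch) & 0xFF for ch in tweet_string)

print(list(decoded))         # [5, 25, 153, 0, 0, 28, 0]
print(decoded == vdu_bytes)  # True
```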

I think ideally "expand" should add 0x100 to U+0000 to U+001F and U+007F to U+009F (both inclusive) but leave existing printable Unicode characters in strings alone, as that's a fairly clean way to avoid Twitter doing hashtags, cashtags, linkifying and other nonsense (e.g. P."ŀbbcmicrobot" "escapes" an @ without using any extra characters). If that's hard to achieve, then just doing the "add 0x100" part to everything after dropping the top bits is better than the current behaviour of stripping the top bits and leaving unprintables.
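The two behaviours described above could be sketched in Python roughly as follows. This is not owlet-editor's actual "expand" code: the function names are hypothetical, and the sketch treats its input as the contents of a single string, so it makes no special case for the newlines that separate program lines.

```python
def needs_escape(cp: int) -> bool:
    """True for U+0000..U+001F and U+007F..U+009F (both inclusive)."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def expand_char(ch: str) -> str:
    """Add 0x100 to an unprintable code point; leave anything else alone."""
    cp = ord(ch)
    return chr(cp + 0x100) if needs_escape(cp) else ch


def expand_preserving(text: str) -> str:
    """Preferred behaviour: only unprintables move, other characters stay."""
    return "".join(expand_char(ch) for ch in text)


def expand_fallback(text: str) -> str:
    """Fallback: drop the top bits first, then add 0x100 to unprintables."""
    return "".join(expand_char(chr(ord(ch) & 0xFF)) for ch in text)


sample = "\u0005\u0140bbcmicrobot"  # a raw &05 byte, then the escaped '@'
print(expand_preserving(sample) == "\u0105\u0140bbcmicrobot")  # True
print(expand_fallback(sample) == "\u0105@bbcmicrobot")         # True
```

The fallback still yields the right bytes, but it turns the escaped ŀ back into a plain @, which is what the preferred behaviour is meant to avoid.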