Implement remaining encodings

kddnewton commented 10 months ago

Encodings are sets of configuration that describe how bytes map to codepoints. Some of the more well-known encodings include US-ASCII, UTF-8, and Shift_JIS.

Ruby source files can be encoded in 90 different encodings, depending on the presence of a # encoding: xxx magic comment at the top of the source file. Encoding support is necessary to properly parse identifiers, constants, and string content. As such, we need to support all of the encodings that CRuby supports.

Fortunately, we don't need as much information as CRuby does for each encoding. We only need:

whether or not a codepoint is alphabetical
whether or not a codepoint is alphanumeric
whether or not a codepoint is uppercase

For single-byte encodings (i.e., encodings that represent every codepoint with at most one byte) we put all of their configuration into src/enc/pm_tables.c. This file contains a set of lookup tables, one per encoding. Because every codepoint fits in a single byte, each table is an array of 256 uint8_t integers. Each integer is a bitmap containing the three bits of information listed above. For an example PR on how to add one of these encodings, see #1851.
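As a rough illustration of the table layout described above, here is a minimal sketch in Ruby rather than C. The flag names, the bit positions, and the `byte_flag?` helper are assumptions made for this example, not the identifiers used in src/enc/pm_tables.c, and the table is only filled in for US-ASCII.

```ruby
# Hypothetical flag bits, one per property listed above. The names and bit
# positions are illustrative, not the ones used in src/enc/pm_tables.c.
FLAG_ALPHABETIC   = 1 << 0
FLAG_ALPHANUMERIC = 1 << 1
FLAG_UPPERCASE    = 1 << 2

# One entry per possible byte value (0..255), filled in here for US-ASCII.
US_ASCII_TABLE = Array.new(256) do |byte|
  case byte
  when 0x41..0x5A then FLAG_ALPHABETIC | FLAG_ALPHANUMERIC | FLAG_UPPERCASE # A-Z
  when 0x61..0x7A then FLAG_ALPHABETIC | FLAG_ALPHANUMERIC                  # a-z
  when 0x30..0x39 then FLAG_ALPHANUMERIC                                    # 0-9
  else 0
  end
end.freeze

# Each query is a single table lookup followed by a bit test.
def byte_flag?(table, byte, flag)
  (table[byte] & flag) != 0
end
```

Adding another single-byte encoding in this style would mostly mean filling in the upper half of the table (bytes 0x80 to 0xFF) from that encoding's code chart.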

For multi-byte encodings (i.e., encodings that can represent codepoints with more than one byte) we put them into their own files under src/enc/ because they require more logic. For the most part it involves writing the pm_encoding_*_char_width function to decode the next codepoint from the given uint8_t slice. For an example PR on how to add one of these encodings, see #1844.
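To make the char-width idea concrete, here is a small sketch in Ruby rather than C, using EUC-KR as the example. It assumes the usual EUC-KR layout: bytes below 0x80 are single-byte ASCII, and two-byte characters use a lead byte and a trail byte that both fall in 0xA1..0xFE. The function name, the `offset` parameter, and the convention of returning 0 for an invalid or truncated sequence are assumptions for this example; the byte ranges CRuby accepts should be checked against its own encoding definition.

```ruby
# Returns the width in bytes of the character starting at bytes[offset],
# or 0 if no valid character starts there (hypothetical convention).
def euc_kr_char_width(bytes, offset = 0)
  first = bytes[offset]
  return 0 if first.nil?   # nothing left to decode
  return 1 if first < 0x80 # ASCII range, one byte

  second = bytes[offset + 1]
  if (0xA1..0xFE).cover?(first) && second && (0xA1..0xFE).cover?(second)
    return 2               # lead byte followed by a valid trail byte
  end

  0                        # invalid lead byte or truncated sequence
end
```

A caller would advance through the input by the returned width, stopping (or reporting an error) when it gets 0.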

The tasks listed below are the encodings we still have to implement. Some of these will be easier than others, since some of them are single-byte. I don't actually know yet which ones are, since I'm not familiar with any of these - my personal process is to go straight to Wikipedia to figure it out. If you want to take on a particular encoding, make a comment on this issue and we'll pull it out into its own issue to track your work.

[x] Big5-HKSCS
[x] Big5-UAO
[x] CESU-8
[x] #1844
[x] CP949
[x] CP950
[x] EUC-JIS-2004
[x] EUC-KR
[x] EUC-TW
[x] Emacs-Mule
[x] GB12345
[x] GB18030
[x] GB1988
[x] GB2312
[x] #1856
[x] IBM864
[x] IBM865
[x] #1866
[x] IBM869
[x] KOI8-U
[x] MacJapanese
[x] SJIS-DoCoMo
[x] SJIS-KDDI
[x] SJIS-SoftBank
[x] TIS-620
[x] UTF8-DoCoMo
[x] UTF8-KDDI
[x] UTF8-SoftBank
[x] #1857
[x] eucJP-ms
[x] macCentEuro
[x] macCroatian
[x] macCyrillic
[x] #1850
[x] macIceland
[x] #1863
[x] macRomania
[x] #1862
[x] #1855
[x] #1859
[x] stateless-ISO-2022-JP
[x] stateless-ISO-2022-JP-KDDI

duerst commented 10 months ago

@kddnewton This may be the wrong place to raise a more fundamental question. If there's a better issue for it, please tell me.

Ruby itself has all the necessary information about encodings. Why all this work? It looks like a lot of unnecessary duplication to me. If there's something in Ruby itself that could be easily exposed to help, then let's do that.

pcai commented 10 months ago

I'd like to try Windows-874. I looked at a few with funny names like UTF8-SoftBank, but I can't find anything authoritative other than a few scattered references in CRuby itself 🤔

pcai commented 10 months ago

A note for anyone else who tries this: the tests may fail if you don't account for aliases, see Encoding.aliases or ENC_ALIAS
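For anyone who wants to see which aliases exist before adding an encoding, a quick sketch using `Encoding.aliases`, which returns a hash from alias to canonical name. The variable names are just for illustration, and the exact set of aliases depends on the Ruby version and platform.

```ruby
# Encoding.aliases maps alias => canonical name; invert it so that each
# canonical name lists all of its aliases.
aliases_by_name = Encoding.aliases
  .group_by { |_alias_name, name| name }
  .transform_values { |pairs| pairs.map(&:first).sort }

# Print the aliases that a new encoding would also need to be reachable by.
aliases_by_name.sort.each do |name, alias_names|
  puts "#{name}: #{alias_names.join(', ')}"
end
```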

kcdragon commented 10 months ago

I'll take a look at IBM863

eregon commented 10 months ago

These are single-byte encodings:

$ truffleruby -e 'pp Encoding.list.select(&:ascii_compatible?).sort_by(&:name).select { |e| Truffle::CExt.rb_enc_mbmaxlen(e) == 1 }'      
[#<Encoding:ASCII-8BIT>,
 #<Encoding:CP850>,
 #<Encoding:CP852>,
 #<Encoding:CP855>,
 #<Encoding:GB1988>,
 #<Encoding:IBM437>,
 #<Encoding:IBM720>,
 #<Encoding:IBM737>,
 #<Encoding:IBM775>,
 #<Encoding:IBM852>,
 #<Encoding:IBM855>,
 #<Encoding:IBM857>,
 #<Encoding:IBM860>,
 #<Encoding:IBM861>,
 #<Encoding:IBM862>,
 #<Encoding:IBM863>,
 #<Encoding:IBM864>,
 #<Encoding:IBM865>,
 #<Encoding:IBM866>,
 #<Encoding:IBM869>,
 #<Encoding:ISO-8859-1> ... to ... #<Encoding:ISO-8859-16>,
 #<Encoding:KOI8-R>,
 #<Encoding:KOI8-U>,
 #<Encoding:TIS-620>,
 #<Encoding:US-ASCII>,
 #<Encoding:Windows-1250> ... to ... #<Encoding:Windows-1258>,
 #<Encoding:Windows-874>,
 #<Encoding:macCentEuro>,
 #<Encoding:macCroatian>,
 #<Encoding:macCyrillic>,
 #<Encoding:macGreek>,
 #<Encoding:macIceland>,
 #<Encoding:macRoman>,
 #<Encoding:macRomania>,
 #<Encoding:macThai>,
 #<Encoding:macTurkish>,
 #<Encoding:macUkraine>]

And these are multi-byte encodings:

$ truffleruby -e 'pp Encoding.list.select(&:ascii_compatible?).sort_by(&:name).select { |e| Truffle::CExt.rb_enc_mbmaxlen(e) > 1 }' 
[#<Encoding:Big5>,
 #<Encoding:Big5-HKSCS>,
 #<Encoding:Big5-UAO>,
 #<Encoding:CESU-8>,
 #<Encoding:CP51932>,
 #<Encoding:CP949>,
 #<Encoding:CP950>,
 #<Encoding:CP951>,
 #<Encoding:EUC-JIS-2004>,
 #<Encoding:EUC-JP>,
 #<Encoding:EUC-KR>,
 #<Encoding:EUC-TW>,
 #<Encoding:Emacs-Mule>,
 #<Encoding:GB12345>,
 #<Encoding:GB18030>,
 #<Encoding:GB2312>,
 #<Encoding:GBK>,
 #<Encoding:MacJapanese>,
 #<Encoding:SJIS-DoCoMo>,
 #<Encoding:SJIS-KDDI>,
 #<Encoding:SJIS-SoftBank>,
 #<Encoding:Shift_JIS>,
 #<Encoding:UTF-8>,
 #<Encoding:UTF8-DoCoMo>,
 #<Encoding:UTF8-KDDI>,
 #<Encoding:UTF8-MAC>,
 #<Encoding:UTF8-SoftBank>,
 #<Encoding:Windows-31J>,
 #<Encoding:eucJP-ms>,
 #<Encoding:stateless-ISO-2022-JP>,
 #<Encoding:stateless-ISO-2022-JP-KDDI>]

kddnewton commented 10 months ago

@duerst this parser runs outside the context of Ruby and is embedded into many other projects, so we don't have access to Ruby APIs when running. It's a completely standalone project.

I actually just sent you an email yesterday, while I have you, would you be able to check out the code in this snippet? https://github.com/ruby/prism/blob/f0f057b055c7d15c490ef1e9cd91ca6702a04d14/test/prism/encoding_test.rb#L187-L204. It looks like CRuby folds uppercase characters to lowercase to determine if a codepoint is the start of a constant, but there are just a couple of codepoints in two encodings that are very confusingly reporting themselves as lowercase but then also changing when folded.

kddnewton commented 10 months ago

@kcdragon looks like Maple grabbed up IBM863, want to try one of the other IBMs?

kcdragon commented 10 months ago

> @kcdragon looks like Maple grabbed up IBM863, want to try one of the other IBMs?

Sure, I'll take a look at IBM864.

faraazahmad commented 10 months ago

Can I take up IBM866?

kddnewton commented 10 months ago

@faraazahmad absolutely!

thomasmarshall commented 10 months ago

Can I pick up macCentEuro?

orhantoy commented 10 months ago

I'll open a PR for macCyrillic shortly.

duerst commented 10 months ago

> @duerst this parser runs outside the context of Ruby and is embedded into many other projects, so we don't have access to Ruby APIs when running. It's a completely standalone project.

A Ruby parser outside the context of Ruby sounds a bit strange. With a bit of work, we could easily have found a way to reuse the transcoding work done in CRuby, rather than doing lots of duplicate work.

> I actually just sent you an email yesterday,

Sorry, I must have missed your mail.

> while I have you, would you be able to check out the code in this snippet? https://github.com/ruby/prism/blob/f0f057b055c7d15c490ef1e9cd91ca6702a04d14/test/prism/encoding_test.rb#L187-L204

U+01c5, LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON, starts with a capital letter and therefore definitely should be okay for class names. The same goes for U+01c8, U+01cb, and U+01f2. The rest are Greek uppercase letters with adscript (PROSGEGRAMMENI) IOTA, again a very similar case. Essentially, there's nothing wrong with them starting a class name. Also, you say "I have reported this bug upstream." Can you give a pointer?
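A short sketch of why such a codepoint can look confusing, using only Ruby's Unicode property classes in regular expressions. This says nothing about how CRuby's parser classifies the character; it only shows that U+01C5 is in the Unicode titlecase category (Lt), not Lu or Ll, and that lowercasing still changes it.

```ruby
dz_title = "\u01C5" # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

is_titlecase = dz_title.match?(/\p{Lt}/) # general category Lt
is_uppercase = dz_title.match?(/\p{Lu}/) # general category Lu
is_lowercase = dz_title.match?(/\p{Ll}/) # general category Ll

# Case mapping to lowercase yields a different codepoint (U+01C6).
lowered = dz_title.downcase
changed_by_downcase = lowered != dz_title
```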

> It looks like CRuby folds uppercase characters to lowercase to determine if a codepoint is the start of a constant, but there are just a couple of codepoints in two encodings that are very confusingly reporting themselves as lowercase but then also changing when folded.

Please be careful with the term "fold". Unicode distinguishes between simple and full case folding, see https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt for details. As to the codepoints in question, I guess that would be "ß" (German sz) and "ς" (Greek final sigma). These are indeed lowercase. But with full case folding, they map to "ss" (double s) and "σ" (non-final Greek sigma). See also the discussion of the :fold option in doc/case_mapping.rdoc in the CRuby source.

Please feel free to ask additional questions.

davidwessman commented 10 months ago

I will try MacCroatian today 😊

Edit: https://github.com/ruby/prism/pull/1880

duerst commented 10 months ago

@kddnewton

> Please be careful with the term "fold". Unicode distinguishes between simple and full case folding, see https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt for details. As to the codepoints in question, I guess that would be "ß" (German sz) and "ς" (Greek final sigma). These are indeed lowercase. But with full case folding, they map to "ss" (double s) and "σ" (non-final Greek sigma). See also the discussion of the :fold option in doc/case_mapping.rdoc in the CRuby source.

Sorry, I wasn't careful enough. Unicode distinguishes between case mapping, simple case folding, and full case folding. Ruby doesn't support simple case folding, because we don't have to keep string lengths constant when doing folding. The :fold option on String#downcase indicates full case folding; without it, it's just case mapping, which doesn't change lowercase letters.
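The difference between case mapping and full case folding can be seen directly with String#downcase. This is a minimal sketch using "ß" only; the variable names are illustrative.

```ruby
sharp_s = "\u00DF" # "ß", already a lowercase letter

mapped = sharp_s.downcase        # case mapping leaves lowercase letters alone
folded = sharp_s.downcase(:fold) # full case folding, which may change the length

puts mapped # ß
puts folded # ss
```

So a check of the form "does this character change when folded?" is not the same question as "is this character uppercase?".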

davidwessman commented 10 months ago

https://github.com/ruby/prism/pull/1881 I think the Mac Roman encoding table was missing some data, and therefore a lot of the subsequent ones are wrong as well. Would be grateful for some feedback in case we misunderstood how it works.

derekcmoore commented 10 months ago

I am taking a look at IBM865.

eregon commented 10 months ago

> A Ruby parser outside the context of Ruby sounds a bit strange. With a bit of work, we could easily have found a way to reuse the transcoding work done in CRuby, rather than doing lots of duplicate work.

@duerst Prism works as a gem, can be used in JavaScript/WASM, and can be called from Java (for JRuby & TruffleRuby). Not all of those use cases have access to C functions for codepoint characteristics in those various encodings. Also, the knowledge about encodings that Prism needs seems pretty minimal, so it's not much work, mostly generating tables from existing knowledge.