kddnewton closed this issue 9 months ago
@kddnewton This may be the wrong issue to discuss a more fundamental issue here. If there's a better issue, please tell me.
Ruby itself has all the necessary information about encodings. Why all this work? It looks like a lot of unnecessary duplication to me. If there's something in Ruby itself that could be easily exposed to help, then let's do that.
I'd like to try Windows-874, I looked at a few with funny names like UTF8-SoftBank but can't find anything authoritative other than a few scattered references in cruby itself 🤔
A note for anyone else who tries this: the tests may fail if you don't account for aliases, see Encoding.aliases or ENC_ALIAS
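As a small illustration of the alias note above, here is a sketch that groups Encoding.aliases by canonical name and checks that an alias and its canonical name resolve to the same encoding. The variable names are made up for this example, and CP437 is used only because it is a well-known alias of IBM437 in CRuby; other implementations may list different aliases.

```ruby
# Encoding.aliases returns a Hash of alias name => canonical name.
# Invert it so each canonical name maps to all of its aliases.
aliases_by_name = Hash.new { |hash, key| hash[key] = [] }
Encoding.aliases.each do |alias_name, canonical_name|
  aliases_by_name[canonical_name] << alias_name
end

# Encoding.find accepts either spelling and returns the same object.
ibm437 = Encoding.find("IBM437")
cp437 = Encoding.find("CP437")

puts ibm437.equal?(cp437)
puts aliases_by_name["IBM437"].inspect
```

A test that iterates over names rather than Encoding objects would need to treat every entry in aliases_by_name[name] as the same encoding.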
I'll take a look at IBM863
These are single-byte encodings:
$ truffleruby -e 'pp Encoding.list.select(&:ascii_compatible?).sort_by(&:name).select { |e| Truffle::CExt.rb_enc_mbmaxlen(e) == 1 }'
[#<Encoding:ASCII-8BIT>,
#<Encoding:CP850>,
#<Encoding:CP852>,
#<Encoding:CP855>,
#<Encoding:GB1988>,
#<Encoding:IBM437>,
#<Encoding:IBM720>,
#<Encoding:IBM737>,
#<Encoding:IBM775>,
#<Encoding:IBM852>,
#<Encoding:IBM855>,
#<Encoding:IBM857>,
#<Encoding:IBM860>,
#<Encoding:IBM861>,
#<Encoding:IBM862>,
#<Encoding:IBM863>,
#<Encoding:IBM864>,
#<Encoding:IBM865>,
#<Encoding:IBM866>,
#<Encoding:IBM869>,
#<Encoding:ISO-8859-1> ... to ... #<Encoding:ISO-8859-16>,
#<Encoding:KOI8-R>,
#<Encoding:KOI8-U>,
#<Encoding:TIS-620>,
#<Encoding:US-ASCII>,
#<Encoding:Windows-1250> ... to ... #<Encoding:Windows-1258>,
#<Encoding:Windows-874>,
#<Encoding:macCentEuro>,
#<Encoding:macCroatian>,
#<Encoding:macCyrillic>,
#<Encoding:macGreek>,
#<Encoding:macIceland>,
#<Encoding:macRoman>,
#<Encoding:macRomania>,
#<Encoding:macThai>,
#<Encoding:macTurkish>,
#<Encoding:macUkraine>]
And these are multi-byte encodings:
$ truffleruby -e 'pp Encoding.list.select(&:ascii_compatible?).sort_by(&:name).select { |e| Truffle::CExt.rb_enc_mbmaxlen(e) > 1 }'
[#<Encoding:Big5>,
#<Encoding:Big5-HKSCS>,
#<Encoding:Big5-UAO>,
#<Encoding:CESU-8>,
#<Encoding:CP51932>,
#<Encoding:CP949>,
#<Encoding:CP950>,
#<Encoding:CP951>,
#<Encoding:EUC-JIS-2004>,
#<Encoding:EUC-JP>,
#<Encoding:EUC-KR>,
#<Encoding:EUC-TW>,
#<Encoding:Emacs-Mule>,
#<Encoding:GB12345>,
#<Encoding:GB18030>,
#<Encoding:GB2312>,
#<Encoding:GBK>,
#<Encoding:MacJapanese>,
#<Encoding:SJIS-DoCoMo>,
#<Encoding:SJIS-KDDI>,
#<Encoding:SJIS-SoftBank>,
#<Encoding:Shift_JIS>,
#<Encoding:UTF-8>,
#<Encoding:UTF8-DoCoMo>,
#<Encoding:UTF8-KDDI>,
#<Encoding:UTF8-MAC>,
#<Encoding:UTF8-SoftBank>,
#<Encoding:Windows-31J>,
#<Encoding:eucJP-ms>,
#<Encoding:stateless-ISO-2022-JP>,
#<Encoding:stateless-ISO-2022-JP-KDDI>]
@duerst this parser runs outside the context of Ruby and is embedded into many other projects, so we don't have access to Ruby APIs when running. It's a completely standalone project.
I actually just sent you an email yesterday, while I have you, would you be able to check out the code in this snippet? https://github.com/ruby/prism/blob/f0f057b055c7d15c490ef1e9cd91ca6702a04d14/test/prism/encoding_test.rb#L187-L204. It looks like CRuby folds uppercase characters to lowercase to determine if a codepoint is the start of a constant, but there are just a couple of codepoints in two encodings that are very confusingly reporting themselves as lowercase but then also changing when folded.
@kcdragon looks like Maple grabbed up IBM863, want to try one of the other IBMs?
> @kcdragon looks like Maple grabbed up IBM863, want to try one of the other IBMs?
Sure, I'll take a look at IBM864.
Can I take up IBM866?
@faraazahmad absolutely!
Can I pick up macCentEuro?
I'll open a PR for macCyrillic shortly.
> @duerst this parser runs outside the context of Ruby and is embedded into many other projects, so we don't have access to Ruby APIs when running. It's a completely standalone project.
A Ruby parser outside the context of Ruby sounds a bit strange. With a bit of work, we could easily have found a way to reuse the transcoding work done in CRuby, rather than doing lots of duplicate work.
> I actually just sent you an email yesterday,
Sorry, I must have missed your mail.
> while I have you, would you be able to check out the code in this snippet? https://github.com/ruby/prism/blob/f0f057b055c7d15c490ef1e9cd91ca6702a04d14/test/prism/encoding_test.rb#L187-L204
U+01c5, LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON, starts with a capital letter and therefore definitely should be okay for class names. Similar for 0x01c8, 0x01cb, and 0x01f2. The rest are Greek uppercase letters with adscript (PROSGEGRAMMENI) IOTA, again a very similar case. Essentially, there's nothing wrong with them starting a class name. Also, you say "I have reported this bug upstream.". Can you give a pointer?
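For anyone who wants to poke at these four Latin codepoints from Ruby, here is a minimal sketch. It only inspects the Unicode general category and the result of String#downcase; it assumes nothing about how CRuby's parser actually classifies constants.

```ruby
# The four Latin digraph codepoints mentioned above.
codepoints = [0x01C5, 0x01C8, 0x01CB, 0x01F2]

codepoints.each do |codepoint|
  char = [codepoint].pack("U")        # build a one-character UTF-8 string
  titlecase = char.match?(/\p{Lt}/)   # general category Lt (titlecase letter)
  changes = char.downcase != char     # does downcasing give a different string?
  printf("U+%04X titlecase=%s changes_on_downcase=%s\n", codepoint, titlecase, changes)
end
```

All four are in general category Lt, which is neither Lu nor Ll, and each has a distinct lowercase mapping.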
> It looks like CRuby folds uppercase characters to lowercase to determine if a codepoint is the start of a constant, but there are just a couple of codepoints in two encodings that are very confusingly reporting themselves as lowercase but then also changing when folded.
Please be careful with the term "fold". Unicode distinguishes between simple and full case folding, see https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt for details. As to the codepoints in question, I guess that would be "ß" (German sz) and "ς" (Greek final sigma). These are indeed lowercase. But with full case folding, they map to "ss" (double s) and "σ" (non-final Greek sigma). See also the discussion of the :fold option in doc/case_mapping.rdoc in the CRuby source.
Please feel free to ask additional questions.
I will try MacCroatian today 😊
@kddnewton
> Please be careful with the term "fold". Unicode distinguishes between simple and full case folding, see https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt for details. As to the codepoints in question, I guess that would be "ß" (German sz) and "ς" (Greek final sigma). These are indeed lowercase. But with full case folding, they map to "ss" (double s) and "σ" (non-final Greek sigma). See also the discussion of the :fold option in doc/case_mapping.rdoc in the CRuby source.
Sorry, I wasn't careful enough. Unicode distinguishes between case mapping, simple case folding, and full case folding. Ruby doesn't support simple case folding, because we don't have to keep string lengths constant when doing folding. The :fold option on String#downcase indicates full case folding; without it, it's just case mapping, which doesn't change lowercase letters.
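The difference between case mapping and full case folding can be seen directly with String#downcase. This is a small sketch using the two characters named above; the variable names are just for illustration.

```ruby
sharp_s = "\u00DF"      # ß, LATIN SMALL LETTER SHARP S
final_sigma = "\u03C2"  # ς, GREEK SMALL LETTER FINAL SIGMA

# Plain case mapping leaves letters that are already lowercase untouched.
puts sharp_s.downcase == sharp_s
puts final_sigma.downcase == final_sigma

# Full case folding may map them to a different string, even a longer one.
puts sharp_s.downcase(:fold)
puts final_sigma.downcase(:fold)
```

So a codepoint can be lowercase and still change under folding, and the folded result is not guaranteed to have the same length as the input.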
https://github.com/ruby/prism/pull/1881 I think the Mac Roman encoding table was missing some data, and therefore a lot of the subsequent ones are wrong as well. Would be grateful for some feedback in case we misunderstood how it works.
I am taking a look at IBM865.
> A Ruby parser outside the context of Ruby sounds a bit strange. With a bit of work, we could easily have found a way to reuse the transcoding work done in CRuby, rather than doing lots of duplicate work.
@duerst Prism works as a gem, can be used in JavaScript/WASM, and can be called from Java (for JRuby & TruffleRuby); not all use cases have access to C functions for codepoint characteristics for those various encodings. Also the knowledge about encodings seems pretty minimal so it's not much work, mostly generating tables from existing knowledge.
I will take a look today at IBM869 :)
Thanks @derekcmoore and @lynne-ashminov!
I would like to work on TIS-620 in case it's still available.
@thdaraujo please do!
I'll take a shot at KOI8-U
Thanks @heyogrady!
Attempting MacJapanese, but not sure if I've found the right source of truth. https://github.com/ruby/prism/pull/1935
Gonna attempt CP950
Is EUC-KR still available? It appears checked off but there isn't a PR for it.
@yngx I handled it in a PR I'm about to put up - sorry to disappoint!
Encodings are sets of configuration that describe how bytes map to codepoints. Some of the more well-known encodings include US-ASCII, UTF-8, and Shift_JIS.
Ruby source files can be encoded in 90 different encodings, depending on the presence of a # encoding: xxx magic comment at the top of the source file. Encoding support is necessary to properly parse identifiers, constants, and string content. As such, we need to support all of the encodings that CRuby supports.
Fortunately, we don't need as much information as CRuby does for each encoding. We only need:
For single-byte encodings (i.e., encodings that represent every codepoint with at most one byte) we put all of their configuration into src/enc/pm_tables.c. This file contains sets of lookup tables for each encoding. Because they're all limited to a single byte, we use arrays of uint8_t integers of length 256. Each integer is a bitmap containing the three bits of information listed above. For an example PR on how to add one of these encodings, see #1851.
For multi-byte encodings (i.e., encodings that can represent codepoints with more than one byte) we put them into their own files under src/enc/ because they require more logic. For the most part it involves writing the pm_encoding_*_char_width function to decode the next codepoint from the given uint8_t slice. For an example PR on how to add one of these encodings, see #1844.
The tasks listed below are the encodings we still have to implement. Some of these will be easier than others, since some of them are single byte. I don't actually know yet which ones are since I'm not familiar with any of these - my personal process is to go straight to Wikipedia to figure it out. If you want to take on a particular encoding, make a comment on this issue and we'll pull it out into its own issue to track your work.
Big5-HKSCS
Big5-UAO
CESU-8
CP949
CP950
EUC-JIS-2004
EUC-KR
EUC-TW
Emacs-Mule
GB12345
GB18030
GB1988
GB2312
IBM864
IBM865
IBM869
KOI8-U
MacJapanese
SJIS-DoCoMo
SJIS-KDDI
SJIS-SoftBank
TIS-620
UTF8-DoCoMo
UTF8-KDDI
UTF8-SoftBank
eucJP-ms
macCentEuro
macCroatian
macCyrillic
macIceland
macRomania
stateless-ISO-2022-JP
stateless-ISO-2022-JP-KDDI
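To make the single-byte table layout from the issue description more concrete, here is a rough Ruby sketch of a 256-entry lookup table where each entry is a small bitmap. It is not the code in src/enc/pm_tables.c: the flag names and bit positions are assumptions made for this example (alphabetic, alphanumeric, uppercase), and the table covers plain US-ASCII only.

```ruby
# Hypothetical flag bits; the real names and values live in the Prism sources.
alphabetic_bit   = 1 << 0
alphanumeric_bit = 1 << 1
uppercase_bit    = 1 << 2

# One entry per possible byte value (0..255), like a uint8_t[256] in C.
ascii_table = Array.new(256) do |byte|
  upper = byte.between?(0x41, 0x5A)  # A-Z
  lower = byte.between?(0x61, 0x7A)  # a-z
  digit = byte.between?(0x30, 0x39)  # 0-9

  flags = 0
  flags |= alphabetic_bit if upper || lower
  flags |= alphanumeric_bit if upper || lower || digit
  flags |= uppercase_bit if upper
  flags
end.freeze

# Each query is a single array index plus a bit test.
alphabetic   = ->(table, byte) { (table[byte] & alphabetic_bit) != 0 }
alphanumeric = ->(table, byte) { (table[byte] & alphanumeric_bit) != 0 }
uppercase    = ->(table, byte) { (table[byte] & uppercase_bit) != 0 }

puts alphabetic.call(ascii_table, "q".ord)
puts uppercase.call(ascii_table, "Q".ord)
puts alphanumeric.call(ascii_table, "7".ord)
```

Supporting another single-byte encoding in this style would mostly mean filling in the upper half of the table (bytes 0x80 to 0xFF) from that encoding's character chart.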