lifthrasiir / rust-encoding

Character encoding support for Rust
MIT License

Charset request: ArmSCII-8 #105

Closed 17dec closed 7 years ago

17dec commented 7 years ago

Would it be possible to add support for the ArmSCII-8 encoding? Ref: https://manned.org/armscii-8 and https://en.wikipedia.org/wiki/ArmSCII

I had a quick look to see if I could add this myself, as it's just a single-byte encoding. But seeing how all current codecs are autogenerated from the WHATWG specs, I'm a bit lost as to the best approach to implementing a custom codec. I'd be happy to provide a PR if I have some guidance on the next steps to take.

lifthrasiir commented 7 years ago

Sorry for the late reply. I think it is not hard to add, provided that there is a canonical Unicode mapping for that encoding. Once you have a mapping, you can make use of the common single-byte encoding framework. If you are working outside of Encoding, you have to implement the entire Encoding interface yourself, but that is not very hard to do.

My main concern about this encoding is that it has many variants, with questionable current usage, so it might not be feasible to support. The IANA charset registry also omits them, possibly for this reason. I would appreciate any up-to-date information you have on that.

17dec commented 7 years ago

Unfortunately I'm no expert on ArmSCII-8 or its usage in practice. Except that the Linux man-pages project, which maintains the man page I linked, used to encode that document itself in armscii-8. Nowadays they provide the man pages in UTF-8, but alas, I also need to be able to decode older versions.

While I don't have much of a spec to go on, could the character mappings in something like GNU libiconv count as a reference implementation?

Regarding the implementation, the single-byte encoding framework only requires an implementation of forward() and backward(), right? That does seem simple enough, although an optimized backward() may require some more effort, which I doubt will be worth anyone's time.

lifthrasiir commented 7 years ago

> Unfortunately I'm no expert on ArmSCII-8 or its usage in practice. Except that the Linux man-pages project, which maintains the man page I linked, used to encode that document itself in armscii-8. Nowadays they provide the man pages in UTF-8, but alas, I also need to be able to decode older versions.

So far this is what I'm most concerned about. Wikipedia lists at least three different versions of ArmSCII, with possibly more revisions. The same is true of many CJK encodings, but there I roughly know what's going on, and the WHATWG Encoding standard gives a strong opinion backed by very significant applications, i.e. web browsers. For ArmSCII, I have no such precedent.

> While I don't have much of a spec to go on, could the character mappings in something like GNU libiconv count as a reference implementation?

Is this what actual ArmSCII users use today? If so, I guess it's fine.

> Regarding the implementation, the single-byte encoding framework only requires an implementation of forward() and backward(), right? That does seem simple enough, although an optimized backward() may require some more effort, which I doubt will be worth anyone's time.

Yes (if you are modifying Encoding). backward() can simply be implemented as a match as well; the autogenerated code mostly exists to avoid binary bloat (a match over a wide range tends to produce a larger binary).
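For illustration, here is a minimal sketch of forward() and backward() written as plain match expressions. It is self-contained and does not use the crate; the signatures, the 0xffff and 0 sentinels for "no mapping", and the decode_bytes/encode_str helpers are assumptions of this sketch, not the crate's confirmed API. The four mapping rows are placeholders and should be replaced with rows from whichever ArmSCII-8 table is settled on.

```rust
/// Sentinel returned by `forward` when a byte has no mapping
/// (an assumption of this sketch).
const UNMAPPED: u16 = 0xffff;

/// High byte -> Unicode code point. Placeholder rows only,
/// not a vetted ArmSCII-8 table.
fn forward(code: u8) -> u16 {
    match code {
        0xb2 => 0x0531, // ARMENIAN CAPITAL LETTER AYB
        0xb3 => 0x0561, // ARMENIAN SMALL LETTER AYB
        0xb4 => 0x0532, // ARMENIAN CAPITAL LETTER BEN
        0xb5 => 0x0562, // ARMENIAN SMALL LETTER BEN
        _ => UNMAPPED,
    }
}

/// Unicode code point -> high byte, or 0 when there is no mapping
/// (again an assumption of this sketch).
fn backward(code: u32) -> u8 {
    match code {
        0x0531 => 0xb2,
        0x0561 => 0xb3,
        0x0532 => 0xb4,
        0x0562 => 0xb5,
        _ => 0,
    }
}

/// Hypothetical helper: decode bytes, passing ASCII through unchanged.
/// Returns None on the first unmapped byte.
fn decode_bytes(bytes: &[u8]) -> Option<String> {
    let mut out = String::with_capacity(bytes.len());
    for &b in bytes {
        let cp = if b < 0x80 {
            b as u32
        } else {
            match forward(b) {
                UNMAPPED => return None,
                cp => cp as u32,
            }
        };
        out.push(char::from_u32(cp)?);
    }
    Some(out)
}

/// Hypothetical helper: encode a string, passing ASCII through unchanged.
/// Returns None on the first unmappable character.
fn encode_str(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let cp = c as u32;
            if cp < 0x80 {
                Some(cp as u8)
            } else {
                match backward(cp) {
                    0 => None,
                    b => Some(b),
                }
            }
        })
        .collect()
}
```

With these placeholder rows, decode_bytes(&[0x41, 0xb2]) yields "A" followed by U+0531, and encode_str maps that string back to the same two bytes. A full table written this way is simple to review against a reference mapping, at the cost of the larger binary mentioned above.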

Actually I have another idea (just in case you want to do more work :-): you can write a mapping file in the same format as the Encoding standard's indexes (the third column is ignored) and point gen_index.py at it. This sounds like the right direction for additional encodings in the future.
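To make the index format concrete, here is a rough sketch of a mapping file and a reader for it. In the Encoding standard's index files, each row is a decimal pointer, a 0x-prefixed code point, and a descriptive third column, with # starting a comment line; for single-byte encodings pointer 0 corresponds to byte 0x80. The reader is written in Rust only to show the format; it is not how gen_index.py works, parse_index is a hypothetical name, and the two sample rows are placeholders rather than a vetted ArmSCII-8 table.

```rust
/// Two placeholder rows in the Encoding standard's index layout:
/// pointer, code point, then a name column that is ignored.
const SAMPLE_INDEX: &str = "\
# Placeholder rows, not a vetted ArmSCII-8 table.
 50\t0x0531\tARMENIAN CAPITAL LETTER AYB
 51\t0x0561\tARMENIAN SMALL LETTER AYB
";

/// Hypothetical helper: parse index text into (pointer, code point) pairs.
/// Blank lines and lines starting with '#' are skipped; anything else that
/// does not parse is reported as an error with its line number.
fn parse_index(text: &str) -> Result<Vec<(u8, u32)>, String> {
    let mut rows = Vec::new();
    for (lineno, raw) in text.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut cols = line.split_whitespace();
        let pointer = cols
            .next()
            .and_then(|p| p.parse::<u8>().ok())
            .filter(|p| *p < 128) // single-byte indexes have 128 slots
            .ok_or_else(|| format!("line {}: bad pointer", lineno + 1))?;
        let code_point = cols
            .next()
            .and_then(|c| c.strip_prefix("0x"))
            .and_then(|h| u32::from_str_radix(h, 16).ok())
            .ok_or_else(|| format!("line {}: bad code point", lineno + 1))?;
        // Any remaining columns (the character name) are ignored.
        rows.push((pointer, code_point));
    }
    Ok(rows)
}

/// Byte value that a single-byte index pointer stands for.
fn pointer_to_byte(pointer: u8) -> u8 {
    0x80 + pointer
}
```

Under these assumptions, pointer 50 stands for byte 0xb2. Keeping the table in a data file like this means the forward and backward lookups can be generated rather than written by hand.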

lifthrasiir commented 7 years ago

Closed as per #106.