inexorabletash / text-encoding

Polyfill for the Encoding Living Standard's API

CP437 Encoder/Decoder #40

Closed tracker1 closed 8 years ago

tracker1 commented 8 years ago

Just thinking it would be nice to have a CP437 (IBM US-DOS Extended ASCII) converter available.

There have been a number of projects to convert ANSI/ASCII art for display in modern UTF systems... It would be a nice-to-have feature if this were available here... I'm not sure about the format used in the lib/encoding-indexes.js file... Is the array simply a mapping of position X to character code/pair Y? Does it start at byte value 1 (skipping null)? Am I correct in assuming I can go all the way from character 1 through 255? And does null mean don't map, or default map?

I'd be happy to make a PR if these questions could be clarified... possibly adding comments as to the structure of the encoding-indexes in the file itself.

inexorabletash commented 8 years ago

That's a discussion better suited to the Encoding standard:

https://encoding.spec.whatwg.org/
https://github.com/whatwg/encoding/issues

Other than the non-standard encoding capability (which is here primarily to validate the spec), I don't want to introduce anything not covered in Encoding.

(And the Encoding standard is trying to be as limited as possible - unless existing Web content demands it we are extremely unlikely to support any new encodings.)

tracker1 commented 8 years ago

@inexorabletash I understand... Would you be willing to clarify the questions regarding the structure of encoding-indexes so that I'd be able to add it in my own fork?

inexorabletash commented 8 years ago

The existing indexes are defined in the Encoding Standard, https://encoding.spec.whatwg.org/#indexes

Basically the index for an encoding is a lookup table that lets the decoder/encoder handler be something other than a huge switch statement.

The decoder handler just takes the input byte (or bytes), does math to get an index pointer, and looks up index[pointer]. For CP437 you could just have 256 entries (the 95 printable ASCII ones would have value == index), or you could be more clever and do math to avoid those 95 values. I'd keep it simple, honestly.
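As a rough illustration of the simple 256-entry approach described above, here is a minimal standalone sketch. It is not the library's actual handler code: the names `CP437_HIGH`, `cp437Index`, and `decodeCP437` are hypothetical, and the table is transcribed from the commonly published CP437 mapping, so check it against an authoritative table before relying on it. It also assumes bytes 0x00-0x7F map to the same code points (control characters rather than the DOS graphics glyphs).

```javascript
// High half of CP437 (bytes 0x80-0xFF), 16 characters per row.
// Assumption: transcribed from the commonly published mapping.
const CP437_HIGH =
  "ÇüéâäàåçêëèïîìÄÅ" +
  "ÉæÆôöòûùÿÖÜ¢£¥₧ƒ" +
  "áíóúñÑªº¿⌐¬½¼¡«»" +
  "░▒▓│┤╡╢╖╕╣║╗╝╜╛┐" +
  "└┴┬├─┼╞╟╚╔╩╦╠═╬╧" +
  "╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀" +
  "αßΓπΣσ\u00b5τΦΘΩδ∞φε∩" +
  "≡±≥≤⌠⌡÷≈°\u2219\u00b7√ⁿ²■\u00a0";

// Flat 256-entry index: cp437Index[byte] is the Unicode code point.
const cp437Index = [];
for (let b = 0; b < 0x80; b++) {
  cp437Index.push(b); // low half: value == index (assumed identity mapping)
}
for (const ch of CP437_HIGH) {
  cp437Index.push(ch.codePointAt(0));
}

// With a 256-entry table no pointer math is needed: the pointer is the byte.
function decodeCP437(bytes) {
  let out = "";
  for (const b of bytes) {
    out += String.fromCodePoint(cp437Index[b]);
  }
  return out;
}

console.log(decodeCP437(Uint8Array.of(0x48, 0x69, 0x20, 0xdb)));
```

The "more clever" variant would store only the 128 high-half entries and compute the pointer as `byte - 0x80`, which is the layout the Encoding Standard uses for its single-byte indexes.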

The encode handler does the reverse: it looks up the Unicode code point in the table (inefficiently!) and, if found, takes the corresponding index, then does math to convert that to a byte. Again, if you go with a simple 256-entry table then the index itself is the output byte.
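The reverse lookup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `indexPointerFor`, `encodeSingleByte`, and `demoIndex` are hypothetical names, and `demoIndex` fills in only bytes 0x00-0x8F for brevity. The `null` placeholders in the rest of the table stand for "no mapping" in this demo only; they are not a claim that those bytes are unmapped in CP437.

```javascript
// Linear search over the index, mirroring the "inefficient" lookup above.
// Returns the pointer, or null when the code point is not in the table.
function indexPointerFor(codePoint, index) {
  const pointer = index.indexOf(codePoint);
  return pointer === -1 ? null : pointer;
}

// Encode a string with a flat 256-entry index (index[byte] = code point).
function encodeSingleByte(str, index) {
  const bytes = [];
  for (const ch of str) {
    const codePoint = ch.codePointAt(0);
    const pointer = indexPointerFor(codePoint, index);
    if (pointer === null) {
      throw new RangeError(
        "Cannot encode U+" + codePoint.toString(16).toUpperCase()
      );
    }
    bytes.push(pointer); // 256-entry table: the pointer is the output byte
  }
  return Uint8Array.from(bytes);
}

// Demo index (truncated for illustration): ASCII plus the 0x80-0x8F row.
const demoIndex = [];
for (let b = 0; b < 0x80; b++) demoIndex.push(b);
for (const ch of "ÇüéâäàåçêëèïîìÄÅ") demoIndex.push(ch.codePointAt(0));
while (demoIndex.length < 256) demoIndex.push(null); // placeholder: no mapping

console.log(encodeSingleByte("Hi \u00e9", demoIndex));
```

If the linear search ever matters for performance, a `Map` from code point to pointer built once from the index would give constant-time lookups at the cost of a little memory.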

Hope that helps!

tracker1 commented 8 years ago

@inexorabletash thanks, will look into this from here...