tats-u / rust-oem-cp

Rust library that handles OEM code pages (e.g. CP{437,737,850}) for single byte character sets
MIT License
5 stars 3 forks source link

Struct-based API / Hide raw types of decoding tables & decoding hash maps #14

Open tats-u opened 3 months ago

tats-u commented 3 months ago

I want to switch this crate to struct-based API:

let encoder = get_cp_encoder(OEMCodePage::Cp437);
// let encoder = try_get_cp_encoder_from_number(437).unwrap(); // For those who want to pass a number directly
let cp437_bytes = encoder.encode_string_lossy("½ ± 1");

The current API looks ugly and exposes too much.

Especially [char; 128] and [Option<char>; 128] should not be exposed.

mkroening commented 3 months ago

I think we can base this on the char-based API and provide both in a nice API.

One question for clarification: Is it common to dynamically change the code page at runtime? Or would something like get_cp_encoder::<437>() be okay too?

tats-u commented 3 months ago

Is it common to dynamically change the code page at runtime?

Not just common but mandatory if you're going to develop apps for worldwide use. Only for an app for a single country (e.g. one for internal use in your company), something like CP437_ENCODER is useful.

OEM code pages (= encoding for ZIP files and the terminal) depends on the language and region in Windows; CP437 in US English and Shift_JIS (CP932; not covered by this crate) in Japanese.

tats-u commented 3 months ago

Ah I got it. the enum-based is more preferred. There's no reason to prefer to use a raw code page number (like 437 instead of OEMCodePage::Cp437). Enum can be easily completed and combined with match.

mkroening commented 3 months ago

Ah, makes sense. This is a draft of the static API I am imagining:

#[derive(Clone, Copy)]
pub struct Cp<const N: u16>(u8);

impl Cp<437> {
    pub const fn new(cp: u8) -> Option<Self>;
    pub const fn get(self) -> u8;
    pub const fn from_char(c: char) -> Option<Self>;
    pub const fn to_char(self) -> char;
}

pub type CpStr<N> = [Cp<N>];

impl CpStr<437> {
    pub const fn from_cp(bytes: &[u8]) -> Result<&Self, CpError>;
    pub const fn to_bytes(&self) -> &[u8];
}

pub struct CpString<const N: u16>(Box<CpStr<N>>);

impl CpString<437> {
    pub fn from_cp_lossy(bytes: &[u8]) -> Cow<'_, CpStr<N>>;
}

This API does not have structural differences between complete and incomplete tables: For complete tables, some operations just never fail.

If I understood you correctly, we also need to find a replacement for accessing DECODING_TABLE_CP_MAP with dynamic numbers, right? The dynamic API could look like this:

pub struct CpNumber(u16);

impl CpNumber {
    pub const fn new(n: u16) -> Option<Self>;
}

impl str {
    pub fn to_cp(&self, n: CpNumber) -> Result<&[u8], CpError>;
    pub fn to_cp_lossy(&self, n: CpNumber) -> Cow<'_, [u8]>;
}

impl [u8] {
    pub fn cp_to_str(&self, n: CpNumber) -> Result<&str, CpError>;
    pub fn cp_to_str_lossy(&self, n: CpNumber) -> Cow<'_, str>;
}

What do you think of this? Would this work for all use-cases? I can work on this and open a PR (or change the existing one) if you want.

tats-u commented 3 months ago

https://crates.io/crates/enum-map

This crate looks better than CpNumber(u16), which can't prevent the typo of the code page number by users who want to use a single code page. Also I think it allows invalid objects like let foo = CpNumber(1252); (not CpNumber::new(1252)) by a user.

Your suggested API looks suitable for one-shot or casual/informal/simple usage because the library can't easily cache the reference of a encoding/decoding table. An additional static hash map (recent code page numbers → tables) object will be required.

mkroening commented 3 months ago

This crate looks better than CpNumber(u16)

I thought we need a way to go from u16 to code page at runtime, no?

If not, we can just use enums and don't need enum-map or what exactly were you thinking of?

Also I think it allows invalid objects like let foo = CpNumber(1252); (not CpNumber::new(1252)) by a user.

This is not possible, as the inner field is not public. Creation can only happen through new.

library can't easily cache the reference of a encoding/decoding table

How about putting the reference into the number to allow reuse? This makes the struct bigger, though. I am not sure if a phf lookup in exchange for smaller structs might be beneficial here, but we can do it if you want:

pub struct CpNumber {
    encode_table: &'static Map,
    decode_table: &'static Map,
}

impl CpNumber {
    pub fn new(n: u16) -> Self {
        /// look up the table
    }
}