lifthrasiir / rust-encoding

Character encoding support for Rust
MIT License
284 stars 59 forks source link

`all::encodings()` returns an errornous list (and should be sorted alphabetically). #104

Closed getreu closed 8 years ago

getreu commented 8 years ago

The bug is in src/all.rs:

   const ENCODINGS: &'static [EncodingRef] = &[ ....

This is the way I collect the list (please confirm that it should be done this way):

   let list = all::encodings().iter().map(|&e|format!(" {}\n",e.name())).collect::<String>();

the following names do not work:

 error mac-roman mac-roman mac-cyrillic hz big5-2003 pua-mapped-binary encoder-only-utf-8

These two do work but are not listed:

  x-user-defined macintosh

PLEASE return them alphabetically sorted!

lifthrasiir commented 8 years ago

@getreu I'm confused as to what do you mean by names "not working".

My best guess is that you are trying to get encodings by encoding::label::encoding_from_whatwg_label function, which is (as the name suggests) a lookup by the names from WHATWG Encoding standard, which may have a different name from Encoding::name (which is a descriptive, arbitrary name for the inspection only). I believe the function names are clear enough that this does not need any fix.

There is no guarantee that encoding::all::encodings() will return encodings in any particular order, and I cannot see any benefit in doing so (enumerating encodings is extremely rare). If you think that this should be communicated clearly, please file a pull request with documentation added.

getreu commented 8 years ago

Maybe I am using your API not in the right way:

a) My tools prints a list with all encodings supported:

let list = all::encodings().iter().map(|&e|format!(" {}\n",e.name())).collect::<String>();

b) Then the user choses one entry from the list and enters a string enc_name with the encoding name:

encoding_from_whatwg_label(enc_name.as_str())

These encodings are listed in a) but are not understood in b)

error mac-roman mac-roman mac-cyrillic hz big5-2003 pua-mapped-binary encoder-only-utf-8
lifthrasiir commented 8 years ago

First, the primary WHATWG name (which encoding_from_whatwg_label accepts) is available from Encoding::whatwg_name (which can be absent, by the way, if it's not in the standard).

I think you are doing this in the wrong way, however. You should have a separate mapping from the name (whatever it is) to the encoding object, like HashMap<&'static str, EncodingRef> (or BTreeMap if you want sorted ones). The Encoding standard often changes the entire set of aliases, primarily to reflect what the current web browsers should do. Also I haven't explicitly mentioned this but it's my intention for all Encoding-supported encodings to have unique, descriptive and semi-stable [1] Encoding::names, so if you need some human-readable names you can at least use it as a unique identifier.

[1] That is, while I will keep a unique name for each encoding (so that you can, say, map "big5-2003" to the actual encoding mostly described by Big5-2003 and also to more human-readable name like "Traditional Chinese (Big5-2003)"), if the encoding implementation has changed substantially I may change the name without any explicit support for the prior encoding (which is generally hard). Or you can commit to whatever web browsers do by using a stable WHATWG name instead, but I guess this is not your use case.

getreu commented 8 years ago

Thanks a lot for your help! I modified my supported encodings list generation code as follows:

let list = all::encodings().iter().filter_map(|&e|e.whatwg_name()).sorted();

As you can see I sorted the list myself. I agree with you that this feature is not of interest for many people.