Closed Xenira closed 2 weeks ago
What was used to generate the hash table? Turns out -
and _
do not have the same value, just not one of the checked characters for koi8-r. Using the same value results in collisions.
Generally the code is not really maintainable in its current state.
Maybe use a macro to generate the tables? There is https://github.com/rust-phf/rust-phf or I could write something custom.
It was generated by gperf. I'll update the table as soon as I have some time. Moving to rust-phf is not an option for this crate.
I can do this as well. Now that I know how it was generated i should be able to update it.
Just out of curiosity, why would a macro (like rust-phf) not be an option here?
Just out of curiosity, why would a macro (like rust-phf) not be an option here?
rust-phf maps strings to enum variants and this crate needs to map strings to decoder function references. So if rust-phf is used an extra branch is required to map the enum variant to the decoder function reference.
rust-phf maps strings to enum variants and this crate needs to map strings to decoder function references. So if rust-phf is used an extra branch is required to map the enum variant to the decoder function reference.
Tested it and it works with DecoderFnc
.
static KEYWORDS: phf::Map<&'static [u8], DecoderFnc> = phf::phf_map! {
b"utf-8" => decoder_utf8,
b"utf-16" => decoder_utf16,
b"utf-16be" => decoder_utf16_be,
b"utf-16le" => decoder_utf16_le,
b"utf-7" => decoder_utf7,
[...]
}
and access with
pub fn charset_decoder(charset: &[u8]) -> Option<&DecoderFnc> {
KEYWORDS.get(charset)
}
Although phf
is a proc macro, I wanted to avoid an extra dependency. I just updated the perfect hash with the new mappings that treat both -
and _
as the same character for all charset names.
Thanks for updating the perfect hash.
I can respect not wanting to add a dependency. Some documentation / Adding a script for regenerating the table to the repo would be great though.
Some documentation / Adding a script for regenerating the table to the repo would be great though.
Yes, in the near future I plan to replace this with a perfect hashing macro (that will be used by all Stalwart libraries). Although phf
is great, it lacks support for streaming parsing which is required by some of the Stalwart libraries. For example, header names are hashed directly from a byte stream and, the last time I checked, phf
did not support this.
Please let me know once you've tested this change and I'll publish a new version to crates.io.
The koi-8 is working as expected. Thanks a lot!
I found 4 mails with german umlauts that don't display correctly now. Ill try to figure out if its related and let you know what I find.
Seems like those did not work correctly even before this change. Ill create a new issue for that.
Edit:
Seems like they use invalid charset de-ascii
for ISO-8859-1
or no charset at all instead using Content-Language: de
.
One other is with quoted-printable and using the html decimal code instead of utf-8. Those cases seem broken enough to ignore for now.
The koi-8 is working as expected. Thanks a lot!
Great!
I have some mails using
koi8_r
instead ofkoi8-r
for koi8 encoding.While technically wrong I guess it would be good to allow
_
instead of-
(and vice versa) for encodings to be more resilient. The hash already treats them as equal.I could create an MR if this is something you think would work.