stalwartlabs / mail-parser

Fast and robust e-mail parsing library for Rust
https://docs.rs/mail-parser/
Apache License 2.0
289 stars 39 forks source link

KOI8-R encoding with `koi8_r` instead of `koi8-r` as encoding string #85

Closed Xenira closed 2 weeks ago

Xenira commented 1 month ago

I have some mails using koi8_r instead of koi8-r for koi8 encoding.

While technically wrong I guess it would be good to allow _ instead of - (and vice versa) for encodings to be more resilient. The hash already treats them as equal.

I could create an MR if this is something you think would work.

Xenira commented 3 weeks ago

What was used to generate the hash table? Turns out - and _ do not have the same value, just not one of the checked characters for koi8-r. Using the same value results in collisions.

Generally the code is not really maintainable in its current state.

Maybe use a macro to generate the tables? There is https://github.com/rust-phf/rust-phf or I could write something custom.

mdecimus commented 3 weeks ago

It was generated by gperf. I'll update the table as soon as I have some time. Moving to rust-phf is not an option for this crate.

Xenira commented 3 weeks ago

I can do this as well. Now that I know how it was generated i should be able to update it.

Just out of curiosity, why would a macro (like rust-phf) not be an option here?

mdecimus commented 3 weeks ago

Just out of curiosity, why would a macro (like rust-phf) not be an option here?

rust-phf maps strings to enum variants and this crate needs to map strings to decoder function references. So if rust-phf is used an extra branch is required to map the enum variant to the decoder function reference.

Xenira commented 3 weeks ago

rust-phf maps strings to enum variants and this crate needs to map strings to decoder function references. So if rust-phf is used an extra branch is required to map the enum variant to the decoder function reference.

Tested it and it works with DecoderFnc.

static KEYWORDS: phf::Map<&'static [u8], DecoderFnc> = phf::phf_map! {
    b"utf-8" => decoder_utf8,
    b"utf-16" => decoder_utf16,
    b"utf-16be" => decoder_utf16_be,
    b"utf-16le" => decoder_utf16_le,
    b"utf-7" => decoder_utf7,
    [...]
}

and access with

pub fn charset_decoder(charset: &[u8]) -> Option<&DecoderFnc> {
   KEYWORDS.get(charset)
}
mdecimus commented 2 weeks ago

Although phf is a proc macro, I wanted to avoid an extra dependency. I just updated the perfect hash with the new mappings that treat both - and _ as the same character for all charset names.

Xenira commented 2 weeks ago

Thanks for updating the perfect hash.

I can respect not wanting to add a dependency. Some documentation / Adding a script for regenerating the table to the repo would be great though.

mdecimus commented 2 weeks ago

Some documentation / Adding a script for regenerating the table to the repo would be great though.

Yes, in the near future I plan to replace this with a perfect hashing macro (that will be used by all Stalwart libraries). Although phf is great, it lacks support for streaming parsing which is required by some of the Stalwart libraries. For example, header names are hashed directly from a byte stream and, the last time I checked, phf did not support this.

Please let me know once you've tested this change and I'll publish a new version to crates.io.

Xenira commented 2 weeks ago

The koi-8 is working as expected. Thanks a lot!

I found 4 mails with german umlauts that don't display correctly now. Ill try to figure out if its related and let you know what I find.

Xenira commented 2 weeks ago

Seems like those did not work correctly even before this change. Ill create a new issue for that.

Edit: Seems like they use invalid charset de-ascii for ISO-8859-1 or no charset at all instead using Content-Language: de. One other is with quoted-printable and using the html decimal code instead of utf-8. Those cases seem broken enough to ignore for now.

mdecimus commented 1 week ago

The koi-8 is working as expected. Thanks a lot!

Great!