Recover accuracy - Githubissues

nickspring / charset-normalizer-rs

Truly universal encoding detector in pure Rust - port of Python version

https://crates.io/crates/charset-normalizer-rs

MIT License

30 stars, 3 forks

Recover accuracy #25

Closed chris-ha458 closed 1 year ago

chris-ha458 commented 1 year ago

If mess_difference < 0.01, we check whether coherence_difference > 0.02 and return the ordering (PartialOrd) based on that. If not, we try to use the multibyte usage difference, provided it is big enough.
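The cascade described above can be sketched roughly as follows. This is only an illustration, not the crate's actual code: the `Candidate` struct, its field names, the `MULTIBYTE_THRESHOLD` constant, and the direction of each comparison (lower mess preferred, higher coherence and multibyte usage preferred) are all assumptions made for the example. Only the 0.01 and 0.02 thresholds come from the comment.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a detection candidate; names are illustrative.
#[derive(Debug, Clone, Copy)]
struct Candidate {
    mess: f32,            // assumed: lower is better
    coherence: f32,       // assumed: higher is better
    multibyte_usage: f32, // assumed: higher is better
}

/// Assumed cutoff for "big enough"; the thread does not state the real value.
const MULTIBYTE_THRESHOLD: f32 = f32::EPSILON;

/// Orders candidates so that the preferred one compares as `Ordering::Less`.
fn compare(a: &Candidate, b: &Candidate) -> Option<Ordering> {
    let mess_difference = (a.mess - b.mess).abs();
    if mess_difference < 0.01 {
        let coherence_difference = (a.coherence - b.coherence).abs();
        if coherence_difference > 0.02 {
            // Operands are swapped so the higher coherence sorts first.
            return b.coherence.partial_cmp(&a.coherence);
        }
        let multibyte_difference = (a.multibyte_usage - b.multibyte_usage).abs();
        if multibyte_difference > MULTIBYTE_THRESHOLD {
            // Operands are swapped so the higher multibyte usage sorts first.
            return b.multibyte_usage.partial_cmp(&a.multibyte_usage);
        }
    }
    // Fallback: the lower mess sorts first.
    a.mess.partial_cmp(&b.mess)
}
```

In this sketch the mess ratio is both the gate (a near tie opens the tie-breakers) and the final fallback when neither tie-breaker is decisive.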

Comparing with multibyte_usage_a.abs() > f32::EPSILON is idiomatic and covers the case where the value is 0.0 or very close to it.
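As a small illustration of that check, here is a minimal sketch; the helper name `is_nonzero` is made up for this example and is not part of the crate. Note that `f32::EPSILON` is the gap between 1.0 and the next representable `f32`, so it acts as an absolute tolerance and is best suited to values on roughly a unit scale, such as ratios.

```rust
/// Hypothetical helper: true when `value` is distinguishable from zero
/// under an absolute tolerance of `f32::EPSILON`.
fn is_nonzero(value: f32) -> bool {
    value.abs() > f32::EPSILON
}
```

With this helper, exact zero and tiny rounding residue are both treated as zero, while any clearly positive or negative value passes.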

However, it does not change the final accuracy at all.

chris-ha458 commented 1 year ago

Each commit represents a different way to express the same idea, but none of them makes a difference.

nickspring commented 1 year ago

I think it is not correct. The situation where multibyte_a = 0 and multibyte_b != 0 is totally valid. We just shouldn't make the decision on multibyte usage when multibyte_a == multibyte_b, for example 0 and 0, 3 and 3, etc. (mess should be used in that case).
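One possible reading of this rule, written as a sketch: the function name `multibyte_tiebreak`, the `None` return for "no decision", and the assumption that higher multibyte usage is preferred are all illustrative choices, not taken from the crate.

```rust
use std::cmp::Ordering;

/// Hypothetical tie-breaker: decides only when the two multibyte usages differ.
/// Returns `None` when they are (nearly) equal, so the caller can fall back
/// to comparing mess instead.
fn multibyte_tiebreak(multibyte_a: f32, multibyte_b: f32) -> Option<Ordering> {
    if (multibyte_a - multibyte_b).abs() <= f32::EPSILON {
        // 0 and 0, 3 and 3, etc.: no decision here.
        return None;
    }
    // 0 versus non-zero is a valid, decisive case.
    // Operands are swapped so the higher usage compares as `Less` (preferred).
    multibyte_b.partial_cmp(&multibyte_a)
}
```

The point of the sketch is that the check is on the difference between the two values, not on whether either value is zero on its own.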

chris-ha458 commented 1 year ago

> I think it is not correct. The situation where multibyte_a = 0 and multibyte_b != 0 is totally valid. We just shouldn't make the decision on multibyte usage when multibyte_a == multibyte_b, for example 0 and 0, 3 and 3, etc. (mess should be used in that case).

I'm not sure I fully understand what you mean. Can you show me code, or maybe pseudocode (if/else), for what you mean?

chris-ha458 commented 1 year ago

If I understand correctly, the final accuracy on your system is 97.1%? Running
cargo run --release --bin performance --all-features | tail -n 50
on my system shows a result of 96.8%:

--> A) charset-normalizer-rs Conclusions
   --> Accuracy: 96.8%
   --> Total time: 642.285389ms
   --> Avg time: 1.570379ms
   --> 50th: 662.846µs
   --> 95th: 4.530332ms
   --> 99th: 11.080102ms

Am I checking this right? Is your system showing higher than 97.0% with the same code?

nickspring commented 1 year ago

You can always check accuracy and speed in the output of the performance action: https://github.com/nickspring/charset-normalizer-rs/actions/runs/6378885659/job/17310431300?pr=25 I see 97.1% there, and I get 97.1% locally. What OS are you using?

chris-ha458 commented 1 year ago

Ah, you are correct.

I am using WSL2, but I plan to set up Windows and Linux (via VirtualBox) workflows.

nickspring commented 1 year ago

Interesting :) Maybe on this platform the encoding library offers fewer encodings...