nickspring / charset-normalizer-rs

Truly universal encoding detector in pure Rust - port of Python version
https://crates.io/crates/charset-normalizer-rs
MIT License

Fixes #24

Closed. chris-ha458 closed this 1 year ago

chris-ha458 commented 1 year ago

Being more strict with cargo clippy and cargo test, and looking further into cargo clippy -- -Wclippy::pedantic

As said before, pedantic is just that. Pedantic. There are many false positives and we don't need to fix all of them. But it still would be valuable to understand why they are false positives and document them if necessary.
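One way to document such a false positive is to allow the specific lint on the item and say why in a comment. A minimal sketch: `ratio` is a hypothetical helper and not code from this crate, while `clippy::cast_precision_loss` is a real lint in the pedantic group.

```rust
/// Hypothetical helper (not from this crate): share of `hits` in `total`.
// Pedantic flags `usize as f32` because very large values lose precision.
// Assumption for this sketch: the counts stay small enough that the loss
// does not matter, so the lint is allowed here and the reason is recorded.
#[allow(clippy::cast_precision_loss)]
fn ratio(hits: usize, total: usize) -> f32 {
    if total == 0 {
        // Avoid dividing by zero on empty input.
        return 0.0;
    }
    hits as f32 / total as f32
}
```

Keeping the allow on the smallest possible item, rather than at crate level, means the lint still fires everywhere else.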

chris-ha458 commented 1 year ago

More drops in accuracy. I should reassess.

chris-ha458 commented 1 year ago

Ah, after pulling in 9217701 the accuracy is the same again.

nickspring commented 1 year ago

Hm... strange, but in the original lib these changes didn't impact accuracy.

chris-ha458 commented 1 year ago

Considering the difference is only 0.3%, we could lay out every single example and compare how each one is processed by each version:

  1. original (97.1)
  2. fix found in python version (96.8)
  3. canonical fix (95.8)

If you want that prioritized, we can do it, but since these fixes give the same accuracy as current main, they can be merged regardless.

Of course, if you'd rather hold off on any other fixes until the root cause of the discrepancy between 1, 2 and 3 has been found, we can do that as well.
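A rough sketch of what that per-example comparison could look like, assuming each version's output has already been collected into a map from sample file name to detected encoding. The `Results` type and `diff_results` function are hypothetical names for illustration, not part of the crate.

```rust
use std::collections::BTreeMap;

/// Detected encoding per sample file for one version of the code.
/// (Hypothetical shape; how the results are collected is up to the caller.)
type Results = BTreeMap<String, String>;

const MISSING: &str = "<missing>";

/// Returns the samples whose detected encoding differs between two versions,
/// as (file, encoding in `a`, encoding in `b`). Samples present on only one
/// side are reported with "<missing>" on the other.
fn diff_results(a: &Results, b: &Results) -> Vec<(String, String, String)> {
    let mut out = Vec::new();
    for (file, enc_a) in a {
        let enc_b = b.get(file).cloned().unwrap_or_else(|| MISSING.to_string());
        if *enc_a != enc_b {
            out.push((file.clone(), enc_a.clone(), enc_b));
        }
    }
    for (file, enc_b) in b {
        if !a.contains_key(file) {
            out.push((file.clone(), MISSING.to_string(), enc_b.clone()));
        }
    }
    out
}
```

Running this pairwise (1 vs 2, 2 vs 3) would narrow the search to the handful of samples that actually change.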

nickspring commented 1 year ago

I've checked. All the affected cases have zeros (they compare 0 vs 0). I believe we should restore this code but exclude this case (if both multibyte usages are 0, we should compare mess instead). Please try to bring it back with this condition.
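A minimal sketch of that condition, under stated assumptions: `Candidate` and its fields are hypothetical stand-ins rather than the crate's actual types, and the sketch assumes higher multibyte usage and lower mess both rank first.

```rust
use std::cmp::Ordering;

/// Hypothetical candidate; field names are illustrative, not the crate's API.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    multibyte_usage: f32,
    mess: f32,
}

/// Orders candidates best-first.
/// Assumptions: higher multibyte usage is better, lower mess is better.
fn compare(a: &Candidate, b: &Candidate) -> Ordering {
    if a.multibyte_usage == 0.0 && b.multibyte_usage == 0.0 {
        // The 0 vs 0 case: multibyte usage carries no signal,
        // so fall back to comparing mess (lower sorts first).
        return a.mess.total_cmp(&b.mess);
    }
    // Otherwise the candidate with higher multibyte usage sorts first.
    b.multibyte_usage.total_cmp(&a.multibyte_usage)
}
```

The function can be passed straight to `sort_by` on a `Vec<Candidate>`.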

chris-ha458 commented 1 year ago

Do you want me to do that on top of here or on a separate PR focused on only this issue? (I think that would be better)

nickspring commented 1 year ago

Hm, you're right, it will be better to have it in a separate PR. Concerning this PR: I've sent a question about a change which I cannot understand.

chris-ha458 commented 1 year ago

I cannot see the questions for this PR. Did you leave them as code comments?