jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[BUG] Regression and change of behaviour between 3.3.0 and 3.3.1 #520

Closed pombredanne closed 2 months ago

pombredanne commented 2 months ago

Describe the bug

Encoding detection changed recently and, IMHO, regressed. I found this through a CI failure ( https://dev.azure.com/nexB/commoncode/_build/results?buildId=14502&view=logs&jobId=ba20146e-138e-5341-c558-bc25972fe2bd&j=ba20146e-138e-5341-c558-bc25972fe2bd&t=18eddfd8-abe5-5f8c-405c-5d0e0bd4c25d ) in a project that uses beautifulsoup4, which in turn uses charset_normalizer.

To Reproduce

Note that I am using bs4's UnicodeDammit to show the side effects. I also added a direct call to charset_normalizer's detect() to show what charset_normalizer itself reports:

Up to 3.2.0 the behavior is stable:

$ pip install beautifulsoup4==4.12.3
$ pip install charset-normalizer==3.2.0
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
/includes/webform.componŇŞnts.inc/
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
windows-1250

Note the small change in 3.3.0:

$ pip install charset-normalizer==3.3.0
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
/includes/webform.compon훩nts.inc/
$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
johab

Note the big change in 3.3.1:

$ pip install charset-normalizer==3.3.1
$ python -c "from bs4.dammit import UnicodeDammit;print(UnicodeDammit(b'/includes/webform.compon\xd2\xaants.inc/').markup)"
⽩湣汵摥猯睥扦潲洮捯浰潮튪湴献楮振

$ python -c "import charset_normalizer as cn; print(cn.detect(b'/includes/webform.compon\xd2\xaants.inc/')['encoding'])"
utf_16_be
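To see why the three results look so different, the same bytes can be decoded by hand with each reported encoding. This is only an illustrative sketch using Python's built-in codecs, not charset_normalizer's detection logic; the `reported` mapping and the `decode_with_each` helper are names made up for this example, and `cp1250` is Python's codec name for windows-1250.

```python
# The exact payload from the report above.
payload = b"/includes/webform.compon\xd2\xaants.inc/"

# Encoding reported by cn.detect() for each charset-normalizer version,
# copied from the transcripts above ("cp1250" is Python's name for windows-1250).
reported = {"3.2.0": "cp1250", "3.3.0": "johab", "3.3.1": "utf_16_be"}


def decode_with_each(data, encodings):
    """Decode `data` once per encoding and return {version: decoded text}."""
    return {version: data.decode(enc) for version, enc in encodings.items()}


decoded = decode_with_each(payload, reported)
for version, text in decoded.items():
    # Print the version, the codec used, the decoded length and the text itself.
    print(f"{version:>5}  {reported[version]:<10} {len(text):>2} chars  {text}")
```

With the two single-byte-oriented encodings only the two non-ASCII bytes change meaning, while `utf_16_be` pairs up every two bytes, so the ASCII path disappears entirely and the 34 bytes become 17 characters.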

Expected behavior

I would expect the behavior of 3.2.0 or 3.3.0 to be treated as correct. The 3.3.1 result is not correct or, if it is, then the change should IMHO come with an API-breaking major version bump.


pombredanne commented 2 months ago

I reckon that on the surface this issue seems to be related to https://github.com/jawah/charset_normalizer/issues/391 ... but IMHO this is still a bug, as a single character or a small minority of characters should not dictate the whole encoding of the larger string that contains them.
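The point about a small minority of characters can be made concrete with a rough measure: the share of decoded characters that are plain ASCII under each candidate encoding. This is a sketch for illustration only; `ascii_share` is a hypothetical helper and is not how charset_normalizer scores candidates.

```python
# Same payload as in the report.
payload = b"/includes/webform.compon\xd2\xaants.inc/"


def ascii_share(text):
    """Return the fraction of characters in `text` that are plain ASCII."""
    return sum(ch.isascii() for ch in text) / len(text)


# Candidate encodings taken from the three detect() results in the report.
shares = {}
for enc in ("cp1250", "johab", "utf_16_be"):
    shares[enc] = ascii_share(payload.decode(enc))
    print(f"{enc:<10} ascii share = {shares[enc]:.2f}")
```

Under `cp1250` and `johab` almost every character stays ASCII (32 of 34 and 32 of 33 respectively), whereas under `utf_16_be` none does, even though only two of the 34 input bytes are non-ASCII.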

And I would NOT expect the behavior to change so drastically without a version bump, as this package is a dependency of pip, requests and other popular packages. The behavior of 3.3.0 or 3.2.0 is OK; 3.3.1 should become a 4.0.0 if you do not consider these changes a regression.

(I am assuming, maybe incorrectly, that you use some kind of semver'ish versioning scheme)

Ousret commented 2 months ago

> but IMHO this is still a bug, as a single character or a small minority of characters should not dictate the whole encoding of the larger string that contains them.

Indeed, this behavior is not ideal across minor versions.

> And I would NOT expect that the behavior would change so drastically without a version bump...IMHO an API breaking major version bump

Don't forget that we are dealing with a heuristic algorithm, and covering all the cases found in every other project can be next to impossible.

> kind of semver'ish versioning scheme)

We follow semver as best as we can.

Nevertheless, we fixed the presented case, and it will be available in the next minor.

pombredanne commented 2 months ago

@Ousret Thanks! :heart: