chardet/chardet
### [`v4.0.0`](https://togithub.com/chardet/chardet/releases/4.0.0)
[Compare Source](https://togithub.com/chardet/chardet/compare/3.0.4...4.0.0)
⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️
##### Major Changes
This release has been multiple years in the making and provides some quality-of-life improvements to chardet. The primary user-facing changes are:
1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See [#121](https://togithub.com/chardet/chardet/issues/121) for details)
2. The `CharsetGroupProber` class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
3. There is now a `chardet.detect_all` function that returns a list of possible encodings for the input with associated confidences.
4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
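As a rough illustration of item 3, the sketch below ranks the candidates that `chardet.detect_all` returns. It assumes each candidate is a dict with `encoding` and `confidence` keys, as described in the list above. The helper `confident_encodings` and its `min_confidence` parameter are hypothetical names introduced here and are not part of chardet.

```python
try:
    import chardet  # third-party; detect_all is available from chardet 4.0.0
except ImportError:  # keep the sketch importable where chardet is not installed
    chardet = None


def confident_encodings(candidates, min_confidence=0.5):
    """Return encoding names with confidence >= min_confidence, best first.

    `candidates` is assumed to be a list of dicts with "encoding" and
    "confidence" keys (the shape described for chardet.detect_all).
    This helper is hypothetical and not part of chardet itself.
    """
    ranked = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    return [
        c["encoding"]
        for c in ranked
        if c["encoding"] is not None and c["confidence"] >= min_confidence
    ]


if chardet is not None and hasattr(chardet, "detect_all"):
    # Illustrative input only; the detected encodings depend on the data.
    sample = "Привет, мир! Это пример текста на русском языке.".encode("windows-1251")
    print(confident_encodings(chardet.detect_all(sample), min_confidence=0.2))
```

Filtering on a threshold is one way to use the extra candidates: a caller can try each remaining encoding in turn instead of trusting a single guess.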
The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see [#99](https://togithub.com/chardet/chardet/issues/99) for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).
##### Benchmarks
Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM
##### old version (chardet 3.0.4)
Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738
windows-1255: 6.336261495718825
Total time: 357.05358052253723s (10.054513372323958 calls per second)
##### new version (chardet 4.0.0)
Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104
Total time: 268.0230791568756s (13.394368915143872 calls per second)
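To put the two runs side by side, the short calculation below derives speedup ratios from the figures printed above. The selection of encodings is arbitrary and the variable names are illustrative. For this particular run it works out to roughly 1.33x overall (about 25% less total time), with utf-8 gaining the most among the selected encodings and iso-8859-1 coming out slower.

```python
# Figures copied from the benchmark output above.
OLD_TOTAL_SECONDS = 357.05358052253723  # chardet 3.0.4
NEW_TOTAL_SECONDS = 268.0230791568756   # chardet 4.0.0

# encoding: (calls per second on 3.0.4, calls per second on 4.0.0)
CALLS_PER_SECOND = {
    "utf-8": (13.966236809766901, 108.62174360115105),
    "koi8-r": (13.72481001727257, 41.21560517643845),
    "windows-1252": (99.48409117053738, 211.27653358317968),
    "iso-8859-1": (120.63424740403983, 59.25262902122995),
}


def speedup(old_rate, new_rate):
    """Ratio of new to old throughput; values below 1.0 mean a slowdown."""
    return new_rate / old_rate


# Same workload in both runs, so the ratio of total times is the overall speedup.
total_speedup = OLD_TOTAL_SECONDS / NEW_TOTAL_SECONDS
time_saved = 1 - NEW_TOTAL_SECONDS / OLD_TOTAL_SECONDS

print(f"overall: {total_speedup:.2f}x faster, {time_saved:.0%} less time")
for name, (old, new) in CALLS_PER_SECOND.items():
    print(f"{name}: {speedup(old, new):.2f}x")
```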
Thank you to [@aaaxx](https://togithub.com/aaaxx), [@edumco](https://togithub.com/edumco), [@hrnciar](https://togithub.com/hrnciar), [@hroncok](https://togithub.com/hroncok), [@jdufresne](https://togithub.com/jdufresne), [@mdamien](https://togithub.com/mdamien), [@saintamh](https://togithub.com/saintamh), and [@xeor](https://togithub.com/xeor) for submitting pull requests, and to all of our users for being patient with how long this release has taken.
##### Full changelog
- Convert single-byte charset probers to use nested dicts for language models ([#121](https://togithub.com/chardet/chardet/issues/121)) [@dan-blanchard](https://togithub.com/dan-blanchard)
- Add API option to get all the encodings confidence ([#111](https://togithub.com/chardet/chardet/issues/111)) [@mdamien](https://togithub.com/mdamien)
- Make sure pyc files are not in tarballs ([`d7c7343`](https://togithub.com/chardet/chardet/commit/d7c7343)) [@dan-blanchard](https://togithub.com/dan-blanchard)
- Add benchmark script ([`d702545`](https://togithub.com/chardet/chardet/commit/d702545), [`8dccd00`](https://togithub.com/chardet/chardet/commit/8dccd00), [`726973e`](https://togithub.com/chardet/chardet/commit/726973e), [`71a0fad`](https://togithub.com/chardet/chardet/commit/71a0fad)) [@dan-blanchard](https://togithub.com/dan-blanchard)
- Include license file in the generated wheel package ([#141](https://togithub.com/chardet/chardet/issues/141)) [@jdufresne](https://togithub.com/jdufresne)
- Drop support for Python 2.6 ([#143](https://togithub.com/chardet/chardet/issues/143)) [@jdufresne](https://togithub.com/jdufresne)
- Remove unused coverage configuration ([#142](https://togithub.com/chardet/chardet/issues/142)) [@jdufresne](https://togithub.com/jdufresne)
- Doc the chardet package suitable for production ([#144](https://togithub.com/chardet/chardet/issues/144)) [@jdufresne](https://togithub.com/jdufresne)
- Pass python_requires argument to setuptools ([#150](https://togithub.com/chardet/chardet/issues/150)) [@jdufresne](https://togithub.com/jdufresne)
- Update pypi.python.org URL to pypi.org ([#155](https://togithub.com/chardet/chardet/issues/155)) [@jdufresne](https://togithub.com/jdufresne)
- Typo fix ([#159](https://togithub.com/chardet/chardet/issues/159)) [@saintamh](https://togithub.com/saintamh)
- Support pytest 4, don't apply marks directly to parameters (PR [#174](https://togithub.com/chardet/chardet/issues/174), Issue [#173](https://togithub.com/chardet/chardet/issues/173)) [@hroncok](https://togithub.com/hroncok)
- Test Python 3.7 and 3.8 and document support ([#175](https://togithub.com/chardet/chardet/issues/175)) [@jdufresne](https://togithub.com/jdufresne)
- Drop support for end-of-life Python 3.4 ([#181](https://togithub.com/chardet/chardet/issues/181)) [@jdufresne](https://togithub.com/jdufresne)
- Workaround for distutils bug in python 2.7 ([#165](https://togithub.com/chardet/chardet/issues/165)) [@xeor](https://togithub.com/xeor)
- Remove deprecated license_file from setup.cfg ([#182](https://togithub.com/chardet/chardet/issues/182)) [@jdufresne](https://togithub.com/jdufresne)
- Remove deprecated 'sudo: false' from Travis configuration ([#200](https://togithub.com/chardet/chardet/issues/200)) [@jdufresne](https://togithub.com/jdufresne)
- Add testing for Python 3.9 ([#201](https://togithub.com/chardet/chardet/issues/201)) [@jdufresne](https://togithub.com/jdufresne)
- Adds explicit os and distro definitions ([#140](https://togithub.com/chardet/chardet/issues/140)) [@edumco](https://togithub.com/edumco)
- Remove shebang from nonexecutable script ([#192](https://togithub.com/chardet/chardet/issues/192)) [@hrnciar](https://togithub.com/hrnciar)
- Remove use of deprecated 'setup.py test' ([#187](https://togithub.com/chardet/chardet/issues/187)) [@jdufresne](https://togithub.com/jdufresne)
- Remove unnecessary numeric placeholders from format strings ([#176](https://togithub.com/chardet/chardet/issues/176)) [@jdufresne](https://togithub.com/jdufresne)
- Update links ([#152](https://togithub.com/chardet/chardet/issues/152)) [@aaaxx](https://togithub.com/aaaxx)
- Remove shebang and executable bit from chardet/cli/chardetect.py ([#171](https://togithub.com/chardet/chardet/issues/171)) [@jdufresne](https://togithub.com/jdufresne)
- Handle weird logging edge case in universaldetector.py ([`056a2a4`](https://togithub.com/chardet/chardet/commit/056a2a4)) [@dan-blanchard](https://togithub.com/dan-blanchard)
- Switch from Travis to GitHub Actions ([#204](https://togithub.com/chardet/chardet/issues/204)) [@dan-blanchard](https://togithub.com/dan-blanchard)
- Properly set CharsetGroupProber.state to FOUND_IT (PR [#203](https://togithub.com/chardet/chardet/issues/203), Issue [#202](https://togithub.com/chardet/chardet/issues/202)) [@dan-blanchard](https://togithub.com/dan-blanchard)
- Add language to detect_all output ([`1e208b7`](https://togithub.com/chardet/chardet/commit/1e208b7)) [@dan-blanchard](https://togithub.com/dan-blanchard)
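The `CharsetGroupProber` entry in the changelog above concerns short-circuiting once a prober reports a definite match. The toy model below sketches that idea only. `State`, `ToyProber`, and `ToyGroupProber` are hypothetical stand-ins written for this example and do not reproduce chardet's actual classes or detection logic.

```python
from enum import Enum


class State(Enum):
    DETECTING = 0
    FOUND_IT = 1


class ToyProber:
    """Hypothetical prober: reports FOUND_IT once it sees its marker bytes."""

    def __init__(self, name, marker):
        self.name = name
        self.marker = marker
        self.calls = 0  # counts how often this prober was asked to do work

    def feed(self, data):
        self.calls += 1
        return State.FOUND_IT if self.marker in data else State.DETECTING


class ToyGroupProber:
    """Hypothetical group: stops consulting probers after a definite match."""

    def __init__(self, probers):
        self.probers = probers
        self.state = State.DETECTING
        self.best = None

    def feed(self, data):
        if self.state is State.FOUND_IT:
            return self.state  # short-circuit: no prober does further work
        for prober in self.probers:
            if prober.feed(data) is State.FOUND_IT:
                self.best = prober
                # Recording the match on the group is what lets later calls
                # (and callers inspecting the state) skip the remaining work.
                self.state = State.FOUND_IT
                return self.state
        return self.state


group = ToyGroupProber([ToyProber("a", b"\xff"), ToyProber("b", b"\x00")])
for chunk in (b"abc", b"\xffdef", b"ghi"):
    group.feed(chunk)
print(group.best.name, [p.calls for p in group.probers])
```

In this toy run the second prober is consulted only for the first chunk, and neither prober sees the third chunk, which is the kind of saved work the fix is about.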
Renovate configuration
:date: Schedule: At any time (no schedule defined).
:vertical_traffic_light: Automerge: Disabled by config. Please merge this manually once you are satisfied.
:recycle: Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
:no_bell: Ignore: Close this PR and you won't be reminded about this update again.
- [ ] If you want to rebase/retry this PR, check this box
This PR contains the following updates:
chardet: `==3.0.4` -> `==4.0.0`
This PR has been generated by WhiteSource Renovate. View repository job log here.