Switch from CRLF to LF line feed made the detector return different guess

BLKSerene commented 1 year ago

Describe the bug

The issue was found while testing Charset Normalizer on CI across different OSes.

To Reproduce

>>> import charset_normalizer
>>> text = '''English is a West Germanic language of the Indo-European language family, with its earliest forms spoken by the inhabitants of early medieval England.[3][4][5] It is named after the Angles, one of the ancient Germanic peoples that migrated to the island of Great Britain. English is genealogically West Germanic, closest related to the Low Saxon and Frisian languages; however, its vocabulary is also distinctively influenced by dialects of French (about 29% of modern English words) and Latin (also about 29%), plus some grammar and a small amount of core vocabulary influenced by Old Norse (a North Germanic language).[6][7][8] Speakers of English are called Anglophones.

The earliest forms of English, collectively known as Old English, evolved from a group of West Germanic (Ingvaeonic) dialects brought to Great Britain by Anglo-Saxon settlers in the 5th century and further mutated by Norse-speaking Viking settlers starting in the 8th and 9th centuries. Middle English began in the late 11th century after the Norman conquest of England, when considerable French (especially Old Norman) and Latin-derived vocabulary was incorporated into English over some three hundred years.[9][10] Early Modern English began in the late 15th century with the start of the Great Vowel Shift and the Renaissance trend of borrowing further Latin and Greek words and roots into English, concurrent with the introduction of the printing press to London. This era notably culminated in the King James Bible and plays of William Shakespeare.[11][12]

Modern English grammar is the result of a gradual change from a typical Indo-European dependent-marking pattern, with a rich inflectional morphology and relatively free word order, to a mostly analytic pattern with little inflection, and a fairly fixed subject–verb–object word order.[13] Modern English relies more on auxiliary verbs and word order for the expression of complex tenses, aspect and mood, as well as passive constructions, interrogatives and some negation.

Modern English has spread around the world since the 17th century as a consequence of the worldwide influence of the British Empire and the United States of America. Through all types of printed and electronic media of these countries, English has become the leading language of international discourse and the lingua franca in many regions and professional contexts such as science, navigation and law.[3] English is the most spoken language in the world[14] and the third-most spoken native language in the world, after Standard Chinese and Spanish.[15] It is the most widely learned second language and is either the official language or one of the official languages in 59 sovereign states. There are more people who have learned English as a second language than there are native speakers. As of 2005, it was estimated that there were over 2 billion speakers of English.[16] English is the majority native language in the United Kingdom, the United States, Canada, Australia, New Zealand and the Republic of Ireland (see Anglosphere), and is widely spoken in some areas of the Caribbean, Africa, South Asia, Southeast Asia, and Oceania.[17] It is a co-official language of the United Nations, the European Union and many other world and regional international organisations. It is the most widely spoken Germanic language, accounting for at least 70% of speakers of this Indo-European branch.'''# From wikipedia

>>> open('test.txt', 'w', encoding = 'utf_16_be', newline = '\r\n').write(text) # Windows-style line endings
3409
>>> charset_normalizer.from_path('test.txt').best().encoding # Correct!
'utf_16_be'
>>> open('test.txt', 'w', encoding = 'utf_16_be', newline = '\n').write(text) # Unix/Linux-style line endings
3409
>>> charset_normalizer.from_path('test.txt').best().encoding # Wrong!
'utf_16_le'
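
For reference, write() returns the number of characters written, not bytes, which is why both calls report 3409 even though the two files differ in size. The sketch below shows this with the standard library only. The `encode_with_newline` helper and the short `sample` string are hypothetical stand-ins, not part of the original report.

```python
import io


def encode_with_newline(text: str, newline: str, encoding: str = "utf_16_be") -> bytes:
    """Return the bytes that open(..., 'w', newline=newline) would write (hypothetical helper)."""
    buffer = io.BytesIO()
    # TextIOWrapper applies the same newline translation as open() in text mode.
    wrapper = io.TextIOWrapper(buffer, encoding=encoding, newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = buffer.getvalue()
    wrapper.detach()  # release the underlying buffer without closing it
    return data


# Short stand-in for the Wikipedia excerpt used in the report.
sample = "first paragraph\n\nsecond paragraph\n"

crlf_bytes = encode_with_newline(sample, "\r\n")  # Windows-style line endings
lf_bytes = encode_with_newline(sample, "\n")      # Unix/Linux-style line endings

# Each '\n' becomes '\r\n', i.e. one extra UTF-16 code unit (2 bytes) per newline.
extra_bytes = len(crlf_bytes) - len(lf_bytes)
print(extra_bytes, 2 * sample.count("\n"))
```

Either payload can then be passed to charset_normalizer.from_bytes(...).best().encoding to check the guess without writing to disk.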

Expected behavior

Always return 'utf_16_be', regardless of OS or line-ending style.

Desktop (please complete the following information):

OS: Windows 11 x64
Python version: 3.8.10 x64
Package version: 3.0.0

Ousret commented 1 year ago

The title is a bit misleading; in the end, I understood it as "switching from CRLF to LF made the detector return something else". I took the time to try to reproduce your issue and could not. I tested initially on Python 3.11, then, out of curiosity, set up 3.8.10, on both Windows 11 and Ubuntu. Nothing seems wrong: I got UTF-16-BE every time.

If your reproduction script was inaccurate, re-verify it and re-open this issue with additional information.

BLKSerene commented 1 year ago

@Ousret Sorry for the confusion, the text was missing some sentences. I've modified the code (the return value of write() should now be exactly 3409).

BLKSerene commented 1 year ago

I can't reopen this issue myself (or should I open a new one?). If you can re-verify this, please re-open it.

Ousret commented 1 year ago

OK. The reproducer script now outputs what you encountered. I have narrowed it down to utils.cut_sequence_chunks which did not cut chunks correctly.
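
The thread does not spell out how the chunking went wrong, so the following is only a general illustration of why chunk boundaries matter for UTF-16. It is not a description of the actual defect or of the fix in #233. It assumes ASCII-range text: a big-endian byte stream sliced at an odd offset reads as perfectly clean little-endian text.

```python
text = "English is a West Germanic language"
payload = text.encode("utf_16_be")  # 2 bytes per character, no BOM

# A chunk that starts on a code-unit boundary (even offset).
aligned = payload[0:20]
# A chunk that starts one byte late (odd offset); illustrative only.
misaligned = payload[1:21]

# Aligned chunk decodes correctly as big-endian.
print(aligned.decode("utf_16_be"))

# The misaligned chunk pairs each low byte with the next character's
# zero high byte, so it decodes as readable little-endian text.
print(misaligned.decode("utf_16_le"))

# Decoded as big-endian, the same misaligned chunk is unrelated characters.
print(misaligned.decode("utf_16_be") == text[:10])
```

Because replacing CRLF with LF shifts every byte offset after the first line break, a change in line endings can plausibly change which chunks a detector ends up scoring.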

Ousret commented 1 year ago

See #233

jawah / charset_normalizer

Switch from CRLF to LF line feed made the detector return different guess #232