kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Test failures when cchardet-2.1.7 and chardet are both installed #318

Open mgorny opened 2 years ago

mgorny commented 2 years ago

When cchardet-2.1.7 and chardet-5.0.0 are both installed, the following tests fail.

From what I can see, two of them fail because of encoding name mismatches (the expected name is mixed-case, the detected value is uppercase), and the other two are detected as a superset of the expected encoding (EUC-KR as UHC, and GB2312 as GB18030).
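
Both kinds of mismatch can be illustrated with a small comparison helper. This is only a sketch, not feedparser's code or a proposed fix: `SUPERSETS`, `canonical`, and `encodings_compatible` are hypothetical names, and the superset table lists only the two pairs mentioned here. It relies on Python's `codecs.lookup`, which resolves names case-insensitively and knows `UHC` as an alias of `cp949`.

```python
import codecs

# Hypothetical table: canonical codec name -> a codec that is a superset of it.
# Only the two pairs mentioned in this report are listed.
SUPERSETS = {
    'euc_kr': 'cp949',    # cchardet reports UHC, which Python calls cp949
    'gb2312': 'gb18030',
}


def canonical(name):
    """Return Python's canonical codec name, ignoring case and aliases."""
    return codecs.lookup(name).name


def encodings_compatible(expected, detected):
    """True if both names resolve to the same codec, or detected is a listed superset."""
    exp, det = canonical(expected), canonical(detected)
    return exp == det or SUPERSETS.get(exp) == det


for expected, detected in [('windows-1255', 'WINDOWS-1255'), ('Big5', 'BIG5'),
                           ('EUC-KR', 'UHC'), ('GB2312', 'GB18030')]:
    print(expected, detected, encodings_compatible(expected, detected))
```

Whether the tests should accept a superset encoding at all is a design decision for the maintainers; the sketch only shows that the four reported mismatches fall into these two categories.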

...F...FF.F..... [remaining progress dots trimmed; no further failures]
======================================================================
FAIL: test_001742 (__main__.TestCase)
./tests/illformed/chardet/windows1255.xml: windows-1255 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'windows-1255'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as WINDOWS-1255'),
 'content-type': '',
 'encoding': 'WINDOWS-1255',
 'entries': [{'summary': 'האם תדפיס נייר של אתר אינטרנט שמוצג על מסך משתמש הוא '
                         'העתק נאמן למקור של אתר האינטרנט? רבים יגידו שכן, '
                         'ולפעמים גם בתי המשפט יצטרפו אליהם שיקבלו פלט מאתר '
                         'אינטרנט כראיה קבילה. אבל, זה ממש לא כך. ויש אפילו '
                         'הוכחה מדהימה.',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'האם תדפיס נייר של אתר אינטרנט שמוצג '
                                          'על מסך משתמש הוא העתק נאמן למקור של '
                                          'אתר האינטרנט? רבים יגידו שכן, '
                                          'ולפעמים גם בתי המשפט יצטרפו אליהם '
                                          'שיקבלו פלט מאתר אינטרנט כראיה '
                                          'קבילה. אבל, זה ממש לא כך. ויש אפילו '
                                          'הוכחה מדהימה.'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001746 (__main__.TestCase)
./tests/illformed/chardet/gb2312.xml: GB2312 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'GB2312'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as GB18030'),
 'content-type': '',
 'encoding': 'GB18030',
 'entries': [{'title': '不归移民漫画系列:专业工作',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': '不归移民漫画系列:专业工作'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001747 (__main__.TestCase)
./tests/illformed/chardet/euckr.xml: EUC-KR with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'EUC-KR'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as UHC'),
 'content-type': '',
 'encoding': 'UHC',
 'entries': [{'summary': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 된 닉네임을 정할 경우에, '
                         'EUC-KR로 된 무버블타입 블록에선 리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 '
                         '깨어져 나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 한글로 사용하는 많은 분들도 '
                         '타입키에서의 닉네임은 이런 문제때문에 울며겨자먹기로 영어로 짓고 있다....',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 '
                                          '된 닉네임을 정할 경우에, EUC-KR로 된 무버블타입 블록에선 '
                                          '리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 깨어져 '
                                          '나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 '
                                          '한글로 사용하는 많은 분들도 타입키에서의 닉네임은 이런 '
                                          '문제때문에 울며겨자먹기로 영어로 짓고 있다....'},
              'title': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001749 (__main__.TestCase)
./tests/illformed/chardet/big5.xml: Big5 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'Big5'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as BIG5'),
 'content-type': '',
 'encoding': 'BIG5',
 'entries': [],
 'feed': {'title': '我希望??很容易?其翻?成中文,并有助于改??件。 感?您??本文。',
          'title_detail': {'base': '',
                           'language': None,
                           'type': 'text/plain',
                           'value': '我希望??很容易?其翻?成中文,并有助于改??件。 感?您??本文。'}},
 'headers': {},
 'namespaces': {'': 'http://www.w3.org/2005/Atom'},
 'version': 'atom10'})

----------------------------------------------------------------------
Ran 4354 tests in 4.892s

FAILED (failures=4)

maksverver commented 2 months ago

I ran into the same problem. Here's a snippet that can be used to show the differences between chardet and cchardet.

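import glob

import cchardet
import chardet

# Compare the encoding each detector reports for every chardet test fixture.
for path in glob.glob('tests/illformed/chardet/*'):
    with open(path, 'rb') as f:
        data = f.read()
    enc1 = chardet.detect(data)['encoding']
    enc2 = cchardet.detect(data)['encoding']
    status = 'same' if enc1 == enc2 else 'different'
    print('%-40s %-20s %-20s %s' % (path, enc1, enc2, status))

Output (columns: path, chardet result, cchardet result, comparison):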
tests/illformed/chardet/koi8r.xml        KOI8-R               KOI8-R               same
tests/illformed/chardet/windows1255.xml  windows-1255         WINDOWS-1255         different
tests/illformed/chardet/gb2312.xml       GB2312               GB18030              different
tests/illformed/chardet/big5.xml         Big5                 BIG5                 different
tests/illformed/chardet/shiftjis.xml     SHIFT_JIS            SHIFT_JIS            same
tests/illformed/chardet/eucjp.xml        EUC-JP               EUC-JP               same
tests/illformed/chardet/euckr.xml        EUC-KR               UHC                  different
tests/illformed/chardet/tis620.xml       TIS-620              TIS-620              same
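
For the two superset rows (GB2312 vs. GB18030 and EUC-KR vs. UHC), the decoded text should not differ, because bytes valid in the narrower encoding decode to the same characters in the wider one. The following is a minimal sketch of that check using only Python's built-in codecs; `samples` and `decodes_identically` are hypothetical names, and the strings are short excerpts from the test output earlier in this thread.

```python
# Each entry: (text, narrow codec, superset codec as named by Python).
samples = [
    ('한글', 'euc_kr', 'cp949'),                        # EUC-KR bytes decoded as UHC
    ('不归移民漫画系列:专业工作', 'gb2312', 'gb18030'),  # GB2312 bytes decoded as GB18030
]


def decodes_identically(text, narrow, wide):
    """Encode with the narrow codec, then compare decoding with both codecs."""
    raw = text.encode(narrow)
    return raw.decode(narrow) == raw.decode(wide)


for text, narrow, wide in samples:
    print(narrow, wide, decodes_identically(text, narrow, wide))
```

This only demonstrates that the parsed content is unaffected for these samples; the reported `encoding` value still differs, which is what the test assertions compare.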