totalgood / nlpia

Examples and libraries for "Natural Language Processing in Action" book
http://bit.ly/gh-nlpia-book
MIT License

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 360: character maps to <undefined> #34

Open narasimha1805 opened 4 years ago

narasimha1805 commented 4 years ago

I get "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 360: character maps to <undefined>" while importing word_topic_vectors from nlpia.book.examples.ch04_catdog_lsa_3x6x16.

Below is the error:

UnicodeDecodeError                        Traceback (most recent call last)
in <module>
----> 1 from nlpia.book.examples.ch04_catdog_lsa_3x6x16 import word_topic_vectors

d:\python\lib\site-packages\nlpia\book\examples\ch04_catdog_lsa_3x6x16.py in <module>
     68 tfidfer = TfidfVectorizer(min_df=2, max_df=.6, stop_words=None, token_pattern=r'(?u)\b\w+\b')
     69
---> 70 corpus = get_data('cats_and_dogs')[:NUM_DOCS]
     71 docs = normalize_corpus_words(corpus, stemmer=None)
     72 tfidf_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())

d:\python\lib\site-packages\nlpia\loaders.py in get_data(name, nrows, limit)
   1111         return filepaths[name]
   1112     elif name in DATASET_NAME2FILENAME:
-> 1113         return read_named_csv(name, nrows=nrows)
   1114     elif name in DATA_NAMES:
   1115         return read_named_csv(DATA_NAMES[name], nrows=nrows)

d:\python\lib\site-packages\nlpia\loaders.py in read_named_csv(name, data_path, nrows, verbose)
   1003         name = DATASET_NAME2FILENAME[name]
   1004     if name.lower().endswith('.txt') or name.lower().endswith('.txt.gz'):
-> 1005         return read_text(os.path.join(data_path, name), nrows=nrows)
   1006     else:
   1007         return read_csv(os.path.join(data_path, name), nrows=nrows)

d:\python\lib\site-packages\nlpia\futil.py in read_text(forfn, nrows, verbose)
    416     """
    417     tqdm_prog = tqdm if verbose else no_tqdm
--> 418     nrows = wc(forfn, nrows=nrows)  # not necessary when nrows==None
    419     lines = np.empty(dtype=object, shape=nrows)
    420     with ensure_open(forfn) as f:

d:\python\lib\site-packages\nlpia\futil.py in wc(f, verbose, nrows)
     48     tqdm_prog = tqdm if verbose else no_tqdm
     49     with ensure_open(f, mode='r') as fin:
---> 50         for i, line in tqdm_prog(enumerate(fin)):
     51             if nrows is not None and i >= nrows - 1:
     52                 break

d:\python\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1592: character maps to <undefined>

woo9904 commented 3 years ago

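This is a UnicodeDecodeError. The traceback shows the file being decoded with cp1252, the default text encoding on that Windows machine, although the data is not valid cp1252. Passing the encoding explicitly when opening the file avoids this: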

file = open(filename, encoding="utf8")
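To see why the codec matters, here is a minimal sketch that reproduces the same message without nlpia. It assumes the offending character is a right double quotation mark (U+201D); the thread does not say which character the file actually contains, only that the failing byte is 0x9d, which is the last byte of that character's UTF-8 encoding.

```python
# Assumption: the problem character is U+201D (right double quotation mark).
# Its UTF-8 encoding ends in the byte 0x9d, which cp1252 leaves undefined.
raw = "\u201d".encode("utf-8")  # b'\xe2\x80\x9d'

try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as err:
    # Same failure as in the traceback: the 'charmap' codec rejects byte 0x9d.
    cp1252_error = str(err)

# Decoding the same bytes as UTF-8 succeeds.
decoded = raw.decode("utf-8")

print(cp1252_error)
print(ascii(decoded))
```

The bytes are identical in both cases; only the codec used to interpret them differs.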

To fix it, find the futil.py file installed on your computer (in this case d:\python\lib\site-packages\nlpia\futil.py).

Find the function named ensure_open and change its body as follows:

fin = f
if isinstance(f, basestring):
    if len(f) <= MAX_LEN_FILEPATH:
        f = find_filepath(f) or f
        if f and (not hasattr(f, 'seek') or not hasattr(f, 'readlines')):
            # Decode as UTF-8 instead of the platform default (cp1252 in the traceback).
            # Note: gzip.open() accepts encoding only in text mode (e.g. 'rt').
            if f.lower().endswith('.gz'):
                return gzip.open(f, mode=mode, encoding='UTF-8')
            return open(f, mode=mode, encoding='UTF-8')
        f = fin  # reset path in case it is the text that needs to be opened with StringIO
    else:
        f = io.StringIO(f)
elif f and getattr(f, 'closed', None):
    # Reopen a closed file object by name, again decoding as UTF-8.
    if hasattr(f, '_write_gzip_header'):
        return gzip.open(f.name, mode=mode, encoding='UTF-8')
    else:
        return open(f.name, mode=mode, encoding='UTF-8')
return f

I just added encoding='UTF-8' to every open() and gzip.open() call.
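The same idea can be sketched as a small standalone helper, in case it is useful for testing outside the library. The names open_text_utf8 and roundtrip are hypothetical and are not part of nlpia. One detail the sketch makes explicit: gzip.open() treats mode 'r' as binary and rejects an encoding argument there, so the helper always requests text mode.

```python
import gzip
import os
import tempfile


def open_text_utf8(path, mode="r"):
    """Open a plain or gzipped text file, always decoding as UTF-8.

    Hypothetical helper for illustration; not nlpia's ensure_open().
    """
    if "b" in mode:
        raise ValueError("binary mode cannot take an encoding")
    if "t" not in mode:
        mode += "t"  # gzip.open() treats 'r' as binary, so ask for text mode
    opener = gzip.open if path.lower().endswith(".gz") else open
    return opener(path, mode=mode, encoding="utf-8")


def roundtrip(text, suffix):
    """Write text to a temporary file with the given suffix and read it back."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        with open_text_utf8(path, "w") as fout:
            fout.write(text)
        with open_text_utf8(path, "r") as fin:
            return fin.read()
    finally:
        os.remove(path)


sample = "cats \u201cand\u201d dogs\n"
print(roundtrip(sample, ".txt") == sample)
print(roundtrip(sample, ".txt.gz") == sample)
```

Because the encoding is fixed in one place, the result no longer depends on the platform's default codec.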

danielgran commented 3 years ago

It doesn't work for me either. What's the problem?

danielgran commented 3 years ago

Unfortunately, that prints this error:

File "gensim/_matutils.pyx", line 1, in init gensim._matutils
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

danielgran commented 3 years ago

Ah, never mind. This fixes it. Thank you!

hsang commented 3 years ago

Thanks, it works!