scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

document classification example is broken #656

Closed mblondel closed 11 years ago

mblondel commented 12 years ago
$ python examples/document_classification_20newsgroups.py
[...]

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents (training set)
1353 documents (testing set)
4 categories

Extracting features from the training dataset using a sparse vectorizer
Traceback (most recent call last):
  File "examples/document_classification_20newsgroups.py", line 110, in 
    X_train = vectorizer.fit_transform(data_train.data)
  File "/home/mathieu/Desktop/projects/scikit-learn/sklearn/feature_extraction/text.py", line 637, in fit_transform
    X = self.tc.fit_transform(raw_documents)
  File "/home/mathieu/Desktop/projects/scikit-learn/sklearn/feature_extraction/text.py", line 401, in fit_transform
    term_count_current = Counter(self.analyzer.analyze(doc))
  File "/home/mathieu/Desktop/projects/scikit-learn/sklearn/feature_extraction/text.py", line 191, in analyze
    self.charset_error)
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xda in position 537: invalid continuation byte
ogrisel commented 12 years ago

I'm on it. I wonder what is the underlying encoding of the 20newsgroups data.

ogrisel commented 12 years ago

BTW, this can be reproduced with:

python -c "from sklearn.datasets import fetch_20newsgroups_vectorized; print fetch_20newsgroups_vectorized()"

if you delete the cached version of the extraction.

ogrisel commented 12 years ago

Ok, the content is latin-1 encoded. I have pushed a fix for the root problem: b071cb0abd7fb2a570f5b315d6e5663d4ddf370c. However, I don't really know yet how to make it easy for people who have the cached version of the dataset in their scikit-learn-data folder. I'll leave the issue open until then.

robertlayton commented 12 years ago

We could start marking the cache with the sklearn version number it was downloaded with. If any changes need to appear in future versions, we update the code to say something like if version < 0.13: update().

Highly untested code following:

import joblib

data = joblib.load(filename)
if last_good_version is not None and data.version < last_good_version:
    # download from the internet again, overwriting the existing version
    data = refetch(filename)  # hypothetical re-download helper, named here only for illustration

Then, each dataset file needs a "last_good_version" variable at the top, which by default is set to None (i.e. don't try fixing).

The simpler option is to add code that just fixes this instance and mark it as deprecated for version 0.13. If we forget about it, it will be highlighted and we can remove the updating code (but that doesn't help people upgrading directly from 0.10 to 0.14).

mblondel commented 12 years ago

I don't understand the problem: wouldn't the cache return a ready-to-use Bunch object, and thus not be affected by the bug reported in this issue?

ogrisel commented 12 years ago

The cache would return the raw bytes instead of the latin-1-decoded strings, as has been done since b071cb0abd7fb2a570f5b315d6e5663d4ddf370c.

ogrisel commented 12 years ago

And I agree that @robertlayton's solution is probably a good idea.

anirudhs2005 commented 12 years ago

I get a similar error. Please tell me what I should do.

ogrisel commented 12 years ago

Delete the content of your $HOME/scikit_learn_data folder.

anirudhs2005 commented 12 years ago

I am not working on the 20 newsgroups test data. I am actually writing a spam filter and I need the similarity between two documents. Here's what I have done. Note: a = [] holds the contents of the files required for the similarity calculation.

Type "help", "copyright", "credits" or "license" for more information.

>>> import glob, os
>>> a = []
>>> for filecon in glob.glob(os.path.join('spamtest/', '*.txt')):
...     text = open(filecon, 'r')
...     a.append(text.read())
...     text.close()
...

>>> from sklearn.feature_extraction.text import Vectorizer
>>> vect = Vectorizer()
>>> tfidf = vect.fit_transform(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf4 in position 1267: invalid continuation byte

ogrisel commented 12 years ago

Either you know the encoding / charset of your files and pass it to the vectorizer, or you write a script that converts your data to UTF-8 (e.g. by parsing the metadata in your email headers).

You can also make the vectorizer tolerant to encoding issues, but that might hide real problems with non-UTF-8 content. Read the documentation:

http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
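
For example, if your files turn out to be latin-1 encoded (just an illustration; substitute whatever charset your data actually uses), either of these would work. This is an untested sketch; the parameter is called charset in the scikit-learn version discussed in this thread, and a is the list of raw documents from the snippet above:

from sklearn.feature_extraction.text import CountVectorizer

# Option 1: tell the vectorizer which charset the files use.
vectorizer = CountVectorizer(charset='latin-1')
X = vectorizer.fit_transform(a)

# Option 2: convert the raw data to UTF-8 up front and keep the default charset.
a_utf8 = [doc.decode('latin-1').encode('utf-8') for doc in a]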

anirudhs2005 commented 12 years ago

Thank you. I converted the data to UTF-8. Initially I was writing a script to do the conversion, but then I found a tool called 'UTF-cast' which does the same job. Thanks a lot again.

amueller commented 11 years ago

Was the consensus here to implement @robertlayton's idea?

ogrisel commented 11 years ago

Yes, at least to me.

fannix commented 11 years ago

I guess I bumped into the same issue here. I ran the demo /examples/applications/topics_extraction_with_nmf.py, and the result is also "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1020: invalid start byte". The corrupted doc is:

Robert Ullmann Ariel@World.STD.COM +1 508 879 6994 x226 Quand Maigret poussa la porte du Tabac Fontaine, vers une heure et demie, le patron du bar, qui venait de se lever, descendait lentement un escalier en colima??on qui s'amor??ait dans l'arri??re-salle. ... Arriv?? derri??re le comptoir, il repousa le gar??on d'un geste n??gligent de la main, saisit une bouteille de vin blanc, un verre, m??langea au vin de l'eau min??rale et, la t??te renvers??e en arri??re, se gargarisa. -- Simenon [text is ISO 10646 UTF-1 universal character set]

amueller commented 11 years ago

@fannix this is not really the same issue ;) I think you got the encoding wrong, but that is something I always trip over, too. If you are lazy, you can set charset_error='ignore' in the CountVectorizer. That will result in invalid characters being discarded. This could mess up your text, as there are many accents.
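
For example (an untested one-liner; the parameter name matches the scikit-learn version discussed in this thread):

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(charset_error='ignore')  # undecodable bytes are silently dropped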

amueller commented 11 years ago

@ogrisel can we (by which I mean not me ;) add a short hint in the docs on how to make sure that the text is properly encoded? Or would that be too complicated? This is an issue that comes up a lot, and it seems people who don't usually handle text (like me) are using CountVectorizer without knowing what they are doing ;)

ogrisel commented 11 years ago

There is already plenty of good documentation on how to handle encoding for text files in Python, for instance:

http://docs.python.org/2/howto/unicode.html

But nobody reads it...

ogrisel commented 11 years ago

> If you are lazy, you can set charset_error='ignore' in the CountVectorizer. That will result in invalid characters being discarded. This could mess up your text, as there are many accents.

That will just result in wrong features being extracted. You can hope that the dataset will be redundant enough to get good enough classification but that can also fail completely.

You have to know which charset your text files are encoded with. If you don't know, ask the person who gave you the data. If that person is no longer available, you can use a heuristic encoding sniffer such as the file Unix command, but that is also prone to errors, so you should open the decoded files in a text editor and check manually that the content makes sense.

ogrisel commented 11 years ago

There is chardet, an LGPL-licensed charset sniffer written in Python. Maybe I could write a how-to in the docs...

robertlayton commented 11 years ago

+1 for chardet. I use an in-house function that first tries the "normal" codecs and then uses chardet if that fails, roughly like the sketch below. chardet is slow, though, but if you know all of a dataset uses the same codec then it shouldn't matter too much.
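
Something like this (an illustrative sketch, not the actual in-house code; it assumes the chardet package is installed):

import chardet

def decode_with_fallback(raw_bytes, codecs_to_try=('ascii', 'utf-8')):
    # Try the "normal" codecs first; they are fast and cover most data.
    for codec in codecs_to_try:
        try:
            return raw_bytes.decode(codec)
        except UnicodeDecodeError:
            continue
    # Fall back to chardet's guess; it is slow and only a guess,
    # so the decoded result still deserves a manual sanity check.
    guess = chardet.detect(raw_bytes)
    return raw_bytes.decode(guess['encoding'])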

ogrisel commented 11 years ago

I don't want to introduce a dependency on chardet, just show how to use it in a how to section of the text features doc.

rspeer commented 11 years ago

Chardet will not help anyway. For example, consider what would happen if the 20 Newsgroups data were run through chardet.

In a sense, sklearn is doing the wrong thing by reading 20 Newsgroups as Latin-1, because most of the messages that are not in ASCII are not in Latin-1 either.

But in another sense, it's fine, because the best one can do for extracting features is to assume they're all in some sort of single-byte encoding, and just distinguish features by what the bytes are. Decoding as Latin-1 (or any other single-byte encoding) accomplishes this.

If you need to make features out of arbitrary text, and you don't know the encoding of the text, decode it as if it were Latin-1.
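
For instance (a minimal illustration; the file name is made up, and latin-1 maps every one of the 256 possible byte values to a character, so the decode step can never fail):

raw = open('unknown_encoding.txt', 'rb').read()
text = raw.decode('latin-1')  # always succeeds; each byte becomes exactly one character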

ogrisel commented 11 years ago

Alright, so I think we are fine letting the 20 newsgroups dataset loader use latin-1. I'll close this issue and open a new one to improve the documentation.

ogrisel commented 11 years ago

I have opened #2105. @rspeer, please feel free to review and submit a PR on this if you are interested. I can help you get started with how the Sphinx narrative documentation is organized if you need.