Closed: mblondel closed this issue 11 years ago.
I'm on it. I wonder what the underlying encoding of the 20newsgroups data is.
BTW, this can be reproduced with:
python -c "from sklearn.datasets import fetch_20newsgroups_vectorized; print fetch_20newsgroups_vectorized()"
if you delete the cached version of the extraction.
OK, the content is latin-1 encoded. I have pushed a fix for the root problem: b071cb0abd7fb2a570f5b315d6e5663d4ddf370c
However I don't really know yet how to make it easy for people that have the cached version of the dataset in their scikit-learn-data folder. I'll leave the issue open until then.
We could start marking the cache with the sklearn version number it was downloaded with. If any changes need to appear in future versions, we update the code to say something like if version < 0.13: update().
Highly untested code following:
data = joblib.load(filename)
if last_good_version is not None and data.version < last_good_version:
    # download from the internet again, overwriting the existing cached version
    data = refetch()  # placeholder for the actual re-download call
Then, each dataset file needs a "last_good_version" variable at the top, which by default is set to None (i.e. don't try fixing).
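A slightly fuller, still hedged sketch of that version-check idea (CACHE_VERSION, last_good_version and the fetch_and_extract callback are illustrative names only, not the actual loader code):

from sklearn.externals import joblib  # joblib was vendored inside sklearn at the time

CACHE_VERSION = 13        # stamp written into freshly built caches
last_good_version = None  # per-dataset module constant; None means "never invalidate"

def load_with_version_check(cache_path, fetch_and_extract):
    data = joblib.load(cache_path)
    if last_good_version is not None and getattr(data, 'version', 0) < last_good_version:
        # the cached copy predates the fix: rebuild it and overwrite the stale file
        data = fetch_and_extract()
        data.version = CACHE_VERSION
        joblib.dump(data, cache_path)
    return data

Caches written before the stamp existed have no version attribute, so getattr defaults them to 0 and they get rebuilt on the next load.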
The simpler option is to add code to just fix this instance and mark it as deprecated for version 0.13. If we forget about it, it will be highlighted and we can remove the updating code (but that doesn't fix people going immediately from 0.10 to 0.14).
I don't understand the problem: wouldn't the cache return a ready-to-use bunch object and thus not be affected by the bug reported in this issue?
The cache would return the raw bytes instead of the latin-1-decoded strings, as is done since b071cb0abd7fb2a570f5b315d6e5663d4ddf370c.
And I agree that @robertlayton's solution is probably a good idea.
I get a similar error. Please tell me what I should do.
Delete the content of your $HOME/scikit_learn_data folder.
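For reference, the same can be done from Python with the public helper (a small sketch; it removes the data home directory so the next fetch_* call re-downloads and re-extracts everything):

from sklearn.datasets import clear_data_home

# Removes $HOME/scikit_learn_data (or whatever data_home points to),
# forcing fetch_20newsgroups* to download and re-extract on the next call.
clear_data_home()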
I am not working on the 'newsgroups' test data. I am actually writing a spam filter and I need the similarity between two documents. So here's what I have done. Note: the list a holds the files required for the similarity calculation.
Type "help", "copyright", "credits" or "license" for more information.
import glob,os a=[] for filecon in glob.glob(os.path.join('spamtest/','*.txt')): ... text=open(filecon,'r') ... a.append(text.read()) ... text.close() ...
from sklearn.feature_extraction.text import Vectorizer vect=Vectorizer() tfidf=vect.fit_transform(a) Traceback (most recent call last): File "
", line 1, in File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 716, in fit_transform X = super(TfidfVectorizer, self).fit_transform(raw_documents) File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 398, in fit_transform term_count_current = Counter(analyze(doc)) File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 313, in tokenize(preprocess(self.decode(doc))), stop_words) File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 224, in decode doc = doc.decode(self.charset, self.charset_error) File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0xf4 in position 1267: invalid continuation byte
Either you know the encoding / charset of your files and pass it to the vectorizer, or you write a script that converts your data to UTF-8 (e.g. by parsing the metadata of your email headers).
You can also make the vectorizer tolerant to encoding issues, but that might hide real problems on non-UTF-8 content. Read the documentation.
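For example, against the 0.13-era API used in the traceback above (the parameters were later renamed to encoding and decode_error), and assuming the spam files are latin-1 encoded, reusing the list a from the session above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Option 1: tell the vectorizer which charset the raw files use.
vect = TfidfVectorizer(charset='latin-1')
tfidf = vect.fit_transform(a)

# Option 2: convert the raw byte strings to UTF-8 up front and keep the defaults.
a_utf8 = [doc.decode('latin-1').encode('utf-8') for doc in a]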
Thank you. I converted the data to UTF-8. Initially I was writing a script to convert it, but then there is this software called 'UTF-cast' which does the same job. Thanks a lot again.
Was the consensus here to implement @robertlayton's idea?
Yes, at least to me.
I guess I bumped into the same issues here. I ran the demo /examples/applications/topics_extraction_with_nmf.py. And the result is also "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1020: invalid start byte". The corrupted doc is "Robert Ullmann Ariel@World.STD.COM +1 508 879 6994 x226 Quand Maigret poussa la porte du Tabac Fontaine, vers une heure et demie, le patron du bar, qui venait de se lever, descendait lentement un escalier en colima??on qui s'amor??ait dans l'arri??re-salle. ... Arriv?? derri??re le comptoir, il repousa le gar??on d'un geste n??gligent de la main, saisit une bouteille de vin blanc, un verre, m??langea au vin de l'eau min??rale et, la t??te renvers??e en arri??re, se gargarisa. -- Simenon [text is ISO 10646 UTF-1 universal character set]"
@fannix this is not really the same issue ;) I think you got the encoding wrong, but that is something I always trip over, too. If you are lazy, you can set charset_error='ignore' in the CountVectorizer. That will result in invalid characters being discarded. This could mess up your text, as there are many accents.
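For instance, something along these lines (again the 0.13-era parameter name; it was later renamed decode_error), fit on the same list a of raw documents as above:

from sklearn.feature_extraction.text import CountVectorizer

# 'ignore' silently drops bytes that are not valid in the declared charset,
# which is convenient but lossy (see the caveat in the reply below).
vect = CountVectorizer(charset_error='ignore')
X = vect.fit_transform(a)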
@ogrisel can we (by which I mean not me ;) add a short hint in the docs on how to make sure that the text is properly encoded? Or would that be too complicated? This is an issue that comes up a lot, and it seems people that don't usually handle text (like me) are using CountVectorizer without knowing what they are doing ;)
There is already plenty of good documentation on how to handle encoding for text files in python, for instance:
http://docs.python.org/2/howto/unicode.html
But nobody reads them...
If you are lazy, you can set charset_error='ignore' in the CountVectorizer. That will result in invalid characters being discarded. This could mess up your text, as there are many accents.
That will just result in wrong features being extracted. You can hope that the dataset will be redundant enough to get good enough classification but that can also fail completely.
You have to know which charset your text files are encoded with. If you don't know, ask the person who gave you the data. If that person is no longer available, you can use a heuristic encoding sniffer such as the file unix command, but that is also prone to errors: you should decode manually with a text editor and check that the content makes sense.
There is chardet, an LGPL charset sniffer written in Python. Maybe I could write a how-to in the doc...
+1 for chardet. I use an in-house function that first tries the "normal" codecs and then uses chardet if those fail. chardet is slow, but if you know all of a dataset uses the same codec then it shouldn't matter too much.
I don't want to introduce a dependency on chardet, just show how to use it in a how to section of the text features doc.
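Something along the lines of the approach described above could go in that how-to (a hedged sketch, not @robertlayton's actual in-house function):

import chardet

def smart_decode(raw_bytes):
    # Try the cheap "normal" codecs first; only fall back to the slow chardet sniff.
    for codec in ('ascii', 'utf-8'):
        try:
            return raw_bytes.decode(codec)
        except UnicodeDecodeError:
            pass
    guess = chardet.detect(raw_bytes)  # returns a dict with 'encoding' and 'confidence'
    # chardet can return None when it has no idea; latin-1 always decodes as a last resort.
    return raw_bytes.decode(guess['encoding'] or 'latin-1', 'replace')

Running chardet on every document is what makes it slow; if the whole corpus is known to share one codec, sniffing a sample and reusing the result keeps the cost down.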
Chardet will not help anyway. For example, consider what would happen if the 20 Newsgroups data was run through chardet:
In a sense, sklearn is doing the wrong thing by reading 20 Newsgroups as Latin-1, because most of the messages that are not in ASCII are not in Latin-1 either.
But in another sense, it's fine, because the best one can do for extracting features is to assume they're all in some sort of single-byte encoding, and just distinguish features by what the bytes are. Decoding as Latin-1 (or any other single-byte encoding) accomplishes this.
If you need to make features out of arbitrary text, and you don't know the encoding of the text, decode it as if it were Latin-1.
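Concretely that amounts to something like the line below (raw_documents being a placeholder name for whatever byte strings are at hand):

# latin-1 maps every one of the 256 byte values to a code point, so this never
# raises and distinct bytes stay distinct; that is all the feature extractor needs.
texts = [raw.decode('latin-1') for raw in raw_documents]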
Alright, so I think we are fine letting the 20 newsgroups dataset loader use latin-1. I'll close this issue and open a new one to improve the documentation.
I have opened #2105. @rspeer please feel free to review and submit a PR on this if you are interested. I can help you get started with how the sphinx narrative doc is organized if you need.