sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License

Training the NaiveBayes Classifier using large JSON files #222

Open double-fault opened 6 years ago

double-fault commented 6 years ago

Hello!

I'm trying to use the Naive Bayes classifier included in TextBlob. When initialising the classifier, I pass it a file handle to a JSON file and specify the JSON format. This code reproduces my problem:

from textblob.classifiers import NaiveBayesClassifier as nbc

with open('data.json', 'r') as file_handle:
    _ = nbc(file_handle, format="json")

Now, data.json is a file of size 7.8 GB. After a couple of seconds, the program errors out:

Traceback (most recent call last):
  File "load_test.py", line 4, in <module>
    _ = nbc(file_handle, format="json")
  File "/usr/local/lib/python3.6/site-packages/textblob/classifiers.py", line 205, in __init__
    super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/textblob/classifiers.py", line 136, in __init__
    self.train_set = self._read_data(train_set, format)
  File "/usr/local/lib/python3.6/site-packages/textblob/classifiers.py", line 157, in _read_data
    return format_class(dataset, **self.format_kwargs).to_iterable()
  File "/usr/local/lib/python3.6/site-packages/textblob/formats.py", line 115, in __init__
    self.dict = json.load(fp)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
OSError: [Errno 22] Invalid argument

After some brief googling, it looks like this is happening because of the large size of the JSON file: `json.load()` tries to read the entire file with a single `fp.read()` call, which can fail with `OSError: [Errno 22]` on macOS for very large files. I could not find any clear-cut solution to the problem, though.

Considering that Naive Bayes classifiers are often trained on large data sets, supporting files of this size would be a necessity for many users. Could this issue please be rectified?

(If the data.json file is required to reproduce this issue, please let me know and I will upload it to a host. AFAIK, any reasonably large file (1GB?) should reproduce the issue.)
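As a possible workaround (a sketch only, not tested at the 7.8 GB scale described above), the file can be read in fixed-size chunks instead of one giant `fp.read()`, then parsed with `json.loads()` and passed to the classifier as pre-parsed training data. The helper name `load_large_json` and the chunk size are my own choices, not part of TextBlob's API:

```python
import json

def load_large_json(path, chunk_size=1 << 28):
    """Read a file in fixed-size chunks (256 MB by default) to avoid
    issuing a single multi-gigabyte fp.read(), which appears to be what
    triggers OSError: [Errno 22] here, then parse the joined text."""
    chunks = []
    with open(path, 'r') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(''.join(chunks))

# Assuming data.json is a list of {"text": ..., "label": ...} objects,
# the parsed data can be handed to the classifier directly instead of a
# file handle, bypassing TextBlob's internal json.load() call:
#
# data = load_large_json('data.json')
# train = [(d['text'], d['label']) for d in data]
# classifier = nbc(train)
```

This only sidesteps the single-read failure; the whole parsed structure still has to fit in memory, so a truly streaming JSON parser would be needed for data sets larger than RAM.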