I'm trying to use the Naive Bayes classifier included in TextBlob. When initialising the classifier, I pass a file handle to a JSON file and specify the JSON format. This code reproduces my problem:
from textblob.classifiers import NaiveBayesClassifier as nbc
with open('data.json', 'r') as file_handle:
    _ = nbc(file_handle, format="json")
Now, data.json is a file of size 7.8 GB. After a couple of seconds, the program errors out:
Traceback (most recent call last):
  File "load_test.py", line 4, in <module>
    _ = nbc(file_handle, format="json")
  File "/usr/local/lib/python3.6/site-packages/textblob/classifiers.py", line 205, in __init__
    super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/textblob/classifiers.py", line 136, in __init__
    self.train_set = self._read_data(train_set, format)
  File "/usr/local/lib/python3.6/site-packages/textblob/classifiers.py", line 157, in _read_data
    return format_class(dataset, **self.format_kwargs).to_iterable()
  File "/usr/local/lib/python3.6/site-packages/textblob/formats.py", line 115, in __init__
    self.dict = json.load(fp)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
OSError: [Errno 22] Invalid argument
After some brief googling, this appears to be caused by the file's size: as the traceback shows, json.load() tries to slurp the entire file with a single fp.read() call, which fails for a file this large. I could not find any clear-cut solution to the problem, though.
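One possible workaround (a sketch, assuming the parsed data still fits in RAM and that the single oversized fp.read() call in the traceback is indeed the failing step) is to read the file in smaller chunks before parsing it with json.loads:

```python
import json

def load_large_json(path, chunk_size=1 << 26):
    """Read a JSON file in chunks (64 MB by default) to avoid one huge fp.read()."""
    parts = []
    with open(path, 'r') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            parts.append(chunk)
    # Parse the reassembled text in one go.
    return json.loads(''.join(parts))
```

The parsed data could then be handed to the classifier as a list of (text, label) tuples instead of a file handle, e.g. `nbc([(d['text'], d['label']) for d in load_large_json('data.json')])`, assuming the JSON follows the `[{"text": ..., "label": ...}]` layout that the json format expects.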
Considering that Naive Bayes classifiers are often trained on large datasets, support for large training files would be valuable for many users. Could this issue please be rectified?
(If the data.json file is required to reproduce this issue, please let me know and I will upload it to a host. AFAIK, any sufficiently large file (1 GB+?) should reproduce the problem.)