sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License

ValueError when loading classifier training data #406

Closed bwareham closed 2 years ago

bwareham commented 2 years ago

I am trying to create and train a DecisionTreeClassifier from a JSON file:

cl = DecisionTreeClassifier("output/train_data.json", format="json")

But I get the following error:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/user/apps/myapp/env/lib/python3.10/site-packages/textblob/classifiers.py", line 205, in __init__
    super(NLTKClassifier, self).__init__(train_set, feature_extractor, format, **kwargs)
  File "/home/user/apps/myapp/env/lib/python3.10/site-packages/textblob/classifiers.py", line 139, in __init__
    self._word_set = _get_words_from_dataset(self.train_set)  # Keep a hidden set of unique words.
  File "/home/user/apps/myapp/env/lib/python3.10/site-packages/textblob/classifiers.py", line 63, in _get_words_from_dataset
    return set(all_words)
  File "/home/user/apps/myapp/env/lib/python3.10/site-packages/textblob/classifiers.py", line 62, in <genexpr>
    all_words = chain.from_iterable(tokenize(words) for words, _ in dataset)
ValueError: not enough values to unpack (expected 2, got 1)

This looks like a problem with my JSON data, but the file appears to conform to the format the documentation requires:

[
    {
        "text": "artistname: albumtitle review \u2013 party like it\u2019s 2002. High-energy bangers follow one after the other as the Canadian returns to her pop-punk roots",
        "label": "True"
    },
    {
        "text": "DOWNLOAD FULL ALBUM: artistname \u2013 albumtitle (Zip File) | AbokiMusic. artistname\u00a0has dropped a brand new music album titled\u00a0artistname albumtitle album zip download and you can download full album below",
        "label": "False"
    },
    {
        "text": "ALBUM REVIEW: albumtitle - artistname - Distorted Sound Magazine. Jack Moar reviews the third album from Italian doom metallers artistname. Read the review of albumtitle here on Distorted Sound Magazine!",
        "label": "True"
    },
    {
        "text": "'The Voice' Winner artistname Releases 'albumtitle' - Talent Recap. The Voice - 'The Voice' season four winner, artistname reminisces on her nine years in the music industry with her new album.",
        "label": "True"
    },
    {
        "text": "artistname publica su nuevo y primer disco \u2018albumtitle. KORApublica\u2019albumtitle\u2019su primer discoLa joven artista catalana acaba de lanzar su nuevo y primer disco \u2018albumtitle\u2019.\u00a0Esc\u00fachalo aqu\u00edTras la publicaci\u00f3n de\u00a0\u2026",
        "label": "False"
    }
…
]
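One way to rule out the data itself is to check its shape independently of TextBlob. The following is only a sketch: the inline `sample` is a shortened stand-in for the file above, and `check_records` is a hypothetical helper, not part of TextBlob.

```python
import json

# Shortened stand-in for output/train_data.json (assumption: same shape as above).
sample = """
[
    {"text": "ALBUM REVIEW: albumtitle - artistname", "label": "True"},
    {"text": "DOWNLOAD FULL ALBUM: artistname", "label": "False"}
]
"""


def check_records(records):
    """Return the indexes of entries that lack a "text" or "label" key."""
    return [
        i
        for i, rec in enumerate(records)
        if not isinstance(rec, dict) or "text" not in rec or "label" not in rec
    ]


records = json.loads(sample)
bad = check_records(records)
print(len(records), "records;", "malformed indexes:", bad)
```

An empty list of malformed indexes suggests the JSON is fine and the problem lies in how the data is handed to the classifier.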

A similar issue was raised and closed without explanation.

bwareham commented 2 years ago

I figured it out: I was passing the path to the training data file as a string instead of an open file object. It worked when I opened the file first with 'with open(...)', like so:

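from textblob.classifiers import DecisionTreeClassifier
# Pass an open file object, not the path string; read-only mode is enough.
with open("output/train_data.json", "r") as file:
    cl = DecisionTreeClassifier(file, format="json")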

The need to do this wasn’t obvious to me from the documentation, especially since the constructor did process something: when I ran cl.labels() it returned "label", so, as the error noted, it was retrieving one value, just not the one I wanted. It may be worth clarifying in a future update.
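For anyone hitting the same traceback, the error can be reproduced without TextBlob. This is a minimal sketch, assuming (as the traceback suggests) that an argument which is not an open file is iterated directly as if it were a list of (text, label) pairs. `unpack_pairs` is a hypothetical stand-in for that step, not TextBlob's actual code.

```python
def unpack_pairs(dataset):
    """Hypothetical stand-in for the unpacking step seen in the traceback."""
    return [words for words, _ in dataset]


# A list of (text, label) tuples unpacks as expected.
pairs = [("great album", "True"), ("zip download", "False")]
texts = unpack_pairs(pairs)

# A path string is iterated one character at a time, and a
# one-character string cannot be unpacked into two names.
try:
    unpack_pairs("output/train_data.json")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(texts)
print(error_message)
```

Under that assumption, the message "expected 2, got 1" refers to a single character of the path, not to a record in the JSON file.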