koaning / cluestar

Gain clues from clustering!
https://koaning.github.io/cluestar/
MIT License

ValueError: You must give this preprocessor text as input. #1

Closed ggnicolau closed 2 years ago

ggnicolau commented 2 years ago

Hi. I'm providing the same type of data and using the same code as in the example notebook, but I get the following error: ValueError: You must give this preprocessor text as input. I tried to debug it but couldn't work out how to solve it. Any clues as to what's going on?


# Transforming data:
import re
import pandas as pd
import nltk
df = pd.read_csv('buscas_consecutivas.csv', sep = ',')
df["full"] = df["query"] + ' ' + df["next_query"]
rule = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
texts = df.loc[:5000, "full"].apply(lambda x: re.split(rule, x))
texts = [''.join(ele) for ele in texts]

# Transformed data:
texts[:3]
['lg k 22 celular lg k 22',
 'estantes estantes de acoestante colorida',
 'porquinho cozinha porquinho saleiro']

# Running the pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import umap
from whatlies.language import UniversalSentenceLanguage
from cluestar import plot_text
pipe = make_pipeline(TfidfVectorizer(),
                     UniversalSentenceLanguage(variant='multi'),
                     umap.UMAP()
                    )

X = pipe.fit_transform(texts)

plot_text(X, texts, color_words=["vinho", "camisa", "tinta", "whiskey"])
koaning commented 2 years ago

I think you're doing something different from the example.

The pipeline below passes TF-IDF vectors, rather than the plain texts, to the universal sentence encoder, which indeed cannot handle them.

pipe = make_pipeline(TfidfVectorizer(),
                     UniversalSentenceLanguage(variant='multi'),
                     umap.UMAP()
                    )

This should work, though:

pipe = make_pipeline(
     UniversalSentenceLanguage(variant='multi'),
     umap.UMAP()
)
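
To illustrate why the order matters, here is a minimal sketch using toy stand-ins. ToyVectorizer, ToyTextEncoder and run_pipeline are hypothetical names made up for this example; they are not the real scikit-learn or whatlies classes and only mimic the behaviour described above: each step receives the previous step's output, so an encoder that expects strings fails once a vectorizer runs before it.

```python
# Toy stand-ins (hypothetical, NOT the real sklearn/whatlies classes) showing
# how a pipeline hands each step's output to the next step.

class ToyVectorizer:
    """Turns texts into numeric count vectors, roughly like a TF-IDF step."""

    def fit_transform(self, texts):
        vocab = sorted({word for text in texts for word in text.split()})
        return [[text.split().count(word) for word in vocab] for text in texts]


class ToyTextEncoder:
    """Accepts only raw strings, roughly like a sentence encoder."""

    def fit_transform(self, texts):
        if not all(isinstance(text, str) for text in texts):
            raise ValueError("You must give this preprocessor text as input.")
        # Assumption: a one-number "embedding" (the text length) is enough here.
        return [[float(len(text))] for text in texts]


def run_pipeline(steps, data):
    """Feed the data through each step in order, as a pipeline would."""
    for step in steps:
        data = step.fit_transform(data)
    return data


texts = ["lg k 22 celular lg k 22", "porquinho cozinha porquinho saleiro"]

# Vectorizer first: the encoder receives lists of numbers and raises.
try:
    run_pipeline([ToyVectorizer(), ToyTextEncoder()], texts)
    failed = False
except ValueError as err:
    failed = True
    print("vectorizer first:", err)

# Encoder first: it receives the raw strings and succeeds.
embedded = run_pipeline([ToyTextEncoder()], texts)
print("encoder first:", embedded)
```

The same reasoning applies to the real pipeline: whichever step comes first is the only one that sees the raw texts.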
ggnicolau commented 2 years ago

OMG sorry, obviously! Lack of attention! It works!