mesolitica / malaya

Natural Language Toolkit for Malaysian language, https://malaya.readthedocs.io/
MIT License

Outdated method for `vectorizer.get_feature_names()` due to `scikit-learn>=1.2` #211

Closed: wanadzhar913 closed this issue 4 months ago

wanadzhar913 commented 4 months ago

Hi Husein,

Hope you're doing well! I was messing around with Malaya's topic modelling module and happened upon the error below.

Digging deeper into scikit-learn's documentation, I found that the `get_feature_names` method was deprecated in version 1.0 and removed in 1.2, replaced by `get_feature_names_out`. [Link](https://scikit-learn.org/1.1/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#:~:text=document%2Dterm%20matrix.-,get_feature_names(),get_feature_names%20is%20deprecated%20in%201.0%20and%20will%20be%20removed%20in%201.2.,-get_feature_names_out(%5Binput_features%5D)) for your reference.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-7-23edf621a026>](https://localhost:8080/#) in <cell line: 1>()
----> 1 lda = malaya.topic_model.decomposition.fit(
      2     stem_output_hfmodel,
      3     LatentDirichletAllocation,
      4     vectorizer = vectorizer,
      5     n_topics = 10,

[/usr/local/lib/python3.10/dist-packages/malaya/topic_model/decomposition.py](https://localhost:8080/#) in fit(corpus, model, vectorizer, n_topics, cleaning, stopwords, **kwargs)
    179 
    180     tf = vectorizer.fit_transform(corpus)
--> 181     tf_features = vectorizer.get_feature_names()
    182     compose = model(n_topics).fit(tf)
    183     return Topic(

AttributeError: 'SkipGramCountVectorizer' object has no attribute 'get_feature_names'
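A backwards-compatible way to handle this in `decomposition.fit` could be to branch on whichever method the vectorizer exposes. Just a sketch of the idea (reusing the `vectorizer`/`tf_features` names from the traceback), not a final patch:

```python
# Sketch: prefer the newer scikit-learn API and fall back to the old one.
# get_feature_names_out() exists from scikit-learn 1.0 onwards;
# get_feature_names() was removed in 1.2, which triggers the AttributeError above.
if hasattr(vectorizer, 'get_feature_names_out'):
    tf_features = vectorizer.get_feature_names_out()
else:
    tf_features = vectorizer.get_feature_names()
```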

Here's my code for reproducibility:

import malaya
from malaya.text.vectorizer import SkipGramCountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

stopwords = malaya.text.function.get_stopwords()

documents = [
    "Emas",
    "Hm, moga jadi kahwin",
    "Naaaakk kahwin",
    "Nikah 25/2/25",
    "Universiti anak",
    "My Marriage Story",
    "Yuran Sekolah Fatimah",
    "em@s.com",
    "beli kereta",
    "car insurance",
    "my college fund"
]

# Stem documents with a Huggingface model
hfmodel_stem = malaya.stem.huggingface()

stem_output_hfmodel = []

for j in documents:
    stem_output_hfmodel.append(hfmodel_stem.stem(j))

# Load vectorizer object
vectorizer = SkipGramCountVectorizer(
    max_df = 0.95,
    min_df = 1,
    ngram_range = (1, 3),
    stop_words = stopwords,
    skip = 2,
)

# Create LDA object (error found here)
lda = malaya.topic_model.decomposition.fit(
    stem_output_hfmodel,
    LatentDirichletAllocation,
    vectorizer = vectorizer,
    n_topics = 10,
)
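For what it's worth, the removal is reproducible with scikit-learn's own `CountVectorizer` too, so it's the scikit-learn version rather than anything specific to Malaya's vectorizers (a minimal check, assuming scikit-learn 1.2.x as pinned below):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer().fit(["beli kereta", "car insurance"])
print(cv.get_feature_names_out())  # available since scikit-learn 1.0
# cv.get_feature_names()           # AttributeError on scikit-learn >= 1.2 (method removed)
```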

Below is my requirements.txt:

dateparser==1.2.0
scikit-learn==1.2.2
requests==2.31.0
unidecode==1.3.8
numpy==1.25.2
scipy==1.11.4
ftfy==6.2.0
networkx==3.3
sentencepiece==0.1.99
tqdm==4.66.4
malaya-boilerplate==0.0.25
regex==2024.5.15
transformers==4.42.4
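In the meantime I'm assuming that pinning scikit-learn below 1.2 (where `get_feature_names()` still exists, albeit deprecated) works as a temporary workaround, e.g. in requirements.txt:

```
scikit-learn<1.2
```

I haven't tested Malaya against older scikit-learn releases, so treat that as a stop-gap only.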
wanadzhar913 commented 4 months ago

Added the minor fix above. Let me know if it's okay.