rapidsai / rapids-examples

cuBERTopic: Can't install rapids-21.12, so I switched to 22.06; I also changed the NVCC path; now I get import errors and then a 'Duplicate columns' error. #51

Closed ggnicolau closed 1 year ago

ggnicolau commented 2 years ago

Hi, I'm trying to use cuBERTopic.

I tried to install using the YAML file or the conda command provided by the repository. Neither worked, since conda can't find version 21.12. So I installed version 22.06 instead, with some adaptations for a VM on Google Cloud Platform using CUDA 11.0:

conda create -n rapids-22.06 -c rapidsai-nightly -c nvidia -c conda-forge \
    rapids=22.06 python=3.8 cudatoolkit=11.0
conda activate rapids-22.06
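
After installing, a quick sanity check that the environment works; a minimal sketch that just imports the RAPIDS libraries and prints their versions:

import cudf
import cuml

# Confirm the RAPIDS libraries import and report their versions
print(cudf.__version__)
print(cuml.__version__)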

But then I got an NVCC path warning while importing cuBERTopic, so I changed the beginning of the cuBERTopic.py file to point at my CUDA installation:

if "NVCC" not in os.environ:
    os.environ["NVCC"] = "/usr/local/cuda-11.0/bin/nvcc"
    warnings.warn(
        "NVCC Path not found, set to  : /usr/local/cuda-11.0/bin/nvcc . \nPlease set NVCC as appropitate to your environment"
    )
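
An alternative to editing the file, since the check only reads os.environ: set the variable in the session before the first cuBERTopic import. A minimal sketch, assuming the same CUDA path:

import os

# Point NVCC at the local CUDA toolkit before cuBERTopic's check runs,
# so cuBERTopic.py itself can stay unmodified
os.environ["NVCC"] = "/usr/local/cuda-11.0/bin/nvcc"

from cuBERTopic import gpu_BERTopic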

Then, when I try the imports as shown in the example notebook, I get AttributeError: 'NoneType' object has no attribute 'split' at cmd = _nvcc.split(). But the imports succeed if I change their order:

# cuBERTopic must be imported before torch/transformers to avoid the
# AttributeError above
from cuBERTopic import gpu_BERTopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from transformers import AutoTokenizer, AutoModel
import torch
import rmm
import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"
# Pre-allocate a 5 GB RMM memory pool on the GPU
rmm.reinitialize(pool_allocator=True, initial_pool_size=5e+9)

Then everything works and I can see the GPU being used while training on the example notebook, but at the end I get the following error during the TF-IDF step: ValueError: Duplicate column names are not allowed. The full traceback is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [11], in <cell line: 1>()
----> 1 topics, probs = topic_model.fit_transform(docs)

File ~/rapids-examples/cuBERT_topic_modelling/cuBERTopic.py:220, in gpu_BERTopic.fit_transform(self, data)
    217 del umap_embeddings
    219 # Topic representation
--> 220 tf_idf, count, labels = self.create_topics(documents)
    221 top_n_words, name_repr = self.extract_top_n_words_per_topic(
    222     tf_idf, count, labels, n=30
    223 )
    225 self.topic_sizes_df["Name"] = self.topic_sizes_df["Topic"].map(name_repr)

File ~/rapids-examples/cuBERT_topic_modelling/cuBERTopic.py:129, in gpu_BERTopic.create_topics(self, docs_df)
    117 """Extract topics from the clusters using a class-based TF-IDF
    118 Arguments:
    119     docs_df: DataFrame containing documents and other information
   (...)
    125     topic_labels: A list of unique topic labels
    126 """
    127 topic_labels = docs_df["Topic"].unique()
--> 129 tf_idf, vectorizer = self.new_c_tf_idf(docs_df, len(docs_df))
    130 return tf_idf, vectorizer, topic_labels

File ~/rapids-examples/cuBERT_topic_modelling/cuBERTopic.py:107, in gpu_BERTopic.new_c_tf_idf(self, document_df, m, ngram_range)
     90 """Calculate a class-based TF-IDF where m is the number of total documents.
     91 
     92 Arguments:
   (...)
    104     count: object of class CountVecWrapper
    105 """
    106 count = CountVecWrapper(ngram_range=ngram_range)
--> 107 X = count.fit_transform(document_df)
    108 multiplier = None
    110 transformer = ClassTFIDF().fit(X, n_samples=m, multiplier=multiplier)

File ~/rapids-examples/cuBERT_topic_modelling/vectorizer/vectorizer.py:54, in CountVecWrapper.fit_transform(self, docs_df)
     50 tokenized_df = self._create_tokenized_df(docs)
     51 self.vocabulary_ = tokenized_df["token"].unique()
     53 merged_count_df = (
---> 54     cudf.merge(tokenized_df, topic_df, how="left")
     55     .sort_values("Topic_ID")
     56     .rename({"Topic_ID": "doc_id"}, axis=1)
     57 )
     59 count_df = self._count_vocab(merged_count_df)
     61 # TODO: handle empty docids case later

File /opt/conda/envs/rapids-22.06/lib/python3.8/contextlib.py:75, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     72 @wraps(func)
     73 def inner(*args, **kwds):
     74     with self._recreate_cm():
---> 75         return func(*args, **kwds)

File /opt/conda/envs/rapids-22.06/lib/python3.8/site-packages/cudf/core/dataframe.py:2893, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   2890     out = DataFrame(index=self.index)
   2892 if columns:
-> 2893     out._data = self._data.rename_levels(mapper=columns, level=level)
   2894 else:
   2895     out._data = self._data.copy(deep=copy)

File /opt/conda/envs/rapids-22.06/lib/python3.8/site-packages/cudf/core/column_accessor.py:552, in ColumnAccessor.rename_levels(self, mapper, level)
    549         new_col_names = [mapper(col_name) for col_name in self.keys()]
    551     if len(new_col_names) != len(set(new_col_names)):
--> 552         raise ValueError("Duplicate column names are not allowed")
    554     ca = ColumnAccessor(
    555         dict(zip(new_col_names, self.values())),
    556         level_names=self.level_names,
    557         multiindex=self.multiindex,
    558     )
    560 return self.__class__(ca)

ValueError: Duplicate column names are not allowed
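
For reference, a minimal sketch of the same failure mode in cudf, with hypothetical column names: the merge yields a frame that already contains the rename target, so the rename collides.

import cudf

# Hypothetical frame in which the rename target ("doc_id") already exists
df = cudf.DataFrame({"Topic_ID": [0, 1], "doc_id": [10, 11]})
df.rename({"Topic_ID": "doc_id"}, axis=1)  # ValueError: Duplicate column names are not allowed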

I know I've made a bunch of significant changes here, one on top of another. But maybe you can help me get it working properly? :)

Wish you the best! Thank you for implementing BERTopic with RAPIDS!

ggnicolau commented 2 years ago

I've installed rapids-21.12 (I had to drop the '-nightly' suffix from the channel name to find it), then installed cudatoolkit 11.2 (it wouldn't install through conda or pip, so I installed it with wget). Now I'm getting the following error on the example notebook: temporary_buffer::allocate: get_temporary_buffer failed. CudaAPIError: [719] Call to cuLinkCreate results in CUDA_ERROR_LAUNCH_FAILED.

Then I tried a small sample (5,000 rows) from another dataset, but when I call get_topic I get the following error: TypeError: 'NoneType' object is not subscriptable.
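
The temporary_buffer::allocate failure looks like GPU memory exhaustion; one possible workaround, assuming that diagnosis, is to let RMM fall back to managed (unified) memory. A hedged sketch:

import rmm

# Managed memory lets allocations spill into host RAM, which can avoid
# device out-of-memory failures at some speed cost
rmm.reinitialize(managed_memory=True, pool_allocator=True)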

VibhuJawa commented 2 years ago

@ggnicolau, thanks a lot for trying out the repo and giving such a detailed description of the changes.

Let me look into this and update here. Thanks for your patience.

shashankgaur3 commented 2 years ago

I am facing a similar issue with the cuBERTopic fit_transform method. I was able to run topic modelling successfully with plain bertopic.

Error Log below:

topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/rapids/notebooks/rapids-examples/cuBERT_topic_modelling/cuBERTopic.py", line 220, in fit_transform
    tf_idf, count, labels = self.create_topics(documents)
  File "/rapids/notebooks/rapids-examples/cuBERT_topic_modelling/cuBERTopic.py", line 129, in create_topics
    tf_idf, vectorizer = self.new_c_tf_idf(docs_df, len(docs_df))
  File "/rapids/notebooks/rapids-examples/cuBERT_topic_modelling/cuBERTopic.py", line 107, in new_c_tf_idf
    X = count.fit_transform(document_df)
  File "/rapids/notebooks/rapids-examples/cuBERT_topic_modelling/vectorizer/vectorizer.py", line 54, in fit_transform
    cudf.merge(tokenized_df, topic_df, how="left")
  File "/opt/conda/envs/rapids/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cudf/core/dataframe.py", line 3179, in rename
    out._data = self._data.rename_levels(mapper=columns, level=level)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cudf/core/column_accessor.py", line 592, in rename_levels
    raise ValueError("Duplicate column names are not allowed")
ValueError: Duplicate column names are not allowed

VibhuJawa commented 1 year ago

We now recommend that users use upstream BERTopic directly, since it now supports RAPIDS. See below:

In the time since the blog post/code was released, the BERTopic library has added initial support for cuML. We recommend using cuML directly with BERTopic, which you can do by following the example below, drawn from the BERTopic documentation.


from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from sklearn.datasets import fetch_20newsgroups

# Example corpus: the 20 newsgroups dataset used earlier in this thread
docs = fetch_20newsgroups(subset="all")["data"]

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)
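
Once fitted, the usual BERTopic inspection methods apply; for example (a quick check, not part of the original snippet):

# Topic sizes and auto-generated names
print(topic_model.get_topic_info().head())

# Top words and scores for a given topic
print(topic_model.get_topic(0))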