Aligned Neural Topic Model (ANTM) for Exploring Evolving Topics: a dynamic neural topic model that uses document embeddings (data2vec) to compute clusters of semantically similar documents at different periods, and aligns document clusters to represent topic evolution.
I tried to run the code in Colab, but I got the following error:
contextual document embedding is initiated...
Pandas Apply: 100%
2000/2000 [23:34<00:00, 1.27it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
Summarizing a document with BART due to its Large length for Embedding...
Summarizing a document with BART due to its Large length for Embedding...
Summarizing a document with BART due to its Large length for Embedding...
Summarizing a document with BART due to its Large length for Embedding...
Sliding Window Segmentation is initialized...
Aligned Dimension Reduction is initialized...
Sequential Document-cluster association is initialized...
Cluster Alignment Procedure is initialized...
---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
<ipython-input-5-675c8f899d3c> in <cell line: 2>()
1 #learn the model and save it
----> 2 topics_per_period=model.fit(save=True)
3 #output is a list of timeframes including all the topics associated with that period
8 frames
/usr/local/lib/python3.10/dist-packages/nltk/data.py in find(resource_name, paths)
581 sep = "*" * 70
582 resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
--> 583 raise LookupError(resource_not_found)
584
585
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
The code I ran is the example from the README:
from antm import ANTM
import pandas as pd
# load data
df=pd.read_parquet("./data/dblpFullSchema_2000_2020_extract_big_data_2K.parquet")
df=df[["abstract","year"]].rename(columns={"abstract":"content","year":"time"})
df=df.dropna().sort_values("time").reset_index(drop=True).reset_index()
# choosing the windows size and overlapping length for time frames
window_size = 6
overlap = 2
#initialize model
model=ANTM(df,overlap,window_size,umap_n_neighbors=10, partioned_clusttering_size=5,mode="data2vec",num_words=10,path="./saved_data")
#learn the model and save it
topics_per_period=model.fit(save=True) # <------- the LookupError is raised by this call
#output is a list of timeframes including all the topics associated with that period
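The LookupError above reports that the NLTK resource punkt is missing and itself suggests running nltk.download('punkt'). Below is a minimal sketch of a cell that could be run before model.fit(save=True). The helper name ensure_nltk_resource is hypothetical (it is not part of ANTM or NLTK); it only wraps the two NLTK calls nltk.data.find and nltk.download, and it assumes the Colab runtime has network access.

```python
import nltk


def ensure_nltk_resource(resource_path: str, package: str) -> bool:
    """Download an NLTK package only if its resource is not installed yet.

    Hypothetical helper, not part of ANTM or NLTK.
    Returns True if a download was attempted, False if the resource was found.
    """
    try:
        # Raises LookupError when the resource is in none of the search paths.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        # Same call the error message recommends; quiet=True hides the progress log.
        nltk.download(package, quiet=True)
        return True
```

Calling ensure_nltk_resource("tokenizers/punkt", "punkt") once per runtime, before model.fit(save=True), should be enough if punkt is the only missing resource. A fresh Colab runtime starts without NLTK data, so the call would have to be repeated after a runtime reset.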