lfmatosm / embedded-topic-model

A package to run embedded topic modelling with ETM. Adapted from the original at: https://github.com/adjidieng/ETM
MIT License

iterating through dictionary having multiple lists #11

Closed asma-ui closed 1 year ago

asma-ui commented 1 year ago

I am training an ETM topic model on a dataset, but when fitting the ETM instance I have to pass train, which is a dictionary of numpy arrays, and I don't know how to iterate through it inside a function when I call:

etm_instance = ETM(
    doc,
    embeddings=word_vectors,  # You can pass here the path to a word2vec file or
                              # a KeyedVectors instance
    num_topics=8,
    epochs=300,
    debug_mode=True,
    train_embeddings=False,   # Optional. If True, ETM will learn word embeddings jointly with
                              # topic embeddings. By default, is False. If the 'embeddings' argument
                              # is being passed, this argument must not be True
)
etm_instance.fit(train)

This is the unchanged code, which gives: ERROR: list indices must be integers or slices, not str

If I write it this way:

for n, count in enumerate(train):
    etm_instance.fit(train[n][count])

ERROR: unhashable type: 'dict'

If I write it this way:

for i in range(len(train)):
    etm_instance.fit(train[i]['tokens'])

ERROR: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

The train dictionary looks like this: https://i.stack.imgur.com/yLJ3J.jpg

lfmatosm commented 1 year ago

Hi @asma-ui, sorry for the late response and thanks for using the package!

If I understand correctly, you are trying to train an ETM instance on each element inside your array. Do your array elements have the expected format?

When you call etm_instance.fit, you must pass a reference to a dictionary in the same format as the one output by the preprocessing.create_etm_datasets method. This should be a dictionary with two keys representing a bag-of-words corpus: tokens (the BOW representation of each document's words) and counts (the corresponding word counts).
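To make the expected shape concrete, here is a hand-built sketch of such a dictionary. The values are made up for illustration; in practice the arrays are produced by preprocessing.create_etm_datasets (as numpy arrays, shown here as plain lists), and the whole dictionary is passed to fit in a single call, with no loop over its items:

```python
# Hypothetical, hand-built example of the dictionary format fit() expects.
# Real values come from preprocessing.create_etm_datasets and are numpy
# arrays; plain lists are used here purely for illustration.
train = {
    # one entry per document: vocabulary indices of the words it contains
    "tokens": [[0, 2, 5], [1, 2]],
    # parallel entry per document: occurrence count of each of those words
    "counts": [[3, 1, 2], [1, 4]],
}

# The whole dictionary is passed in one call -- no iteration over its items:
# etm_instance.fit(train)
```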

Take a look at the README for a better understanding, but basically, you should call this method to create the BOW representations of your dataset that ETM expects:

from embedded_topic_model.utils import preprocessing
import json

# Loading a dataset in JSON format. As said, documents must be composed of string sentences
corpus_file = 'datasets/example_dataset.json'
documents_raw = json.load(open(corpus_file, 'r'))
documents = [document['body'] for document in documents_raw]

# Preprocessing the dataset. train_dataset is a dict with two keys: tokens and counts
vocabulary, train_dataset, _ = preprocessing.create_etm_datasets(
    documents,  # here you will put your original documents
    min_df=0.01, 
    max_df=0.75, 
    train_size=0.85, 
)
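The loading step above assumes a JSON file shaped like the following minimal, made-up example, where each object's body field holds one document as a plain string:

```python
import json
import os
import tempfile

# Minimal stand-in for datasets/example_dataset.json: a list of objects,
# each with a "body" field containing one document as a plain string.
example = [
    {"body": "topic models learn word distributions"},
    {"body": "embeddings capture word similarity"},
]

# Write it out, then reload it the same way the snippet above does
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(example, f)
    path = f.name

documents_raw = json.load(open(path, "r"))
documents = [document["body"] for document in documents_raw]
os.remove(path)
```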

If you have any further questions, feel free to respond.

Cheers, Luiz

lfmatosm commented 1 year ago

Closing issue as stale. @asma-ui if you need any further help, report back.