Closed asma-ui closed 1 year ago
Hi @asma-ui, sorry for the late response and thanks for using the package!
If I understand correctly, you must train an ETM instance for each element inside your array. Do your array elements have the expected format?
When you call etm_instance.fit, you must pass a dictionary in the same format as the one returned by the preprocessing.create_etm_datasets method. This should be a dictionary with two keys that together represent a bag-of-words (BOW) corpus: tokens (the BOW representation of the document words) and counts (the sum of counts for each word).
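As a rough illustration of that format, the sketch below builds a tiny dictionary by hand and checks its shape before it would be passed to fit. The helper name check_bow_dataset and the toy values are hypothetical, and the per-document layout (one array of word indices and one array of counts per document) is an assumption rather than something taken from the package.

```python
import numpy as np


def check_bow_dataset(dataset):
    """Return True if `dataset` looks like a tokens/counts BOW dictionary.

    Hypothetical helper, not part of embedded_topic_model.
    """
    if not isinstance(dataset, dict):
        return False
    if "tokens" not in dataset or "counts" not in dataset:
        return False
    tokens, counts = dataset["tokens"], dataset["counts"]
    if len(tokens) != len(counts):
        return False
    # Assumed layout: every document has exactly one count per token index.
    return all(len(t) == len(c) for t, c in zip(tokens, counts))


# Toy corpus with two documents (values are made up).
toy_dataset = {
    "tokens": [np.array([0, 3, 5]), np.array([1, 3])],
    "counts": [np.array([2, 1, 1]), np.array([1, 4])],
}

print(check_bow_dataset(toy_dataset))    # the dictionary itself
print(check_bow_dataset([toy_dataset]))  # a list wrapping the dictionary
```

A check like this can help tell apart "I passed the dictionary" from "I passed a list that contains the dictionary", which is a common source of indexing errors.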
Take a look at the README for a better understanding, but basically, you should call this method to create the BOW representations of your dataset expected by ETM:
from embedded_topic_model.utils import preprocessing
import json
# Load a dataset in JSON format. As noted, each document must be a plain string of sentences
corpus_file = 'datasets/example_dataset.json'
with open(corpus_file, 'r') as f:
    documents_raw = json.load(f)
documents = [document['body'] for document in documents_raw]
# Preprocess the dataset. train_dataset is a dict with two keys: tokens and counts
vocabulary, train_dataset, _, = preprocessing.create_etm_datasets(
    documents,  # put your original documents here
    min_df=0.01,
    max_df=0.75,
    train_size=0.85,
)
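If you really do have several datasets in an array, one way to train one model per element could look like the sketch below. It is only an outline: fit_one_model_per_dataset and RecordingModel are hypothetical names, and RecordingModel is a stand-in so the loop can run without the package installed. In practice make_model would be a callable that builds an ETM instance as described in the README.

```python
class RecordingModel:
    """Stand-in for a topic model; it only remembers what it was fitted on."""

    def __init__(self):
        self.fitted_on = None

    def fit(self, dataset):
        self.fitted_on = dataset
        return self


def fit_one_model_per_dataset(datasets, make_model):
    """Fit a fresh model on every dictionary in `datasets`.

    `make_model` is a zero-argument callable returning an object that
    exposes a `fit(dataset)` method.
    """
    models = []
    for index, dataset in enumerate(datasets):
        # Fail early with a readable message if an element is not a dict.
        if not isinstance(dataset, dict):
            raise TypeError(
                f"datasets[{index}] is {type(dataset).__name__}, expected dict"
            )
        model = make_model()
        model.fit(dataset)
        models.append(model)
    return models


# Two toy datasets (values are made up).
toy_datasets = [
    {"tokens": [[0, 2]], "counts": [[1, 3]]},
    {"tokens": [[1]], "counts": [[2]]},
]
models = fit_one_model_per_dataset(toy_datasets, RecordingModel)
print(len(models))
```

The point of the loop is that fit receives each dictionary on its own, never the enclosing list.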
If you have any further questions, feel free to respond.
Cheers, Luiz
Closing issue as stale. @asma-ui if you need any further help, report back.
I am training an ETM topic model on a dataset, but while fitting the ETM instance I have to iterate through train, which is a dictionary of NumPy arrays. I don't know how to iterate through it inside a function when I call
This is the unchanged code, which gives:
ERROR: list indices must be integers or slices, not str
If I write it this way, I get:
ERROR: unhashable type: 'dict'
If I write it this way, I get:
ERROR: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
The train dictionary looks like this: https://i.stack.imgur.com/yLJ3J.jpg
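The snippets that produced these errors are not preserved here, so the following is only a guess at how each message typically arises, using made-up data. The names train, datasets and error_message are illustrative, and the layout of train is an assumption.

```python
import numpy as np

# Toy stand-in for the train dictionary (values are made up).
train = {"tokens": np.array([0, 3, 5]), "counts": np.array([2, 1, 1])}
datasets = [train]  # a list that wraps the dictionary


def error_message(action):
    """Run `action` and return 'ExceptionType: message' for what it raises."""
    try:
        action()
    except (TypeError, IndexError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "no error"


# 1. Indexing the list with a string key instead of an integer position.
print(error_message(lambda: datasets["tokens"]))

# 2. Using the dictionary itself as a key into another dictionary.
print(error_message(lambda: {}[train]))

# 3. Indexing a NumPy array with a string key.
print(error_message(lambda: train["tokens"]["counts"]))

# Indexing the list first and the dictionary second works.
print(datasets[0]["tokens"])
```

In each case the string key is being applied to the wrong container, so checking type(train) and type(train["tokens"]) before indexing is usually the quickest way to see which level is a list, a dict, or an array.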