lfmatosm / embedded-topic-model

A package to run embedded topic modelling with ETM. Adapted from the original at: https://github.com/adjidieng/ETM
MIT License

Example of using trained model to predict topics for new documents? #10

Closed marcmaxson closed 1 year ago

marcmaxson commented 1 year ago

I want to use this to train a model and then predict topics for new documents, but I don't see an example of this, and I can't find any method on your Model or ETM classes that does it, such as a .transform() method. How do you use this in production?

umarIft commented 1 year ago

I am facing the same issue. Having gone through the same steps as @marcmaxson, the question remains: how can we use the trained model to predict topics for documents the model has not seen? For a trained LDA model this was rather simple; here it is less intuitive. @lffloyd, any help is much appreciated.

Idan-Garay commented 1 year ago

Same here. etmModelInstance.fit() has a test_data parameter, but it is only used for evaluating perplexity, so I also have no idea how to predict with the model. Hi @lffloyd, is there a way to predict using the model instance?

lfmatosm commented 1 year ago

Hi @marcmaxson and @umarIft. Sorry for the (very) late response and thanks for your report.

To the best of my knowledge, the original code did not implement prediction or transformation. I ran into the same problem when exploring the original implementation. This package was created to facilitate research, and for that purpose prediction was not something I originally needed. As such, there is currently no method for doing that.

However, such a method is clearly needed, so I've started experimenting with it. I will report back here when the next release addressing this is ready.

Cheers, Luiz

lfmatosm commented 1 year ago

Hi @marcmaxson and @umarIft, PR #14 adds the transform method and #15 updates the documentation with usage.

The next release will be created soon.

lfmatosm commented 1 year ago

@marcmaxson and @umarIft, release 1.1.0 is now available at https://pypi.org/project/embedded-topic-model/

You can now find an example of topic prediction in the README.md:

import json

from embedded_topic_model.utils import preprocessing
from embedded_topic_model.models.etm import ETM

corpus_file = 'datasets/example_dataset.json'
# Load the raw corpus; each entry is expected to have a 'body' field with the document text
with open(corpus_file, 'r') as corpus:
    documents_raw = json.load(corpus)
documents = [document['body'] for document in documents_raw]

# Splits into train/test datasets
train = documents[:len(documents)-100]
test = documents[len(documents)-100:]

# Model fitting (elided here; this step produces the `vocabulary` and the fitted `etm_instance` used below)
# ...

# The vocabulary must be the same one created during preprocessing of the training dataset (see above)
preprocessed_test = preprocessing.create_bow_dataset(test, vocabulary)
# Transforms the test dataset and returns the normalized document-topic distribution
test_d_t_dist = etm_instance.transform(preprocessed_test)
print(f'test_d_t_dist: {test_d_t_dist}')
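
As a rough illustration of what you might do with that result, here is a minimal sketch that picks the most probable topics per document. It assumes the distribution has one row per document and one column per topic, with each row summing to 1; the exact return type of transform is not shown in this thread, so the sketch works on plain nested lists (if you get an array or tensor back, calling .tolist() on it first should be enough). The `doc_topic_dist` values and the `dominant_topics` helper are hypothetical and not part of the library.

```python
# Hypothetical stand-in for the output of etm_instance.transform(...):
# one row per document, one column per topic, each row summing to 1.
doc_topic_dist = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.25, 0.45, 0.30],
]


def dominant_topics(dist, top_k=1):
    """Return, for each document, the indices of its top_k most probable topics.

    This is an illustrative helper, not a function provided by
    embedded-topic-model.
    """
    result = []
    for row in dist:
        # Rank topic indices by probability, highest first
        ranked = sorted(range(len(row)), key=lambda topic: row[topic], reverse=True)
        result.append(ranked[:top_k])
    return result


# Single most probable topic for each document
top1 = [topics[0] for topics in dominant_topics(doc_topic_dist)]
print(top1)  # → [0, 2, 1]
```

Keeping top_k above 1 can be useful when documents mix several themes and a single label would hide that.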

However, if possible, I suggest updating to the latest version, 1.2.1, which also brings dependency updates that make the library compatible with torch>=2.0 and newer Python versions.