lfmatosm / embedded-topic-model

A package to run embedded topic modelling with ETM. Adapted from the original at: https://github.com/adjidieng/ETM
MIT License
85 stars, 8 forks

[BUG] Testing on a single document #31

Open MaazBinMusa opened 1 month ago

MaazBinMusa commented 1 month ago

Describe the bug
Testing a trained model on a single document crashes.

To Reproduce
Steps to reproduce the behavior:

  1. Train a model
  2. Try to test it on 1 document

Reproduction example
I copied the code from the readme.md example. The only difference is that my train and test sets were not separate: I pulled 1 document from the train set and passed it as the input [test_doc] to the transform function.

joebhakim commented 1 week ago

I ran into the same thing. I think it comes from the stop_words arg. The documentation (I'm looking at v1.5.1) says:

If None, no stop words will be used. In this case, setting max_df to a higher value, such as in the range (0.7, 1.0), can automatically detect and filter stop words based on intra corpus document frequency of terms.

It might be that some corpora are too small for stop words to be inferred automatically. As a workaround, I'm just skipping that step in documents_without_stop_words.
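
To illustrate why a very small corpus could break frequency-based stop-word detection, here is a minimal pure-Python sketch of document-frequency pruning. It is not the package's code: `prune_by_document_frequency` is a hypothetical helper, and the thresholds `min_df=1` and `max_df=0.7` are assumed values chosen to match the range mentioned in the quoted documentation. With a single document, every term appears in 100% of the corpus, so any proportional `max_df` below 1.0 filters out the whole vocabulary.

```python
from collections import Counter


def prune_by_document_frequency(docs, min_df=1, max_df=0.7):
    """Hypothetical helper: keep terms whose document frequency is in [min_df, max_df].

    Integer thresholds are absolute document counts; float thresholds are
    proportions of the corpus size. This is an illustrative assumption,
    not the package's actual implementation.
    """
    n_docs = len(docs)
    # Count in how many documents each term appears (not total occurrences).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)


# Three documents: "the" appears in 3 of 3, which exceeds 0.7 * 3 = 2.1.
corpus = ["the cat sat", "the dog ran", "the bird flew"]
kept_corpus = prune_by_document_frequency(corpus)

# One document: every term appears in 1 of 1, which exceeds 0.7 * 1 = 0.7.
kept_single = prune_by_document_frequency(["the cat sat"])

print(kept_corpus)
print(kept_single)
```

In this sketch the three-document corpus loses only "the", while the single-document corpus ends up with an empty vocabulary. If the preprocessing step behaves similarly, that would be consistent with the crash reported above, and skipping the filter for single-document inputs would avoid it.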