Add analyze text #3

kbogas commented 3 years ago

This is in response to #3.

Added first working version for the textual component.

Models implemented:

TF-IDF
Latent Dirichlet Allocation (LDA)
Non-negative Matrix Factorization (NMF)

Added a simple evaluation step (printing the most similar pair of movies found) and restructured the code so that new models can be added modularly.
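As a rough illustration of the "most similar pair" check, here is a minimal sketch. It assumes cosine similarity as the measure and a plain dict of movie name to feature vector; neither is confirmed by this thread, and `most_similar_pair` and the toy data are hypothetical names invented for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_pair(features):
    """Return (name_a, name_b, score) for the highest-scoring pair.

    `features` maps a movie name to its feature vector (assumed layout).
    """
    best = None
    for (name_a, vec_a), (name_b, vec_b) in combinations(features.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if best is None or score > best[2]:
            best = (name_a, name_b, score)
    return best


# Toy vectors, made up for illustration only.
toy_features = {
    "movie_a": [1.0, 0.0, 2.0],
    "movie_b": [2.0, 0.0, 4.0],
    "movie_c": [0.0, 3.0, 0.0],
}
pair = most_similar_pair(toy_features)
print(pair[0], pair[1])
```

Comparing every pair is quadratic in the number of movies, which is fine for a quick sanity check but would need rethinking for a large collection.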

The whole module is driven by a very basic settings.yaml that holds the data paths, the models to use, their hyperparameters, etc.

For the time being, I made the assumption that we persist both the features and the models (if needed) to a local folder.

This means that the generated features for each model are saved in a .csv file in the output folder, alongside the corresponding pickled model if needed.
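A minimal sketch of that persistence step, using only the standard library. The `<model>_features.csv` naming follows the output files mentioned in this thread; the `persist_outputs` helper, the `.pkl` filename, and the row layout (filename followed by the vector values) are assumptions made for illustration, not the module's actual code.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def persist_outputs(output_dir, model_name, features, model=None):
    """Write <model_name>_features.csv and, if a model is given, pickle it too."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One row per input file: the filename, then the feature values.
    csv_path = out / f"{model_name}_features.csv"
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for filename, vector in features.items():
            writer.writerow([filename, *vector])

    # The pickle name is hypothetical; only models that need persisting get one.
    model_path = None
    if model is not None:
        model_path = out / f"{model_name}_model.pkl"
        with model_path.open("wb") as handle:
            pickle.dump(model, handle)
    return csv_path, model_path


# Usage with a throwaway folder and a stand-in "model" object.
with tempfile.TemporaryDirectory() as tmp:
    csv_path, model_path = persist_outputs(
        tmp, "lda", {"movie_a.txt": [0.1, 0.9]}, model={"n_topics": 2}
    )
    saved_rows = csv_path.read_text().splitlines()
    with model_path.open("rb") as handle:
        restored_model = pickle.load(handle)
```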

@tyiannak We will need to discuss a formal way of handling features, persistence, evaluation, testing, etc. across the module.

tyiannak commented 3 years ago

I checked it and it ran as expected.

My only comment is: shouldn't we return all extracted features from extract_features(), e.g. as a dict of filenames --> feature vectors? Otherwise, one currently needs to load the CSVs (output/lda_features.csv, output/nmf_features.csv, output/tfidf_features.csv).
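One possible shape for that return value, sketched under assumptions: the signature below is hypothetical (the real extract_features() arguments are not shown in this thread), and `toy_extractor` is a stand-in rather than an actual TF-IDF, LDA, or NMF model.

```python
def extract_features(documents, extractors):
    """Return {model_name: {filename: feature_vector}}.

    `documents` maps filename -> raw text; `extractors` maps a model name to a
    callable that turns text into a list of floats. Both layouts are assumed.
    """
    results = {}
    for model_name, extractor in extractors.items():
        results[model_name] = {
            filename: extractor(text) for filename, text in documents.items()
        }
    return results


def toy_extractor(text):
    """Stand-in extractor: word count and character count only."""
    return [float(len(text.split())), float(len(text))]


features = extract_features(
    {"movie_a.txt": "a short plot", "movie_b.txt": "another plot"},
    {"toy": toy_extractor},
)
print(features["toy"]["movie_a.txt"])
```

With a structure like this, callers could use the vectors directly and treat the CSV files as an optional cache.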

pakoromilas commented 3 years ago

Works fine for me too.

tyiannak / multimodal_movie_analysis

Add analyze text #3 #8