Experiment Intervention predictions using topic mixture on GPHIN data

ImaneChafi commented 4 years ago

For the new GPHIN data created by the GPHIN scraper, run topic-mixture experiments with ETM, METM, D-ETM and DM-ETM.

ImaneChafi commented 4 years ago

[x] METM - Works : https://github.com/li-lab-mcgill/covid19_media/commit/a4f6073fc9ae763bbe5975a337123ddbe6bdc610
[x] DM-ETM - Pending
[x] D-ETM - Works, but NaN values appear
[x] ETM - Works, used this repo for reference : https://github.com/adjidieng/ETM
[ ] S-DETM - Pending

ImaneChafi commented 4 years ago

Let’s do thorough experiments to compare DM-ETM and D-ETM with various settings on the 4 datasets (Aylien, GPHIN, GPHIN online parse, WHO):

[ ] D-ETM without pre-trained word embeddings
[ ] D-ETM with pre-trained word embeddings (skipgram_emb_300d.txt, not Aylien)
[ ] DM-ETM without pre-trained word embeddings and pre-trained source embeddings
[ ] DM-ETM with pre-trained word embeddings but without pre-trained source embeddings
[ ] DM-ETM with both pre-trained word embeddings and pre-trained source embeddings

Try these numbers of topics: {10, 20, 30, 40, 50}. Run each model for 100 epochs with annealing enabled, and choose the best model based on the validation perplexity (val ppl).
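A minimal sketch of how the run grid and the val-ppl selection could be organized. The dict keys and function names are hypothetical, not taken from the repo, and the third setting reads "without pre-trained word embeddings and pre-trained source embeddings" as "neither", which is an assumption. NaN values are skipped when picking the best epoch, since D-ETM was reported to produce them.

```python
import math
from itertools import product

# Datasets, settings and topic counts as listed in this issue.
DATASETS = ["Aylien", "GPHIN", "GPHIN online parse", "WHO"]
# (model, pre-trained word embeddings, pre-trained source embeddings)
SETTINGS = [
    ("D-ETM", False, False),
    ("D-ETM", True, False),
    ("DM-ETM", False, False),  # assumption: neither embedding is pre-trained
    ("DM-ETM", True, False),
    ("DM-ETM", True, True),
]
NUM_TOPICS = [10, 20, 30, 40, 50]
EPOCHS = 100


def build_grid():
    """Enumerate one run config per (dataset, setting, number of topics)."""
    runs = []
    for dataset, (model, word_emb, source_emb), k in product(
        DATASETS, SETTINGS, NUM_TOPICS
    ):
        runs.append(
            {
                "dataset": dataset,
                "model": model,
                "pretrained_word_emb": word_emb,
                "pretrained_source_emb": source_emb,
                "num_topics": k,
                "epochs": EPOCHS,
            }
        )
    return runs


def best_epoch(val_ppl_per_epoch):
    """Return (epoch index, val ppl) with the lowest non-NaN val ppl."""
    valid = [
        (ppl, epoch)
        for epoch, ppl in enumerate(val_ppl_per_epoch)
        if not math.isnan(ppl)
    ]
    if not valid:
        raise ValueError("all validation perplexities are NaN")
    ppl, epoch = min(valid)
    return epoch, ppl
```

With 4 datasets, 5 settings and 5 topic counts this gives 100 runs, so it may be worth logging the per-epoch val ppl of each run to disk and applying `best_epoch` afterwards.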

Finally, compare these models in terms of test perplexity. We will report the results in a table (bold-face the best performing model in each category).
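One possible way to produce the table, sketched under assumptions: `format_results` is a hypothetical helper, "category" is taken to mean whatever grouping we settle on (for example one per dataset), and lower test perplexity is treated as better. It emits a Markdown table so it can be pasted into this issue.

```python
def format_results(results):
    """Render {category: {model: test_ppl}} as a Markdown table.

    The lowest test perplexity in each category is bold-faced.
    """
    lines = ["| Category | Model | Test ppl |", "| --- | --- | --- |"]
    for category, scores in results.items():
        best_model = min(scores, key=scores.get)
        for model, ppl in scores.items():
            cell = f"{ppl:.1f}"
            if model == best_model:
                cell = f"**{cell}**"
            lines.append(f"| {category} | {model} | {cell} |")
    return "\n".join(lines)
```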

We will then pick the best model for downstream topic analysis:

Overall topic popularity. To help annotate the topics, we will average and re-normalize the dynamic topics across time, so that we have a fixed set of topics to visualize and manually annotate (same as the Fig. 2 I have)

Country-specific topic popularity based on the DM-ETM results
Correlating dynamic topics with confirmed cases and deaths over time
(If space allows) Predicting interventions (comparing the best D-ETM and DM-ETM chosen based on the above val ppl) using a separate classifier (we will explore semi-supervised DM-ETM and D-ETM in the next paper)
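Two of the analysis steps above, sketched in plain Python. The layout `beta[k][t][v]` (topic, time step, vocabulary word) is an assumption about how the dynamic topics are stored, and both function names are hypothetical. Pearson correlation is used here only as one simple choice for relating a topic's popularity over time to case or death counts.

```python
import math


def average_topics_over_time(beta):
    """Collapse dynamic topics beta[k][t][v] into static topics [k][v].

    Each topic is averaged over time steps and re-normalized so that its
    word probabilities sum to 1.
    """
    static = []
    for topic in beta:
        num_steps = len(topic)
        vocab_size = len(topic[0])
        mean = [
            sum(step[v] for step in topic) / num_steps
            for v in range(vocab_size)
        ]
        total = sum(mean)
        static.append([p / total for p in mean])
    return static


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (NaN if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)
```

For the correlation step, `xs` would be one topic's popularity per time step and `ys` the confirmed cases (or deaths) over the same time steps, for the same country.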
