As a final component of the novelty measurement workstream, we'd like to predict new links between topics that have not yet co-occurred (or alternatively: highlight topic combinations that are "under-performing" relative to their semantic similarity).
We will adapt the approach of Tacchella et al. 2020, who used patent codes. Due to time constraints, we limit the application to research papers for now.
This will include the following tasks:
Fetch topics for each research paper (include all papers)
Use word2vec to create vector representations of topics (treat topics as tokens/words and papers as sentences)
Calculate semantic similarity (i.e., cosine similarity) between topics using their word2vec vectors
Output 1: Identify topic pairs that are semantically similar but not yet co-occurring > potential new links in the future!
Output 2: Compare semantic similarity against topic-pair co-occurrence values. Identify potentially "under-performing" pairs, whose semantic similarity suggests they should co-occur more often (i.e., their co-occurrence falls below some threshold or trendline). We speculate that these topic combinations might grow in the future.
Finally: Output 1 or Output 2 could be aggregated at the topic level, to highlight topics that we predict will experience more change in the future.
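A minimal sketch of Outputs 1 and 2 and the topic-level aggregation, assuming per-pair similarities and co-occurrence counts are already computed (the numbers, the similarity threshold, and the linear "trendline" slope are all hypothetical placeholders; in practice the trendline would be fitted to the data):

```python
# Sketch of Outputs 1 and 2 on toy numbers (not real data).
from collections import Counter

similarity = {   # hypothetical word2vec cosine similarities per topic pair
    ("genomics", "cancer"): 0.85,
    ("genomics", "nlp"): 0.70,
    ("cancer", "nlp"): 0.10,
}
cooccurrence = {  # hypothetical paper-level co-occurrence counts
    ("genomics", "cancer"): 120,
    ("genomics", "nlp"): 0,
    ("cancer", "nlp"): 3,
}

SIM_THRESHOLD = 0.5     # placeholder cutoff
EXPECTED_PER_SIM = 100  # placeholder trendline slope: expected co-occurrences per unit similarity

# Output 1: semantically similar pairs that have never co-occurred
output1 = [p for p, s in similarity.items()
           if s >= SIM_THRESHOLD and cooccurrence[p] == 0]

# Output 2: "under-performing" pairs whose co-occurrence falls below the trendline
output2 = [p for p, s in similarity.items()
           if cooccurrence[p] > 0 and cooccurrence[p] < EXPECTED_PER_SIM * s]

# Topic-level aggregation: count how often each topic appears in a flagged pair
change_score = Counter(t for pair in output1 + output2 for t in pair)

print(output1)  # [('genomics', 'nlp')]
print(output2)  # [('cancer', 'nlp')]
print(change_score.most_common())  # nlp appears in both flagged pairs
```

The `change_score` counter is one simple way to realize the final aggregation: topics that participate in many flagged pairs are the ones we predict to change most.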