The claim is that the final taxonomy ought to use the method that is more robust to possible measurement / missing data errors. To that end, a sensitivity analysis will leverage a sample (or more) of datapoints to perform the following:
1. Identify groupings of words that arguably MUST co-occur in topics (i.e. quasi-multicollinear). In the picture, these are Tg (Topic groups).
2. Compute the probability with which a method captures these words in the same cluster, for each Level (axis L in the picture).
3. Compute the above probability for varying levels of noise (axis N in the picture).
Note that noise should be randomly sampled embeddings (if we instead use non-sample observations, we are in effect capturing model convergence; see issue #46).
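A minimal sketch of steps 2-3, assuming a generic `cluster_fn` that maps embeddings to cluster labels; the function names, the resampling-based noise, and the trial count are illustrative assumptions, not a settled design:

```python
import numpy as np

def group_capture_probability(embeddings, cluster_fn, group_idx,
                              noise_frac, n_trials=50, seed=0):
    """Estimate the probability that a known topic group Tg ends up in a
    single cluster when a fraction of embeddings is replaced by randomly
    sampled ones (axis N)."""
    rng = np.random.default_rng(seed)
    n, hits = len(embeddings), 0
    for _ in range(n_trials):
        noisy = embeddings.copy()
        k = int(noise_frac * n)
        if k:
            idx = rng.choice(n, size=k, replace=False)
            # noise = randomly sampled embeddings (per the note above)
            noisy[idx] = embeddings[rng.choice(n, size=k)]
        labels = cluster_fn(noisy)
        # a "hit" means every word in Tg received the same cluster label
        hits += len({labels[i] for i in group_idx}) == 1
    return hits / n_trials
```

Repeating this over a grid of `noise_frac` values traces out axis N for one method at one Level.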
Arguably, this will allow us to:
- optimise hyperparameters so as to yield the most robust set of outputs (i.e. the smallest decrease in quality/purity as more noise is introduced);
- compare across methods and select the one that is most robust overall (i.e. least sensitive to the introduction of noise).
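One possible selection rule for both goals (illustrative only: the area-under-curve score and the `methods` interface are assumptions) is to score each method or hyperparameter setting by how slowly its capture probability decays along axis N:

```python
import numpy as np

def robustness_score(capture_prob, noise_grid):
    """Trapezoidal area under the capture-probability vs. noise curve;
    a higher score means quality/purity decays more slowly with noise."""
    x = np.asarray(noise_grid, dtype=float)
    y = np.array([capture_prob(n) for n in noise_grid], dtype=float)
    return float(np.sum((y[:-1] + y[1:]) / 2 * np.diff(x)))

def select_most_robust(methods, noise_grid):
    """methods: {name: callable mapping a noise level to a capture
    probability}. Returns the least noise-sensitive method and all scores."""
    scores = {name: robustness_score(fn, noise_grid)
              for name, fn in methods.items()}
    return max(scores, key=scores.get), scores
```

The same score works for hyperparameter search (compare settings of one method) and for the cross-method comparison.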
Issues to resolve:
Noise: for comparability across methods to be viable, the measure of noise must itself be comparable. Currently, adding embeddings and randomly assigning them to articles is not comparable: network approaches are exposed to two sources of noise (the embeddings and the article-entity allocation), while semantic approaches are agnostic about the article-entity allocation.
Computational cost: sensitivity analysis tends to be costly, since it requires many iterations (by the law of large numbers) to yield reliable results. Using a sample reduces overhead, but it may also mask different convergence rates across methods (i.e. performance on the sample may differ from performance on the entire dataset).
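One way to address the noise-comparability issue above (a sketch, not a settled design; `perturb_embeddings` and the shared-perturbation harness are assumptions) is to define noise once, at the embedding level, and feed the identical perturbed input to every method, so no method is exposed to an extra noise source such as article-entity reallocation:

```python
import numpy as np

def perturb_embeddings(embeddings, noise_frac, rng):
    """Single, method-agnostic noise definition: replace a fixed fraction
    of embedding rows with rows resampled from the data itself."""
    noisy = embeddings.copy()
    n = len(embeddings)
    k = int(noise_frac * n)
    if k:
        idx = rng.choice(n, size=k, replace=False)
        noisy[idx] = embeddings[rng.choice(n, size=k)]
    return noisy

def run_comparable_trial(embeddings, methods, noise_frac, seed=0):
    """Draw ONE perturbation and hand the same noisy embeddings to every
    method, so semantic and network approaches see identical noise."""
    rng = np.random.default_rng(seed)
    noisy = perturb_embeddings(embeddings, noise_frac, rng)
    return {name: fn(noisy) for name, fn in methods.items()}
```

The harness fixes noise exposure per trial; each method's downstream pipeline (clustering, network construction) then runs from the same perturbed state.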
NB: this corresponds to the bullet point "Variance of resulting taxonomy based on initialisation / hyperparameters / introducing noise and checking if known words are always ending up in the same category" in Emily's comment in #12.