nestauk / dap_aria_mapping

Mapping technology innovation to support The Advanced Research and Innovation Agency (ARIA)
MIT License

Validation: Sensitivity analysis with respect to noise #45

Closed ampudia19 closed 1 year ago

ampudia19 commented 1 year ago

[Image: diagram of the sensitivity analysis, with axes L (levels) and N (noise) and topic groups Tg]

The final taxonomy should use whichever method is more robust to measurement error and missing data. To test this, a sensitivity analysis will use one or more samples of datapoints to do the following:

  1. Identify groupings of words that arguably MUST co-occur in topics (i.e. quasi-multicollinear). In the picture, these are Tg (topic groups).
  2. Compute the probability with which a method captures these words in the same cluster, for each level (axis L in the picture).
  3. Compute the above probability for varying levels of noise, corresponding to axis N.
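
As a rough illustration of step 2, here is a minimal sketch of one way the probability could be computed: the share of within-group word pairs that land in the same cluster, per level. The function name `co_cluster_probability`, the example words, and the cluster labels are all hypothetical, and pairwise counting is an assumption rather than the agreed metric.

```python
from itertools import combinations


def co_cluster_probability(topic_groups, labels):
    """Share of within-group word pairs assigned to the same cluster.

    topic_groups: list of word lists that are expected to co-occur (Tg).
    labels: dict mapping word -> cluster id at a single taxonomy level.
    Words missing from `labels` are skipped (an assumption of this sketch).
    """
    same = total = 0
    for group in topic_groups:
        present = [word for word in group if word in labels]
        for a, b in combinations(present, 2):
            total += 1
            same += labels[a] == labels[b]
    return same / total if total else float("nan")


# Hypothetical topic groups and cluster assignments for two levels.
topic_groups = [["neuron", "synapse", "axon"], ["qubit", "entanglement"]]
labels_by_level = {
    1: {"neuron": 0, "synapse": 0, "axon": 0, "qubit": 1, "entanglement": 1},
    2: {"neuron": 0, "synapse": 0, "axon": 2, "qubit": 1, "entanglement": 3},
}

probabilities = {
    level: co_cluster_probability(topic_groups, labels)
    for level, labels in labels_by_level.items()
}
```

With these toy labels, every pair stays together at level 1, while only one of the four pairs does at level 2, so the probability drops as the taxonomy gets finer.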

Note that the noise should consist of randomly sampled embeddings. If we use non-sample observations instead, we are in effect measuring model convergence (see issue #46).
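
One possible reading of "randomly sampled embeddings" is sketched below: synthetic vectors drawn uniformly within the per-dimension range of the observed embeddings and appended at a given ratio. The helper `add_noise`, the `__noise_` key prefix, the uniform sampling scheme, and the toy two-dimensional embeddings are assumptions for illustration only.

```python
import random


def add_noise(embeddings, noise_ratio, seed=0):
    """Return a copy of `embeddings` with synthetic noise vectors appended.

    embeddings: dict mapping entity -> list of floats.
    noise_ratio: number of noise vectors as a fraction of the original count.
    Noise is drawn uniformly within each dimension's observed range
    (an assumption; other sampling schemes are equally plausible).
    """
    rng = random.Random(seed)
    vectors = list(embeddings.values())
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    noisy = dict(embeddings)
    for i in range(round(noise_ratio * len(vectors))):
        noisy[f"__noise_{i}"] = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
    return noisy


# Hypothetical toy embeddings; real ones would come from the pipeline.
embeddings = {
    "neuron": [0.1, 0.9],
    "synapse": [0.2, 0.8],
    "qubit": [0.9, 0.1],
    "entanglement": [0.8, 0.2],
}

# One noisy dataset per noise level (axis N).
noisy_by_level = {n: add_noise(embeddings, n) for n in (0.0, 0.5, 1.0)}
```

Each noisy dataset would then be re-clustered by every candidate method and scored as in step 2, giving one probability per method, level and noise level.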

Arguably, this will allow to

Issues to resolve:

NB: This corresponds to the bullet point "Variance of resulting taxonomy based on initialisation / hyperparameters / introducing noise and checking if known words are always ending up in the same category" in Emily's comment in #12.