This is a messy update, as it brings together code for:
exploration of the overlapping taxonomy, using GMMs as the bottom layer so that entities can appear in multiple topics. Code was adapted elsewhere to produce the relevant taxonomy outputs, and in pipeline/semantic_taxonomy/overlapping_exploration I grid search over the hyperparameters of both the GMM and the PCA reduction of the embeddings (to reduce overfitting and make it easier for entities to be shared across topics). The results suggest little overlap even in low-resolution versions (i.e. PCA with few dimensions).
simulate taxonomies under stress conditions, either by adding discarded noise back to the set of entities or by using a subset of entities. These scripts are in pipeline/simulations, under the noise and sample folders. I also created naming pipelines for all these topics, where names are simply the counts of the most frequent entities (counts are calculated from a subset of articles). I also wrote the pipelines so they can run on GMM or GMM+PCA versions of the centroids taxonomy, but this is now irrelevant (having decided not to pursue the overlapping approach).
produce validation metrics for taxonomies, in folder pipeline/simulations. This creates data on topic sizes under stress conditions, on the distances of entities to their topic centroids, and on whether pairs of entities manually identified as sharing a topic actually do so in practice.
minor fixes to validation plots. In addition, had time allowed, I would have liked to refactor these and produce them the way Emily does (i.e. output data, create getters for the data, build plots dynamically in the Streamlit app), but this is not currently feasible.
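For reviewers who want a feel for the overlapping exploration, the grid search over PCA dimensions and GMM components described above can be sketched roughly as follows. This is a minimal illustration and not the code in pipeline/semantic_taxonomy/overlapping_exploration: the function names, the 0.2 membership threshold, the diagonal covariance choice and the synthetic embeddings are all assumptions made for the example.

```python
from itertools import product

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def overlap_share(probs, threshold=0.2):
    """Share of entities with membership probability >= threshold in 2+ topics.

    The threshold is an assumption for this sketch, not a value from the PR.
    """
    topics_per_entity = (probs >= threshold).sum(axis=1)
    return float((topics_per_entity >= 2).mean())


def grid_search_overlap(embeddings, pca_dims, n_topics, threshold=0.2, seed=0):
    """Fit PCA followed by a GMM for every (pca_dim, n_topics) combination."""
    results = []
    for dim, k in product(pca_dims, n_topics):
        # Reduce the embeddings to `dim` dimensions before clustering.
        reduced = PCA(n_components=dim, random_state=seed).fit_transform(embeddings)
        gmm = GaussianMixture(
            n_components=k, covariance_type="diag", random_state=seed
        )
        # Soft assignments: one probability per entity and topic.
        probs = gmm.fit(reduced).predict_proba(reduced)
        results.append(
            {
                "pca_dim": dim,
                "n_topics": k,
                "overlap": overlap_share(probs, threshold),
                "bic": float(gmm.bic(reduced)),
            }
        )
    return results


# Synthetic stand-in for entity embeddings: 300 entities, 32 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 32))

results = grid_search_overlap(embeddings, pca_dims=[2, 5, 10], n_topics=[3, 5])
for row in results:
    print(row)
```

Each row records how many entities end up with meaningful membership in more than one topic for a given setting, which is the quantity the exploration found to be low even with few PCA dimensions.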
Fixes issues #45 and #46
Instructions for Reviewer
Probably too many things for someone to review; if anything, check pipeline/simulations for the code that is relevant for validation.
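As orientation for reading pipeline/simulations, the three validation metrics mentioned in the description (topic sizes, distances of entities to their topic centroids, and whether manually identified pairs share a topic) amount to roughly the following. This is only a sketch under assumptions: the function names, the use of Euclidean distance and the toy entities are hypothetical and do not mirror the actual scripts.

```python
import math
from collections import Counter, defaultdict


def centroid_distances(vectors, topics):
    """Euclidean distance from each entity to the centroid of its own topic."""
    members = defaultdict(list)
    for entity, topic in topics.items():
        members[topic].append(vectors[entity])
    # Centroid = coordinate-wise mean of the vectors assigned to the topic.
    centroids = {
        topic: [sum(column) / len(vecs) for column in zip(*vecs)]
        for topic, vecs in members.items()
    }
    return {
        entity: math.dist(vectors[entity], centroids[topic])
        for entity, topic in topics.items()
    }


def pair_agreement(pairs, topics):
    """Share of manually identified pairs whose two entities share a topic."""
    hits = sum(topics[a] == topics[b] for a, b in pairs)
    return hits / len(pairs)


# Toy data: hypothetical entities, vectors and topic labels.
vectors = {
    "cat": [0.0, 0.0],
    "dog": [2.0, 0.0],
    "car": [10.0, 10.0],
    "bus": [10.0, 12.0],
}
topics = {"cat": "animals", "dog": "animals", "car": "transport", "bus": "transport"}
pairs = [("cat", "dog"), ("car", "bus"), ("dog", "car")]

topic_sizes = Counter(topics.values())
distances = centroid_distances(vectors, topics)
agreement = pair_agreement(pairs, topics)

print(topic_sizes)
print(distances)
print(agreement)
```

Running the same three functions on a taxonomy built from a noisy or subsampled entity set gives the stress-condition comparison.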
Checklist:
[x] I have refactored my code out from notebooks/
[x] I have checked the code runs
[x] I have tested the code
[x] I have run pre-commit and addressed any issues not automatically fixed
[x] I have merged any new changes from dev
[x] I have documented the code
[x] Major functions have docstrings
[x] Appropriate information has been added to READMEs