Closed Juan-Mateos closed 2 years ago
What this does
We cluster company descriptions within SIC codes. Flow and utility scripts can be found in industrial_taxonomy/pipeline/cluster_glass.
The user can choose the minimum sector size to include in the clustering (defaults to 1000).
The flow branches over different values of an assigned_shares parameter that controls the number of companies within a SIC code to assign to clusters. We can test performance for different parameter values downstream.
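As a rough illustration of the two settings above, here is a minimal plain-Python sketch (not the flow itself). The helper names filter_sectors, n_to_assign and run_branches are hypothetical, and the reading of assigned_shares values (integers as counts, floats as a share of the sector, "all" as every company) is an assumption rather than something confirmed by the flow code.

```python
from typing import Dict, List, Union

Share = Union[int, float, str]


def filter_sectors(
    corpora: Dict[str, List[str]], min_size: int = 1000
) -> Dict[str, List[str]]:
    """Keep only sectors with at least `min_size` descriptions."""
    return {sic: docs for sic, docs in corpora.items() if len(docs) >= min_size}


def n_to_assign(share: Share, sector_size: int) -> int:
    """Assumed interpretation of one assigned_shares value (hypothetical)."""
    if share == "all":
        return sector_size
    if isinstance(share, float):
        # Treat floats as a fraction of the companies in the sector
        return int(share * sector_size)
    # Treat integers as an absolute count, capped at the sector size
    return min(int(share), sector_size)


def run_branches(
    corpora: Dict[str, List[str]], assigned_shares: List[Share]
) -> Dict[Share, Dict[str, int]]:
    """Stand-in for branch and join: one result per parameter value,
    merged into a single dict keyed by that value."""
    return {
        share: {sic: n_to_assign(share, len(docs)) for sic, docs in corpora.items()}
        for share in assigned_shares
    }
```

For example, run_branches(filter_sectors(corpora), [10, 0.5, "all"]) returns one entry per parameter value, which is the shape a downstream comparison of parameter values would need.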
Before running
Install graph_tool
Install topsbm by running gh repo clone martingerlach/hSBM_Topicmodel inside industrial_taxonomy/pipeline/glass_clusters (I couldn't import the topic model when installing via pip @bishax).
model.clusters method in the sbmtm.py script (see here for more information).
Running [10,100,500,0.5,"all"] business descriptions for cluster took ca. 3.5 hours on my machine. If you just want to check that the code runs for you, run the flow with --test-mode=True so it only runs on three SIC codes, and --assigned_shares='[10,1000]' so that it only does two extractions per SIC code.
I did the following:
Generic stuff
hsbmtm from the README
sbmtm.py repo
flow.py
test-mode only runs when not with --production.
all_sectors_corpora as a dict (converting it to a list was a leftover from an older version of the flow where I branched the flow over sectors)
assigned-shares parameter
{assigned_shares:models, clusters} dicts in the cluster_glass_descriptions step as suggested and used the join step to combine them into a single dict.
topic_utils.py
fit_model code within fit_model_sector
utils.py
partial in the tokenising pipe
gl_sic4 name
glass_sic4_lookup to the glass_house getters and added new info to the docstring
apply to slice get_sector
requirements.txt
pip install topsbm thing.
I didn't do the following
conda decorator to create a virtual environment with graph-tool but this raised a pip install error so I decided to follow your advice in Kuebiko: "stick to using your project’s main environment and only use Metaflow’s conda decorators when you definitely need them."
sbmtm.py but that raised an error I assume is linked to subsequent changes in graph-tool: minimize_nested_blockmodel_dl() got an unexpected keyword argument 'deg_corr'
{k:v for k,v in dict.items()}
Misc
strip_nes is because we wouldn't want ngrams including generic NE names to appear in sector names generated later. Admittedly, this is unlikely to happen, and we could also strip them when we name the sectors. I don't feel strongly either way.
Points for clarification
What I have done:
flow.py
util and topic_util imports
self.sectors
utils.py
big_ to sector_tokens_lookup
misc
utils/utils.py as collections.py
What I haven't done
What I have done:
flow.py
mypy and fixed all typing errors
glass_sic4_lookup as a parameter into make_sector_corpora
self.sectors assignment
topic_utils import
utils.py
glass_sic4_lookup into a parameter for make_sector_corpora
Given issues loading saved models, I opted to leave the sbmtm imports as they were previously.
I ran the flow in test-mode and it works.
I turned get_descriptions_tokenised into an argument of make_sector_corpora.
I ran it in test-mode and it works.
Closes #19
Checklist:
notebooks/
flake8 and addressed any linter errors
pre-commit and addressed any issues not automatically fixed
dev (or merged any new changes from dev)
READMEs
output/reports/