nestauk / industrial_taxonomy

Refactor of nestauk/industrial-taxonomy, which it will replace upon completion.
MIT License

glass clustering flow + util scripts #27

Closed Juan-Mateos closed 2 years ago

Juan-Mateos commented 2 years ago

Closes #19


Checklist:

Juan-Mateos commented 2 years ago

What this does

We cluster company descriptions within SIC codes. The flow and utility scripts are in industrial_taxonomy/pipeline/cluster_glass.

The user can choose the minimum size a sector must have to be included in the clustering (defaults to 1000).
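To illustrate the size threshold, here is a minimal sketch in plain Python. It is not the code in flow.py: the function name `filter_sectors_by_size`, the parameter name `min_sector_size`, and the input shape (pairs of company id and SIC code) are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def filter_sectors_by_size(
    company_sics: Iterable[Tuple[str, str]],
    min_sector_size: int = 1000,
) -> Dict[str, List[str]]:
    """Group company ids by SIC code, dropping sectors below the threshold.

    Hypothetical helper: names and input format are illustrative only.
    """
    pairs = list(company_sics)
    # Count how many companies fall in each SIC code.
    sector_sizes = Counter(sic for _, sic in pairs)
    sectors: Dict[str, List[str]] = {}
    for company_id, sic in pairs:
        # Keep a company only if its sector is large enough.
        if sector_sizes[sic] >= min_sector_size:
            sectors.setdefault(sic, []).append(company_id)
    return sectors


# Toy data; the threshold is lowered so the example stays readable.
toy = [("a", "6201"), ("b", "6201"), ("c", "6201"), ("d", "0111")]
kept = filter_sectors_by_size(toy, min_sector_size=2)
print(kept)  # {'6201': ['a', 'b', 'c']}
```

With the default of 1000, the toy data above would yield an empty result, since no sector reaches the threshold.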

The flow branches over different values of an assigned_shares parameter, which controls how many companies within a SIC code are assigned to clusters. This lets us test performance for different parameter values downstream.
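The branching idea can be sketched as one run per parameter value, with results keyed by that value so they can be compared downstream. This is only an illustration under stated assumptions: it treats each assigned_shares value as a fraction in (0, 1], the helper names `n_to_assign` and `run_branches` are hypothetical, and the clustering step is replaced by a placeholder that takes the first n companies in each sector.

```python
import math
from typing import Dict, List, Sequence


def n_to_assign(sector_size: int, assigned_share: float) -> int:
    """Number of companies in a sector to assign to clusters (rounded up).

    Assumes assigned_share is a fraction of the sector; the real
    parameter may be defined differently.
    """
    if not 0 < assigned_share <= 1:
        raise ValueError("assigned_share must be in (0, 1]")
    return math.ceil(sector_size * assigned_share)


def run_branches(
    sector_companies: Dict[str, List[str]],
    assigned_shares: Sequence[float],
) -> Dict[float, Dict[str, List[str]]]:
    """Run one branch per assigned_shares value and key results by value."""
    results: Dict[float, Dict[str, List[str]]] = {}
    for share in assigned_shares:
        # Placeholder for the real clustering step: keep the first n
        # companies of each sector.
        results[share] = {
            sic: companies[: n_to_assign(len(companies), share)]
            for sic, companies in sector_companies.items()
        }
    return results


sectors = {"6201": ["a", "b", "c", "d"]}
branches = run_branches(sectors, assigned_shares=[0.5, 1.0])
print(branches[0.5])  # {'6201': ['a', 'b']}
```

Keying the output by parameter value keeps each branch's result separate, which is what makes a downstream comparison straightforward.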

Before running

Running

Juan-Mateos commented 2 years ago

I did the following:

Generic stuff

flow.py

topic_utils.py

utils.py

requirements.txt

I didn't do the following

Misc

Points for clarification

Juan-Mateos commented 2 years ago

What I have done:

flow.py

utils.py

misc

What I haven't done

Juan-Mateos commented 2 years ago

What I have done:

flow.py

utils.py

Given the issues with loading saved models, I opted to leave the sbmtm imports as they were.

I ran the flow in test mode and it works.

Juan-Mateos commented 2 years ago

I turned get_descriptions_tokenised into an argument of make_sector_corpora.

I ran it in test mode and it works.
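For readers unfamiliar with the pattern, passing get_descriptions_tokenised in as an argument might look roughly like the sketch below. The signature, type alias, and return shape are assumptions for illustration and do not reproduce the actual code in utils.py; only the two function names come from this thread.

```python
from typing import Callable, Dict, List

# Assumed shape: company id -> list of tokens.
TokenisedDescriptions = Dict[str, List[str]]


def make_sector_corpora(
    sector_companies: Dict[str, List[str]],
    get_descriptions_tokenised: Callable[[], TokenisedDescriptions],
) -> Dict[str, List[List[str]]]:
    """Build one corpus (a list of token lists) per sector.

    The getter is injected by the caller, so this function does not
    need to know where the tokenised descriptions come from.
    """
    tokenised = get_descriptions_tokenised()
    return {
        # Skip companies that have no tokenised description.
        sic: [tokenised[c] for c in companies if c in tokenised]
        for sic, companies in sector_companies.items()
    }


def fake_getter() -> TokenisedDescriptions:
    """Stand-in getter returning a tiny in-memory sample."""
    return {"a": ["software", "consulting"], "b": ["cloud", "hosting"]}


corpora = make_sector_corpora({"6201": ["a", "b", "z"]}, fake_getter)
print(corpora)  # {'6201': [['software', 'consulting'], ['cloud', 'hosting']]}
```

One benefit of this design is that a small stand-in getter can be swapped in for quick runs or tests without touching make_sector_corpora itself.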