nestauk / industrial_taxonomy

Refactor of nestauk/industrial-taxonomy, which it will replace upon completion.
MIT License

Fetch and pre-process Companies House data #13

Closed: bishax closed this 2 years ago

bishax commented 2 years ago

Closes #3

bishax commented 2 years ago

@Juan-Mateos: the schema currently remains as before. Now would be the time to request any minor schema changes.

E.g. I noticed that `get_sector()` has a column `data_dump_date` giving the latest of the three monthly Companies House data dumps (the same three months over which the Glass data was collected) that the sector assignment came from. This is probably not necessary to know for this project?
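To make the meaning of `data_dump_date` concrete, here is a minimal sketch of how such a column could arise when three monthly dumps are combined. It assumes the sector for each company is taken from the most recent dump in which the company appears; the rows, the tuple layout, and the helper `latest_sector` are hypothetical and not taken from the repo.

```python
from datetime import date

# Hypothetical rows: (company_number, sic_code, data_dump_date).
# Company "002" is absent from the July dump.
dumps = [
    ("001", "62012", date(2020, 5, 1)),
    ("001", "62012", date(2020, 6, 1)),
    ("001", "62020", date(2020, 7, 1)),
    ("002", "47110", date(2020, 5, 1)),
    ("002", "47110", date(2020, 6, 1)),
]


def latest_sector(rows):
    """Keep, per company, the sector from the most recent dump it appears in."""
    latest = {}
    for company, sic, dump_date in rows:
        # Overwrite only when this row comes from a later dump.
        if company not in latest or dump_date > latest[company][1]:
            latest[company] = (sic, dump_date)
    return latest


sectors = latest_sector(dumps)
for company, (sic, dump_date) in sorted(sectors.items()):
    print(company, sic, dump_date.isoformat())
```

Under this assumption, dropping `data_dump_date` loses only the information about which dump a company was last seen in.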

bishax commented 2 years ago

> Tested and all works. Good to merge with a couple of observations.
>
> * In `get_name()`, what is `name_age_index` for? Worth dropping?

A value of n means the row holds the company's name as of n renames ago (most companies have never renamed themselves).

> * What does it mean for a company name to have an invalid date? Do we want to drop those that do? If not, I suppose we should remove the variable.

If you mean in `get_name()`, the date is the date on which the name became invalid (i.e. the company changed from that name to something else).

> * Re your query about `data_dump_date` in sector: I wasn't sure I understood what explains the difference. I expect it would be fine to drop, since 99.8% of observations come from July 2020 in any case!

If companies appear in the May and June dumps but not in July, that implies they have been removed from the register for some reason (e.g. dissolved).

I'll add documentation around the above.
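As a starting point for that documentation, the relationship between `name_age_index` and the invalid date can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: it assumes index 0 is the current name (zero renames ago) and that the current name has no invalid date. The company names, the tuple layout, and the helper `with_name_age_index` are hypothetical.

```python
from datetime import date

# Hypothetical name history for one company, oldest name first:
# (name, date the name became invalid). None is an assumed marker
# for the current name, which has not been replaced.
history = [
    ("Acme Widgets Ltd", date(2015, 3, 1)),
    ("Acme Holdings Ltd", date(2019, 9, 12)),
    ("Acme Group Ltd", None),
]


def with_name_age_index(names):
    """Attach name_age_index: n for the name held n renames ago."""
    newest_first = list(reversed(names))
    return [
        (n, name, invalid_date)
        for n, (name, invalid_date) in enumerate(newest_first)
    ]


rows = with_name_age_index(history)
for name_age_index, name, invalid_date in rows:
    print(name_age_index, name, invalid_date)
```

A company that has never renamed itself would have a single row with index 0, which matches the observation that most companies have only one name.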

Juan-Mateos commented 2 years ago

All fine, gtm