nestauk / discovery_utils

Data analysis utils for Discovery Hub projects
MIT License
0 stars 0 forks source link

Improving Crunchbase getters #47

Closed beingkk closed 1 month ago

beingkk commented 1 month ago

Fixes #44

Modifies (rewrites) discovery_utils.getters.crunchbase

Adds CrunchbaseGetter class for easy and convenient fetching of the different Crunchbase data tables.

Instructions to the reviewers

Hey Rosie @RFOxbury, would appreciate if you can take a look at this PR and see that it makes sense! It's largely lifted from innovation_sweet_spots repo, so should be OK, but good to have another pair of eyes.

Hey Shabeer @shabrf and Solomon @sqr00t , no need to review this PR for you - just wanted to ask your opinion on how you would write tests for such a getter class. For example, I guess we could mock up an S3 bucket using LocalStack to check the logic... and then do the actual test of fetching the data from the real S3 bucket and running some data integrity tests? Just curious if you have seen something

Another question where I'd like your opinion - if I would like to add some utility functions that are not simply getters but say help query the Crunchbase data (eg, "find me all companies belonging to Crunchbase category X"). Do you think it makes sense to still add such a utility function in getters, or should we have a different category of utils for that? (eg, "analysis utils")

Ohh, and I had to downgrade Python version to 3.11 because pflake8 was giving an error - I guess some incompatibility with Python 3.12, not sure why it suddenly appeared!

Run Flake8...............................................................Failed
- hook id: flake8
- exit code: 1

Traceback (most recent call last):
  File "/Users/karlis.kanders/Code/discovery_utils/.venv/bin/pflake8", line 5, in <module>
    from pflake8.__main__ import main
  File "/Users/karlis.kanders/Code/discovery_utils/.venv/lib/python3.12/site-packages/pflake8/__init__.py", line 50, in <module>
    class DivertingSafeConfigParser(ConfigParserTomlMixin, configparser.SafeConfigParser):
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
beingkk commented 1 month ago

Thank you @RFOxbury, I'll add your suggestions in the documentation. I'm slowly writing Wiki entries for the repo, so that's probably a good place (or maybe for Crunchbase we'd do a private google doc, because it's proprietary dataset)

I'll go ahead and merge, thanks!