uchicago-dsi / climate-cabinet-campaign-finance-tracker


Make orgs classification script into more well-defined pipeline #92

Open trevorspreadbury opened 5 months ago


https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/9db429d8f209843075b787b08dc4dded5f71a787/src/utils/orgs_classification_data_pipeline.py#L2

This seems like a good direction, but the exact purpose of this file is unclear and undocumented right now. My guess is that the eventual idea is to combine a collection of raw data files containing company information into a single CSV with a well-defined schema that holds details and classifications for companies. Convert this script into functions and define that output schema (ideally with record linkage in mind).
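As a rough illustration of the shape this could take, here is a minimal sketch using only the standard library. Everything in it is an assumption for discussion rather than the existing code: the `OrgRecord` fields, the function names, and the column names passed to `load_raw` are all hypothetical, and the exact-match deduplication in `combine` is only a naive placeholder for real record linkage.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class OrgRecord:
    """Hypothetical output schema: one row per company."""

    org_id: str
    name: str
    name_normalized: str  # cleaned name kept as a candidate record-linkage key
    state: str
    classification: str
    source: str  # which raw file the row came from


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def load_raw(rows, source, name_col, state_col, class_col):
    """Map one raw file's rows (dicts) onto the shared schema.

    The column names are passed in because each raw file is assumed to
    label its columns differently.
    """
    records = []
    for i, row in enumerate(rows):
        name = row[name_col].strip()
        records.append(
            OrgRecord(
                org_id=f"{source}-{i}",
                name=name,
                name_normalized=normalize_name(name),
                state=row[state_col].strip().upper(),
                classification=row[class_col].strip().lower(),
                source=source,
            )
        )
    return records


def combine(*batches):
    """Concatenate batches, keeping the first record per (normalized name, state).

    Exact-match deduplication is a stand-in only; real record linkage
    would replace this step.
    """
    seen = set()
    combined = []
    for batch in batches:
        for rec in batch:
            key = (rec.name_normalized, rec.state)
            if key not in seen:
                seen.add(key)
                combined.append(rec)
    return combined


def write_csv(records, handle):
    """Write records to an open text handle with a header matching the schema."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(OrgRecord)])
    writer.writeheader()
    for rec in records:
        writer.writerow(asdict(rec))
```

Each raw file would get its own `load_raw` call (for example with rows from `csv.DictReader`), and the results would be passed to `combine` and then `write_csv`. Keeping the schema in one dataclass means the output columns are defined in a single place that the data README can point to.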

Additionally, work on these raw files is currently split between this file and the EDA folder. The EDA folder is fine for EDA for now, but it shouldn't be part of the final production pipeline. Move all the code for processing the raw data files into this pipeline. Then add details to the data README on where you retrieved each of these files.

Since the InfoGroup/DataAxel data is copyrighted, we can't make it publicly available in large chunks. In any case, these CSVs will grow quite large, so we don't want them in the repository. Add a link to the pipeline's output file in Google Drive.