A Collate Data: Proposal to refactor the download section of the code so that the actual downloading of files is hoisted into a separate function (similar to the current `download_files`), while a new `get_links()` function returns just the links that need to be downloaded. Also allow customization of the download folder. This way, unit tests on `get_links()` run much faster, as only sources whose webpages need to be scraped would require a network call*. Downloads requiring an API (currently only `cdsapi`-based sources) would live in a separate file and be tested only when the API fetch code changes. The download function will use the `requests-mock` adapter so that unit tests can run without network access.
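As a rough sketch of the proposed split (the URLs and function bodies below are placeholders; only the names `get_links()` and `download_files` come from this proposal):

```python
from pathlib import Path

import requests


def get_links(source: str) -> list[str]:
    """Return just the URLs to download for `source` (hypothetical body).

    Only sources whose index pages must be scraped would touch the network
    here; everything else can return a static list, keeping tests fast.
    """
    if source == "geospatial/gadm_admin_map":
        return ["https://example.org/gadm/gadm41_VNM.gpkg"]  # placeholder URL
    raise ValueError(f"Unknown source: {source}")


def download_files(links: list[str], out_dir: Path = Path("data/sources")) -> list[Path]:
    """Fetch each link into the (customizable) download folder."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in links:
        target = out_dir / url.rsplit("/", 1)[-1]
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        target.write_bytes(response.content)
        paths.append(target)
    return paths
```

A unit test can then intercept the HTTP call with `requests-mock`:

```python
import requests_mock


def test_download_files(tmp_path):
    with requests_mock.Mocker() as m:
        m.get("https://example.org/gadm/gadm41_VNM.gpkg", content=b"fake bytes")
        (path,) = download_files(get_links("geospatial/gadm_admin_map"), out_dir=tmp_path)
    assert path.read_bytes() == b"fake bytes"
```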
B Process Data: Refactor the plotting code out into a separate file, and return DataFrames instead of writing to CSV. Ingestion, whether to CSV/Parquet or to a database, is handled by a separate `ingest()` function that can be tested in a Dockerized environment with Postgres.
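A minimal sketch of what `ingest()` could look like (the signature is an assumption, not a settled API):

```python
from pathlib import Path

import pandas as pd
from sqlalchemy.engine import Engine


def ingest(
    df: pd.DataFrame,
    source: str,
    out_root: Path = Path("data/processed"),
    engine: Engine | None = None,
) -> None:
    """Persist a processed DataFrame to Parquet and, optionally, a database."""
    out_dir = out_root / source
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_dir / "data.parquet")
    if engine is not None:
        # In tests, `engine` would point at a Postgres container, e.g.
        # create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/test")
        df.to_sql(source.replace("/", "_"), engine, if_exists="replace", index=False)
```

Because processing now returns a DataFrame instead of writing CSVs, `ingest()` becomes the single place that touches storage, which is what makes the Dockerized Postgres test possible.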
Schemas: As requirements are fluid, schema validation beyond the tests in the processing code can be deferred. The `pd.DataFrame.to_sql()` function can infer schemas for SQL databases automatically, but it also allows customization by passing a dictionary mapping column names to SQLAlchemy types, so we can add a schemas module for each source later.
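For example (hypothetical columns and table name), the `dtype` argument of `to_sql()` takes exactly such a dictionary:

```python
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import Date, Float, Text

# A schemas module could hold one dictionary like this per source.
EXAMPLE_SCHEMA = {
    "admin_region": Text(),
    "date": Date(),
    "value": Float(),
}

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/test")
df = pd.DataFrame(
    {"admin_region": ["Hanoi"], "date": [pd.Timestamp("2024-01-01")], "value": [3.2]}
)
# Without `dtype`, to_sql() infers the column types; with it, the dict overrides them.
df.to_sql("example_source", engine, if_exists="replace", index=False, dtype=EXAMPLE_SCHEMA)
```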
The following diagram shows a schematic representation of this refactoring. Here `source` is a hierarchical source specifier, such as `geospatial/gadm_admin_map`, corresponding to the Python function name that downloads a particular source. Data will be downloaded under the root download directory (default: `data/sources`) following this hierarchy, and processed files will be kept under the same hierarchy in a separate root folder (default: `data/processed`). Unit tests for `data/processed` will run on either fake data (ideally) or cached data in an AWS S3 bucket.
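In code terms, the mapping sketched in the diagram could look like this (helper name hypothetical):

```python
from pathlib import Path

SOURCES_ROOT = Path("data/sources")      # default download root
PROCESSED_ROOT = Path("data/processed")  # default processed root


def paths_for(source: str) -> tuple[Path, Path]:
    """Map a hierarchical source specifier to its mirrored directories."""
    return SOURCES_ROOT / source, PROCESSED_ROOT / source


raw_dir, processed_dir = paths_for("geospatial/gadm_admin_map")
# raw_dir       == data/sources/geospatial/gadm_admin_map
# processed_dir == data/processed/geospatial/gadm_admin_map
```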
will run on either fake data (ideally) or cached data on a AWS S3 bucket.*post MVP: cache webpages so that
get_links()
unit tests can entirely run without network access.
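One possible shape for that cache (all names here are hypothetical): route the page fetches in `get_links()` through a helper that prefers an on-disk copy.

```python
from pathlib import Path
from urllib.parse import quote

import requests

CACHE_DIR = Path("tests/webpage_cache")  # hypothetical cache location


def fetch_page(url: str, use_cache: bool = True) -> str:
    """Return page HTML, reading from the on-disk cache when available."""
    cached = CACHE_DIR / (quote(url, safe="") + ".html")
    if use_cache and cached.exists():
        return cached.read_text()
    html = requests.get(url, timeout=60).text
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached.write_text(html)
    return html
```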