kraemer-lab / DART-Pipeline

Data analysis pipeline for the Dengue Advanced Readiness Tools (DART) project
https://dart-pipeline.readthedocs.io
MIT License

RFC: Refactoring proposal #97

Closed: abhidg closed this issue 2 months ago

abhidg commented 4 months ago

A Collate Data: Refactor the download section of the code so that the actual downloading of files is pushed up into a separate function (similar to the current download_files), while a new get_links() function returns just the links that need to be downloaded. Also allow customization of the download folder. This way, unit tests of get_links() run much faster, since only sources whose webpages need to be scraped require a network call*. Downloads that require an API (currently only the cdsapi-based sources) would live in a separate file and be tested only when the API fetch code changes. Unit tests of the download function will use the requests-mock adapter so they can run without network access.
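
A rough sketch of what this split could look like, with hypothetical function names and placeholder URLs (nothing here reflects the current codebase):

```python
from pathlib import Path

import requests


def get_links(source: str, **options) -> tuple[str, list[str]]:
    """Return (source, [link1, link2, ...]); scrape a webpage only when necessary."""
    if source == "example/static_source":  # placeholder source specifier
        # URLs follow a fixed pattern, so no network access is needed here
        return source, ["https://example.com/data/file1.csv"]
    raise ValueError(f"unknown source: {source}")


def download(source: str, links: list[str],
             out_root: Path = Path("data/sources")) -> list[Path]:
    """Download each link under a customizable download root."""
    out_dir = out_root.joinpath(*source.split("/"))
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in links:
        path = out_dir / url.rsplit("/", 1)[-1]
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        path.write_bytes(r.content)
        paths.append(path)
    return paths


# pytest unit test using the requests-mock fixture, so no real network access is needed
def test_download(tmp_path, requests_mock):
    requests_mock.get("https://example.com/data/file1.csv", text="a,b\n1,2\n")
    source, links = get_links("example/static_source")
    (path,) = download(source, links, out_root=tmp_path)
    assert path.read_text() == "a,b\n1,2\n"
```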

B Process Data: Refactor the plotting code out into a separate file, and return dataframes instead of writing to CSV. Ingestion (to CSV/Parquet or to a database) is handled by a separate ingest() function, which can be tested in a Dockerized environment with Postgres.
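
A minimal sketch of the proposed separation, again with hypothetical names; plotting would live in its own module and is omitted here:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine


def process(source: str, files: list[Path]) -> tuple[str, pd.DataFrame]:
    """Turn downloaded files into a tidy dataframe; no plotting, no writing to disk."""
    df = pd.concat(pd.read_csv(f) for f in files)  # placeholder transformation
    return source, df


def ingest(source: str, df: pd.DataFrame,
           out_root: Path = Path("data/processed"),
           db_url: str | None = None) -> None:
    """Persist a processed dataframe to Parquet and, optionally, to a database."""
    out_dir = out_root.joinpath(*source.split("/"))
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_dir / "data.parquet")
    if db_url:  # e.g. a Dockerized Postgres started for integration tests
        engine = create_engine(db_url)
        df.to_sql(source.replace("/", "_"), engine,
                  if_exists="replace", index=False)
```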

Schemas: As requirements are still fluid, schema validation beyond the tests in the processing code can be deferred. pd.DataFrame.to_sql() can infer schemas for SQL databases automatically, but it also allows customization by defining a schema as a dictionary mapping column names to SQLAlchemy types, so a per-source schemas module can be added later.
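
For illustration only, a per-source schema could simply be a dictionary passed via the dtype argument of to_sql(); the column names and types below are made up:

```python
from sqlalchemy.types import Date, Float, Integer, Text

# Hypothetical schema for one source; to_sql() infers types when this is omitted
GADM_ADMIN_MAP_SCHEMA = {
    "admin_level": Integer(),
    "region_name": Text(),
    "date": Date(),
    "value": Float(),
}

# df.to_sql("gadm_admin_map", engine, dtype=GADM_ADMIN_MAP_SCHEMA,
#           if_exists="replace", index=False)
```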

The following diagram is a schematic representation of this refactoring. Here source is a hierarchical source specifier such as geospatial/gadm_admin_map, corresponding to the Python function that downloads that particular source. Data will be downloaded under the download root directory (default: data/sources) following this hierarchy, and processed files will be kept under the same hierarchy in a separate root folder (default: data/processed). Unit tests for data/processed will run on either fake data (ideally) or data cached in an AWS S3 bucket.
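
For example, the hierarchy could be mapped to paths by a small helper along these lines (defaults as above; names are illustrative):

```python
from pathlib import Path

SOURCES_ROOT = Path("data/sources")
PROCESSED_ROOT = Path("data/processed")


def source_dir(source: str, root: Path = SOURCES_ROOT) -> Path:
    """Map a hierarchical source specifier to its folder, e.g.
    'geospatial/gadm_admin_map' -> data/sources/geospatial/gadm_admin_map."""
    return root.joinpath(*source.split("/"))
```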

```mermaid
graph TD;
    web([web]);
    DB[(DB)];
    links["(source, [link1, link2, ...])"];
    output["(source, dataframe)"];
    web --> |"get_links(source, options)"| links --> |download| files[[files]] --> |process| output --> |ingest| DB;
    output --> |"plot()"| plots[[plots]];
    api([api]) --> |get_api| files;
```
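
Assuming the hypothetical functions sketched above, the path through the diagram for a single non-API source could be chained roughly as follows:

```python
def run(source: str, db_url: str | None = None) -> None:
    """End-to-end run for one source: scrape links, download, process, ingest.

    Assumes the get_links/download/process/ingest sketches above are importable;
    API-based (cdsapi) sources would instead go through a separate get_api step.
    """
    source, links = get_links(source)
    files = download(source, links)
    source, df = process(source, files)
    ingest(source, df, db_url=db_url)
```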

*Post-MVP: cache webpages so that get_links() unit tests can run entirely without network access.
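
One possible shape for that cache: store scraped pages as test fixtures and replay them through requests-mock (paths and URLs below are purely illustrative):

```python
from pathlib import Path

FIXTURES = Path("tests/fixtures/webpages")  # hypothetical fixture location


def test_get_links_offline(requests_mock):
    # Replay a previously cached listing page instead of hitting the real site;
    # assumes a get_links() along the lines sketched earlier in this proposal.
    html = (FIXTURES / "example_scraped_source.html").read_text()
    requests_mock.get("https://example.com/listing.html", text=html)
    source, links = get_links("example/scraped_source")
    assert links, "expected at least one link parsed from the cached page"
```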