Purpose
Allow users to read OGE output data from Amazon S3. This is particularly useful when we want to use OGE outputs in a separate project. Closes CAR-3681
What the code is doing
Creates a function that selects the OGE data store. It looks for an OGE_DATA_STORE environment variable; if it does not exist, the data store is set to local and data is written/read to/from the open_grid_emissions_data folder located in the user's $HOME (current behavior).
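A minimal sketch of what this selection could look like. The `data_folder` name and the bucket URL come from the usage transcript below; everything else (structure, docstring) is an assumption, not the actual implementation:

```python
import os
from pathlib import Path


def data_folder() -> str:
    """Return the root of the OGE data store.

    Defaults to the local ~/open_grid_emissions_data folder when the
    OGE_DATA_STORE environment variable is unset (current behavior).
    """
    store = os.environ.get("OGE_DATA_STORE", "local")
    if store == "s3":
        # Bucket path as shown in the usage transcript
        return "s3://open-grid-emissions/open_grid_emissions_data/"
    # "local" (or unset): read/write under the user's $HOME
    return str(Path.home() / "open_grid_emissions_data") + "/"
```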
Testing
The feature has been tested by setting the OGE_DATA_STORE environment variable to s3 in a project importing the oge package. A file stored on Amazon S3 was then successfully loaded using pandas' read_csv function.
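The manual test roughly followed this pattern. The file name is a placeholder, and the actual read_csv call is shown only in a comment here because it needs s3fs, network access, and bucket permissions:

```python
import os

# Select the S3 store before importing oge.filepaths, so the store is
# resolved from the environment when paths are built.
os.environ["OGE_DATA_STORE"] = "s3"

# With s3fs installed, pandas hands s3:// URLs to it directly, so the
# same read_csv call works against both the local and the S3 store:
#   import pandas as pd
#   from oge.filepaths import data_folder
#   df = pd.read_csv(data_folder() + "example_output.csv")  # placeholder name
```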
Where to look
Pipfile and environment.yml, where I added the s3fs dependency, as pandas needs it to read files located on Amazon S3. Note that I did not verify that this works with conda. @grgmiller, can you generate a conda environment, install the dependencies, and try to load a CSV file from S3 using pandas' read_csv?
README, where documentation was added for users who want to import oge in their project to fetch OGE output data without first running the pipeline.
the oge.filepaths module where the feature is implemented.
the data_pipeline script, where we ensure that OGE_DATA_STORE is not set to s3.
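The pipeline guard could look roughly like this sketch. The function name and the exact error message are assumptions (the real message appears in the traceback in the usage section); only the intent, rejecting any non-local store before running the write-heavy pipeline, comes from this PR:

```python
import os


def assert_local_data_store() -> None:
    """Hypothetical guard: refuse to run the pipeline unless the data
    store is local, since the pipeline writes its outputs to disk."""
    store = os.environ.get("OGE_DATA_STORE", "local")
    if store != "local":
        raise OSError(
            f"Invalid OGE_DATA_STORE environment variable: {store!r}. "
            "The pipeline only supports the 'local' store."
        )
```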
Usage Example/Visuals
Setting the OGE_DATA_STORE environment variable:
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov 1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ["OGE_DATA_STORE"] = "s3"
>>> from oge.filepaths import data_folder
>>> data_folder()
's3://open-grid-emissions/open_grid_emissions_data/'
>>>
Not setting it:
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov 1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from oge.filepaths import data_folder
>>> data_folder()
'/Users/brdo/open_grid_emissions_data/'
>>>
Trying to run the pipeline with OGE_DATA_STORE set to s3 (or any other non-local value, here 2) raises an OSError:
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ export OGE_DATA_STORE=2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ echo $OGE_DATA_STORE
2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python src/oge/data_pipeline.py --year 2020
Traceback (most recent call last):
File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 642, in <module>
main(sys.argv[1:])
File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 73, in main
raise OSError("Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'")
OSError: Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'
Review estimate
15min
Future work
N/A
Checklist
[x] Update the documentation to reflect changes made in this PR
[x] Format all updated python files using black
[x] Clear outputs from all notebooks modified
[x] Add docstrings and type hints to any new functions created