Purpose
Allow users to read OGE output data from Amazon S3. This is particularly useful when we want to use OGE outputs in a separate project. Closes CAR-3681
What the code is doing
Creates a function that selects the OGE data store. It looks for an OGE_DATA_STORE environment variable; if it does not exist, the data store is set to local and data is written/read to/from the open_grid_emissions_data folder located in the user's $HOME (current behavior).
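A minimal sketch of what this selection could look like. The `data_folder` name and the bucket URL come from the usage transcript below; everything else (structure, docstring) is an assumption, not the actual implementation:

```python
import os
from pathlib import Path


def data_folder() -> str:
    """Return the root of the OGE data store.

    Defaults to the local ~/open_grid_emissions_data folder when the
    OGE_DATA_STORE environment variable is unset (current behavior).
    """
    store = os.environ.get("OGE_DATA_STORE", "local")
    if store == "s3":
        # Bucket path as shown in the usage transcript
        return "s3://open-grid-emissions/open_grid_emissions_data/"
    # "local" (or unset): read/write under the user's $HOME
    return str(Path.home() / "open_grid_emissions_data") + "/"
```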
Testing
The feature has been tested by setting the OGE_DATA_STORE environment variable to s3 in a project importing the oge package. A file stored on Amazon S3 was then successfully loaded using pandas' read_csv function.
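The manual test roughly followed this pattern. The file name is a placeholder, and the actual read_csv call is shown only in a comment here because it needs s3fs, network access, and bucket permissions:

```python
import os

# Select the S3 store before importing oge.filepaths, so the store is
# resolved from the environment when paths are built.
os.environ["OGE_DATA_STORE"] = "s3"

# With s3fs installed, pandas hands s3:// URLs to it directly, so the
# same read_csv call works against both the local and the S3 store:
#   import pandas as pd
#   from oge.filepaths import data_folder
#   df = pd.read_csv(data_folder() + "example_output.csv")  # placeholder name
```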
Where to look
Pipfile and environment.yml, where I added the s3fs dependency, as pandas needs it to read files located on Amazon S3. Note that I did not verify that this works with conda. @grgmiller, can you generate a conda environment, install the dependencies, and try to load a CSV file from S3 using pandas' read_csv?
README, where documentation was added for users who want to import oge in their project to fetch OGE output data without first running the pipeline.
the oge.filepaths module where the feature is implemented.
the data_pipeline script, where we ensure that OGE_DATA_STORE is not set to s3.
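The pipeline guard could look roughly like this sketch. The function name and the exact error message are assumptions (the real message appears in the traceback in the usage section); only the intent, rejecting any non-local store before running the write-heavy pipeline, comes from this PR:

```python
import os


def assert_local_data_store() -> None:
    """Hypothetical guard: refuse to run the pipeline unless the data
    store is local, since the pipeline writes its outputs to disk."""
    store = os.environ.get("OGE_DATA_STORE", "local")
    if store != "local":
        raise OSError(
            f"Invalid OGE_DATA_STORE environment variable: {store!r}. "
            "The pipeline only supports the 'local' store."
        )
```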
Usage Example/Visuals
Setting the OGE_DATA_STORE environment variable:
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov 1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ["OGE_DATA_STORE"] = "s3"
>>> from oge.filepaths import data_folder
>>> data_folder()
's3://open-grid-emissions/open_grid_emissions_data/'
>>>
Not setting it:
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov 1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from oge.filepaths import data_folder
>>> data_folder()
'/Users/brdo/open_grid_emissions_data/'
>>>
Trying to run the pipeline with OGE_DATA_STORE set to s3 (or any other non-local value, here 2) raises an OSError:
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ export OGE_DATA_STORE=2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ echo $OGE_DATA_STORE
2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python src/oge/data_pipeline.py --year 2020
Traceback (most recent call last):
File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 642, in <module>
main(sys.argv[1:])
File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 73, in main
raise OSError("Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'")
OSError: Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'
Review estimate
15min
Future work
N/A
Checklist
[x] Update the documentation to reflect changes made in this PR
[x] Format all updated python files using black
[x] Clear outputs from all notebooks modified
[x] Add docstrings and type hints to any new functions created