singularity-energy / open-grid-emissions

Tools for producing high-quality hourly generation and emissions data for U.S. electric grids
MIT License

Package project #323

Closed rouille closed 6 months ago

rouille commented 6 months ago

Purpose

Package project so it can be imported from anywhere.

What the code is doing

No code

Testing

Run whole pipeline successfully

Where to look

The most important file is pyproject.toml, which holds all the configuration for building the package: the build backend, the package metadata, and the inclusion of data files in the source distribution.

Usage Example/Visuals

Create an empty virtual environment setting python version to 3.11. In the virtual environment:

pip install build
python -m build

You should see the following output:

* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (hatchling)
* Getting build dependencies for sdist...
* Building sdist...
* Building wheel from sdist
* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (hatchling)
* Getting build dependencies for wheel...
* Building wheel...
Successfully built oge-0.2.2.tar.gz and oge-0.2.2-py3-none-any.whl

along with a newly created dist directory containing two files: oge-0.2.2.tar.gz and oge-0.2.2-py3-none-any.whl. Then, in the root directory:

pip install .

This installs the oge package along with all its dependencies (visible with pip list). You can then run the whole pipeline from anywhere within the virtual environment via:

>>> from oge import data_pipeline
>>> data_pipeline.main(["year=2021"])

The data can be accessed in the package directory, e.g., .virtualenvs/package/lib/python3.11/site-packages/data/ in my case.
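To find the equivalent directory on another machine, a minimal sketch along these lines could help. It assumes the data folder is installed directly under site-packages, as in the path above; the helper name installed_data_dir is made up for this example.

```python
import sysconfig
from pathlib import Path


def installed_data_dir(folder="data"):
    """Return the expected location of a top-level folder in site-packages.

    Assumes the folder sits directly under the environment's pure-Python
    site-packages directory, which is what the install above produced.
    """
    site_packages = Path(sysconfig.get_paths()["purelib"])
    return site_packages / folder


data_dir = installed_data_dir()
print(data_dir, "(exists)" if data_dir.is_dir() else "(not found)")
```

Running it inside the virtual environment prints the candidate path and whether it is actually there.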

Review estimate

30min

Future work

Upload the distribution archives to PyPI. The first step could be to upload to TestPyPI, which is a separate instance of the package index intended for testing and experimentation.


grgmiller commented 6 months ago

Thanks Ben! One question: if we wanted to access some of the manual data files in OGE from other repos (for example, data/manual/ba_reference.csv), would there be a way to do that with how this is currently packaged? Would these data files have to live in src? Or could you just run oge.load_data.load_ba_reference() from the other repo and it would load the table?

rouille commented 6 months ago

When you install the package, the data and config folders (and the files within) are available and can be accessed in site-packages (.virtualenvs/package/lib/python3.11/site-packages/data/ in my case). Also, any module and the functions therein can access the data (I was able to run the pipeline). So:

>>> from oge.load_data import load_ba_reference
>>> ba = load_ba_reference()
>>> ba
    ba_code                                ba_name    ba_category timezone_reporting_eia930 timezone_local us_ba activation_date retirement_date                        ba_name_ferc activation_date_ferc retirement_date_ferc      source  ba_number
0      AEBN                       AESC, LLC - AEBN            NaN                       NaN            NaN   Yes      2004-01-01      2006-07-01                                 NaN                  NaN                  NaN        FERC          1
1       AEC          PowerSouth Energy Cooperative            NaN                US/Central     US/Central   Yes      2004-01-01      2021-09-01  Alabama Electric Cooperative, Inc.                  NaN                  NaN  EIA & FERC          2
2      AECI  Associated Electric Cooperative, Inc.            NaN                US/Central     US/Central   Yes      2004-01-01             NaT                                 NaN                  NaN                  NaN  EIA & FERC          3
3      AEGL                    AESC, LLC - Gleason            NaN                       NaN            NaN   Yes      2004-01-01      2006-07-01                                 NaN                  NaN                  NaN        FERC          4
4      AELC             AESC, LLC - Lincoln Center            NaN                       NaN            NaN   Yes      2004-01-01      2006-07-01                                 NaN                  NaN                  NaN        FERC          5
..      ...                                    ...            ...                       ...            ...   ...             ...             ...                                 ...                  ...                  ...         ...        ...
235    TXMS         No Balancing Authority - Texas  miscellaneous                US/Central     US/Central   Yes             NaT             NaT                                 NaN                  NaN                  NaN         NaN        948
236    UTMS          No Balancing Authority - Utah  miscellaneous               US/Mountain    US/Mountain   Yes             NaT             NaT                                 NaN                  NaN                  NaN         NaN        949
237    WAMS    No Balancing Authority - Washington  miscellaneous                US/Pacific     US/Pacific   Yes             NaT             NaT                                 NaN                  NaN                  NaN         NaN        953
238    WIMS     No Balancing Authority - Wisconsin  miscellaneous                US/Central     US/Central   Yes             NaT             NaT                                 NaN                  NaN                  NaN         NaN        955
239     NaN                 No Balancing Authority  miscellaneous                       NaN            NaN   Yes             NaT             NaT                                 NaN                  NaN                  NaN         NaN        999

[240 rows x 13 columns]

This can be done from anywhere.

rouille commented 6 months ago

This looks good. Before I approve/merge, I have a couple of questions for you:

1. Now that we've changed the import statements to use absolute references (e.g. `oge.load_data`), will these import statements continue to work if you just clone the repo locally and run functions/notebooks without installing the package? Or will you have to install the package to continue using it locally?

I am currently running the data pipeline as a script from the src/oge folder: python data_pipeline.py --year 2021. So far so good.

2. Do we need to update the import statements in all of the notebooks so that they continue to work?

Thanks for pointing this out. I need to update the import statements in the notebooks.

3. What are any next steps that will need to happen before oge can be installed in another repo? You mentioned something about PyPi? Will this branch need to be merged into main before oge can be packaged?

I need to double-check that oge is not taken in the registry. I already checked and it was available. Then we can upload the package to PyPI. You should create an account there. I believe this can be done from any branch, but I recommend that in the future we create a GitHub workflow (deploy.yml) that automatically uploads the package each time a new release is created.
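For the registry check mentioned above, one possible approach is to query PyPI's JSON endpoint, which responds with 404 when no project is published under a name. This is only a sketch: the function names are made up for this example, and a 404 does not guarantee that PyPI will accept the name at upload time.

```python
import urllib.error
import urllib.request


def pypi_status(name, opener=urllib.request.urlopen):
    """Return the HTTP status of PyPI's JSON endpoint for a project name.

    `opener` is injectable so the function can be exercised without
    network access; by default it performs a real request.
    """
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # PyPI answers 404 when no project exists under this name
        return err.code


def looks_unregistered(name, opener=urllib.request.urlopen):
    """True if PyPI reports no published project with this name."""
    return pypi_status(name, opener) == 404


# Example (requires network access):
# print(looks_unregistered("oge"))
```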

grgmiller commented 6 months ago

Next Steps

grgmiller commented 6 months ago

So I successfully installed oge in a virtual environment, although I'm running into an issue with the filepaths.

For me, oge is installed at C:\Users\greg.miller\.pyenv\pyenv-win\versions\3.11.4\Lib\site-packages\oge and the data folder is installed at C:\Users\greg.miller\.pyenv\pyenv-win\versions\3.11.4\Lib\site-packages\data.

This means that relative to the oge directory, the data folder is ../data. However, in filepaths, the top_folder is ../../../, which means that if I run a command to load data, it looks in C:/Users/greg.miller/.pyenv/pyenv-win/versions/3.11.4/Lib/data and raises a FileNotFoundError.

It seems like a simple fix would be to just change "../../../" in filepaths.top_folder() to "../", but I'm not sure why you didn't run into this error on your machine @Ben?
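The lookup reported above can be reproduced with a small sketch. It assumes that top_folder appends the "../../../" prefix to the path of the filepaths module file itself. That is a guess about the implementation, chosen because it matches the Lib/data path in the error, and the repository path below is hypothetical.

```python
import posixpath


def resolve_from_module(module_file, relative_path, prefix="../../../"):
    """Join `prefix` and `relative_path` onto the module's own file path.

    Assumed behaviour of filepaths.top_folder(); the real function may
    build the path differently.
    """
    return posixpath.normpath(posixpath.join(module_file, prefix, relative_path))


# Hypothetical location of a cloned repository
repo_file = "/home/user/open-grid-emissions/src/oge/filepaths.py"
# Installed location reported in this comment
installed_file = (
    "C:/Users/greg.miller/.pyenv/pyenv-win/versions/3.11.4"
    "/Lib/site-packages/oge/filepaths.py"
)

print(resolve_from_module(repo_file, "data"))
# -> /home/user/open-grid-emissions/data
print(resolve_from_module(installed_file, "data"))
# -> C:/Users/greg.miller/.pyenv/pyenv-win/versions/3.11.4/Lib/data
```

Under this assumption the same prefix reaches the repository root in a clone but overshoots site-packages after installation. Passing a different prefix argument shows where other candidates would land.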

I might also propose that if this data folder is going to live in site-packages, we may want to rename the directory to oge_data so that it is more obvious that it is related to this project and easier to find. We may also want to just move the data folder to src/oge/data (see analog of how pudl does this: https://github.com/catalyst-cooperative/pudl/tree/main/src/pudl/package_data).
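If the data folder did move inside the package, the files could be located through importlib.resources rather than through relative paths. The sketch below shows the idea with a standard-library package as a stand-in, since it cannot assume oge is installed. The helper name package_file and the oge path in the comment are hypothetical.

```python
from importlib.resources import files


def package_file(package, relative_path):
    """Return a handle to a file shipped inside an importable package."""
    return files(package).joinpath(relative_path)


# Stand-in demonstration using a file that ships with the standard library
demo = package_file("json", "decoder.py")
print(demo.is_file())

# With data moved to src/oge/data (hypothetical layout), the analogous call
# would be something like:
# ba_path = package_file("oge", "data/manual/ba_reference.csv")
```

Because the lookup starts from the imported package, it does not depend on how many directory levels separate the module from the environment root.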

I'm also noticing that eia930.py includes the following line: os.environ["GRIDEMISSIONS_CONFIG_FILE_PATH"] = top_folder("config/gridemissions.json"). This config folder lives outside of src, so I also get a FileNotFoundError when I try to run from oge import eia930.

Proposed fixes: