jdechalendar / gridemissions

Tools for power sector emissions tracking
MIT License

Missing data is represented as approx 1.0 #8

Closed: gailin-p closed this issue 1 year ago

gailin-p commented 2 years ago

During initialization of BaDataPyoCleaningModel (line 408 in clean.py), all generation source/region combinations are set to a default value of 1.0. This includes many source/region combinations that do not exist, for example, nuclear generation in small western BAs. It may also include periods where generation data is missing.

After running physics-based cleaning, there are approximately 5 million values of approximately 1.0, resulting from the default values added during initialization. In most cases, the actual generation from those sources is zero.

Can missing source/region combinations be left out of the physics-based cleaning to avoid adding columns where there is no actual generation?

jdechalendar commented 1 year ago

Is your concern computational efficiency or accuracy of the results?

If accuracy: The value of 1.0 was chosen as a "very small" value in the context of the numerical data being used. As you correctly pointed out, these values should stay small after cleaning (provided the data initially supplied to the algorithm are reasonable), so you can still easily identify them as missing once the cleaning job finishes. If you throw away data below, say, 2.0 after physics-based cleaning, most of these data points will disappear, and you will be able to remove many of these columns because they will have no data.
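A minimal sketch of that post-cleaning filter, in plain Python. The column names (`BA1_NG_SUN`, `BA1_NG_NUC`), the dict-of-lists layout, and the helper `drop_placeholders` are hypothetical and not part of gridemissions; only the 2.0 threshold comes from the suggestion above.

```python
from typing import Dict, List, Optional

# Hypothetical layout: column name -> hourly values, None meaning "missing".
Series = List[Optional[float]]


def drop_placeholders(data: Dict[str, Series], threshold: float = 2.0) -> Dict[str, Series]:
    """Mask values below `threshold` as missing, then drop all-missing columns."""
    masked = {
        col: [v if v is not None and v >= threshold else None for v in values]
        for col, values in data.items()
    }
    # Keep only columns that still hold at least one real value.
    return {
        col: values
        for col, values in masked.items()
        if any(v is not None for v in values)
    }


# Made-up cleaned output: one real source, one placeholder-only source.
cleaned = {
    "BA1_NG_SUN": [120.5, 1.0003, 98.2],
    "BA1_NG_NUC": [1.0001, 0.9998, 1.0002],
}
result = drop_placeholders(cleaned)
```

With a pandas DataFrame `df`, the same idea can be written as `df.where(df >= 2.0).dropna(axis=1, how="all")`. Note that any genuine generation below the threshold is masked as well, so the threshold should be chosen with the scale of the data in mind.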

If computational efficiency: You are correct that having fewer columns would be more efficient, but that was not the main priority when this code was initially written. Expanding the data structure so that all source/region combinations exist made it easier to write the optimization program. That said, it should not be very difficult to modify the optimization program code to check whether a combination exists before including it.
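One way such an existence check could look, as a rough sketch rather than the actual change to clean.py: build the list of (region, source) pairs that are really present in the input columns, and index the optimization variables over that list instead of the full cross product. The column naming scheme (`<region>_NG_<source>`) and the helper `existing_combinations` are assumptions made for illustration.

```python
from itertools import product
from typing import Iterable, List, Tuple


def existing_combinations(
    columns: Iterable[str],
    regions: Iterable[str],
    sources: Iterable[str],
    sep: str = "_NG_",  # hypothetical separator between region and source
) -> List[Tuple[str, str]]:
    """Return only the (region, source) pairs that have a column in the data."""
    present = set(columns)
    return [
        (region, source)
        for region, source in product(regions, sources)
        if f"{region}{sep}{source}" in present
    ]


# Made-up input columns: BA1 has solar and wind, BA2 has nuclear only.
columns = ["BA1_NG_SUN", "BA1_NG_WND", "BA2_NG_NUC"]
pairs = existing_combinations(columns, ["BA1", "BA2"], ["SUN", "WND", "NUC"])
```

Here `pairs` holds three combinations rather than the six in the full cross product, so no placeholder columns would need to be created for the missing ones.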