pvlib / pvlib-python

A set of documented functions for simulating the performance of photovoltaic energy systems.
https://pvlib-python.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.19k stars 997 forks source link

Clearsky functions handling tz-aware datetimes as tz-naive, contrary to user guide #2054

Closed yhkee0404 closed 1 month ago

yhkee0404 commented 5 months ago

Describe the bug Both of the functions clearsky.lookup_linke_turbidity and irradiance.get_extra_radiation are supposed to be tz-aware in accordance with the user guide, but take the tz-aware inputs as tz-naive in their implementations and return incorrect results. Another functions using these two functions may also have been affected by the errors; i.e. Location.get_clearsky, clearsky.ineichen, or clearsky.simplified_solis.

To Reproduce Refer to my comprehensive analysis on the .ipynb file simulated in the same way as the user guide. It seems too big for the browser to render because it shows me only this message: An error occurred. In a similar situation, please download the file or clone my repository to see the content.

Expected behavior There should be no items searched on the file by the keyword rows, each of which means errors subject to timezones.

Screenshots For instance, on the user guide, it prints: climatological linke_turbidity = 3.329496820286459. However, after converting the local timezone to utc, it prints on my analysis above another value: climatological linke_turbidity = 3.3247187176482633.

Versions:

Additional context Almost the same as #237

cwhanse commented 5 months ago

I'm not quite following. You write "Two functions clearsky.lookup_linke_turbidity and irradiance.get_extra_radiation is tz-naive but has been used as if they were tz-aware" but in the clear-sky user guide, the times input is localized

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='1min', tz=tus.tz)
yhkee0404 commented 5 months ago

Hi, @cwhanse Thank you for your attention.

Yes, that localized times is a tz-aware input which is not a problem, but when it is given to the two functions they return unexpected results and this is the problem. That's why I said the functions are not tz-aware but has been misused as if they were tz-aware, and we are supposed to fix them to be tz-aware.

For instance, try converting the timezone of the times input using a common tz-aware function tz_convert and you can see the difference. This example was copied from the .ipynb file I mentioned above:

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='1min', tz=tus.tz)

cs = tus.get_clearsky(times)  # ineichen with climatology table by default
cs_utc = tus.get_clearsky(times.tz_convert('utc')).tz_convert(tus.tz)  # ineichen with climatology table by default

assert cs.index.equals(cs_utc.index)
assert not ((cs < 0).any().any() or (cs_utc < 0).any().any())
print(len(cs))
df = cs.join(cs_utc, how = 'outer', rsuffix = '_utc')[~ cs.fillna(-1).eq(cs_utc.fillna(-1)).all(axis = 'columns')]
print(df.index.hour.unique(), df.index.minute.nunique())
df
4321
Index([17, 18, 19], dtype='int32') 60
                                  ghi         dni        dhi     ghi_utc  \
2016-07-01 17:00:00-07:00  460.378093  778.936621  70.469943  460.055525   
2016-07-01 17:01:00-07:00  456.733517  777.196738  70.143925  456.411549   
2016-07-01 17:02:00-07:00  453.083990  775.435849  69.816605  452.762630   
2016-07-01 17:03:00-07:00  449.429616  773.653621  69.487974  449.108869   
2016-07-01 17:04:00-07:00  445.770496  771.849713  69.158023  445.450370   
...                               ...         ...        ...         ...   
2016-07-03 19:28:00-07:00    0.372376    6.479330   0.286308    0.368759   
2016-07-03 19:29:00-07:00    0.232132    4.744294   0.182824    0.229742   
2016-07-03 19:30:00-07:00    0.131663    3.388042   0.106134    0.130224   
2016-07-03 19:31:00-07:00    0.063133    2.356295   0.052039    0.062400   
2016-07-03 19:32:00-07:00    0.019268    1.593766   0.016222    0.019030   

                              dni_utc    dhi_utc  
2016-07-01 17:00:00-07:00  777.658658  70.787077  
2016-07-01 17:01:00-07:00  775.913720  70.460150  
2016-07-01 17:02:00-07:00  774.147732  70.131910  
2016-07-01 17:03:00-07:00  772.360361  69.802348  
2016-07-01 17:04:00-07:00  770.551267  69.471455  
...                               ...        ...  
2016-07-03 19:28:00-07:00    6.330446   0.284669  
2016-07-03 19:29:00-07:00    4.628755   0.181634  
2016-07-03 19:30:00-07:00    3.300508   0.105355  
2016-07-03 19:31:00-07:00    2.291654   0.051610  
2016-07-03 19:32:00-07:00    1.547308   0.016073  

[459 rows x 6 columns]

Does that difference conform to an expected behavior? I think cs_utc is the more accurate result than cs on the user guide.

On the other hand, you can also see on the user guide another times input as below which is tz-naive, as well as the tz-aware or localized one you mentioned. Both tz-naive and tz-aware times inputs are used without distiction on the user guide, but it is highly likely that they have been misused because merely preprocessing the inputs by the well-known tz-aware function tz_convert makes the functions result another values.

Another example from my analysis above:

# times = pd.date_range(start='2015-01-01', end='2016-01-01', freq='1D')
times_h = pd.date_range(start='2015-01-01', end='2016-01-01', freq='1h')
times_h_kst = pd.date_range(start='2015-01-01', end='2016-01-01', freq='1h').tz_localize('utc').tz_convert('Asia/Seoul')

# sites = [(32, -111, 'Tucson1'), (32.2, -110.9, 'Tucson2'),
#          (33.5, -112.1, 'Phoenix'), (35.1, -106.6, 'Albuquerque')]
lat, lon, name = 35.1, -106.6, 'Albuquerque'

# for lat, lon, name in sites:
turbidity_h = pvlib.clearsky.lookup_linke_turbidity(times_h, lat, lon, interp_turbidity=False)
turbidity_h_kst = pvlib.clearsky.lookup_linke_turbidity(times_h_kst, lat, lon, interp_turbidity=False).tz_convert(None)

assert turbidity_h.index.equals(turbidity_h_kst.index)
assert not ((turbidity_h < 0).any() or (turbidity_h_kst < 0).any())
print(len(turbidity_h))
df = pd.concat([turbidity_h, turbidity_h_kst], axis = 'columns', join = 'outer')[~ turbidity_h.fillna(-1).eq(turbidity_h_kst.fillna(-1))]
print(df.index.hour.unique(), df.index.minute.nunique())
df
8761
Index([15, 16, 17, 18, 19, 20, 21, 22, 23], dtype='int32') 1
                        0     1
2015-01-31 15:00:00  2.05  1.95
2015-01-31 16:00:00  2.05  1.95
2015-01-31 17:00:00  2.05  1.95
2015-01-31 18:00:00  2.05  1.95
2015-01-31 19:00:00  2.05  1.95
...                   ...   ...
2015-12-31 19:00:00  2.25  2.05
2015-12-31 20:00:00  2.25  2.05
2015-12-31 21:00:00  2.25  2.05
2015-12-31 22:00:00  2.25  2.05
2015-12-31 23:00:00  2.25  2.05

[99 rows x 2 columns]

Is this an expected behavior? Isn't turbidity_h right and turbidity_h_kst wrong? I think ideally the results should be the same as turbidity_h, and a different turbidity_h_kst is a bug, just like #237 mentioned above.

To summarize, the user guide has considered those functions as if tz-aware, where any tz_convert of the input may not change their results. However, you can easily find it does indeed as in my .ipynb file, which means the functions are tz-naive against our expectations and should be fixed into tz-aware like in #2055.

yhkee0404 commented 5 months ago

Sorry for my poor English. I've thoroughly revised my words. Could you please review them again? @cwhanse

cwhanse commented 5 months ago

@yhkee0404 aha, I think I partly understand what's going on here. The Linke turbidity data distributed with pvlib are gridded (location), monthly values. These values (from an outside source) do not have datetime stamps, only a month. pvlib (somewhat arbitrarily) assigns these values to the middle day of each month so that a daily value can be determined by interpolation. That machinery is not shown to the user; it happens here: https://github.com/pvlib/pvlib-python/blob/d53f97e984bfdd268aa92f8bf482ced0edda0110/pvlib/clearsky.py#L228

The examples you provide are comparing the interpolated turbidity at the same location but localized to different timezones. The interpolation is being done using time.dayofyear. It looks to me that the dayofyear function may operating in local time, because the day of year is changed at midnight local, not midnight UTC.

Arguably, pandas had to make a decision how to return a day of year for each location, so I don't think this is a fault in pandas - its behaving as intended (I assume).

So: if we want Linke turbidity at a location to be consistent with equal timestamps, we need to work under the hood in pvlib to fix the interpolation.

import numpy as np
import pandas as pd
from pvlib.location import Location

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='30min', tz=tus.tz)
times_utc = times.tz_convert('utc')

doy = times.dayofyear
doy_utc = times_utc.dayofyear

td = np.append(0, np.diff(times.dayofyear))  # add a zero so it's the right length
print(times[td > 0])

td_utc = np.append(0, np.diff(times_utc.dayofyear))  # add a zero so it's the right length
print(times_utc[td_utc > 0])
yhkee0404 commented 5 months ago

Thank you for your clear explanation! Could I add some details to your example?

  1. The inconsistent results of clearsky.lookup_linke_turbidity for each localized datetime depend not only on interp_turbidity=True and pandas.DatetimeIndex.dayofyear, but also on pandas.DatetimeIndex.month, regardless of interpolation. No matter why, the fundamental reason remains the same as you identified: the values are "changed at midnight local, not midnight UTC." Please note the influence of interp_turbidity and time.month:

https://github.com/pvlib/pvlib-python/blob/d53f97e984bfdd268aa92f8bf482ced0edda0110/pvlib/clearsky.py#L201-L205

  1. The variability in irradiance.get_extra_radiation values results from pandas.DatetimeIndex.dayofyear, pandas.DatetimeIndex.month, and pandas.DatetimeIndex.year. The root cause, again, is the localized definition of midnight. Refer to #2055 for more details. Or, you can start by examining the following code and trace through irradiance._handle_extra_radiation_types and tools._pandas_to_doy to identify pandas.DatetimeIndex.dayofyear, then follow through to solarposition.pyephem_earthsun_distance which uses ephem.Date assuming UTC, and finally to solarposition.nrel_earthsun_distance which references pandas.DatetimeIndex.year and pandas.DatetimeIndex.month.

https://github.com/pvlib/pvlib-python/blob/d53f97e984bfdd268aa92f8bf482ced0edda0110/pvlib/irradiance.py#L90-L111

  1. The assumed timezones for both functions seem to be UTC. As you mentioned, for clearsky.lookup_linke_turbidity "these values (from an outside source) do not have datetime stamps, only a month." Therefore, synchronizing input datetimes with the timezone distinguishing each day in the h5 file is key to resolving the timezone-related uncertainties in Linke turbidity.

https://github.com/pvlib/pvlib-python/blob/d53f97e984bfdd268aa92f8bf482ced0edda0110/pvlib/clearsky.py#L174-L193

Fortunately, we can experimentally confirm that both functions use UTC for dividing days without needing additional references. The following code provides evidence for irradiance.get_extra_radiation. Only assuming UTC makes the returned values unique for every day; localizing to any other timezone is awkward because two distinct values exist per each localized day then.

import pandas as pd
from pvlib.location import Location
import pvlib

tus = Location(32.2, -111, 'US/Arizona', 700, 'Tucson')

times = pd.date_range(start='2016-07-01', end='2016-07-04', freq='30min', tz=tus.tz)
times_utc = times.tz_convert('utc')

dni_extra = pvlib.irradiance.get_extra_radiation(times)
dni_extra_utc = pvlib.irradiance.get_extra_radiation(times_utc).tz_convert(tus.tz)

assert dni_extra.index.equals(dni_extra_utc.index)
assert not ((dni_extra < 0).any() or (dni_extra_utc < 0).any())
print(len(dni_extra))
df = pd.concat([dni_extra, dni_extra_utc], axis = 'columns', join = 'outer')[~ dni_extra.fillna(-1).eq(dni_extra_utc.fillna(-1))]
print(df.index.hour.unique(), df.index.minute.nunique())
print(dni_extra_utc.tz_convert('UTC').groupby(by=lambda x: x.day).unique())
df
145
Index([17, 18, 19, 20, 21, 22, 23], dtype='int32') 2
1    [1320.4980145041654]
2    [1320.4715353124743]
3    [1320.4577467900974]
4    [1320.4566508404746]
dtype: object
                                     0            1
2016-07-01 17:00:00-07:00  1320.498015  1320.471535
2016-07-01 17:30:00-07:00  1320.498015  1320.471535
2016-07-01 18:00:00-07:00  1320.498015  1320.471535
2016-07-01 18:30:00-07:00  1320.498015  1320.471535
2016-07-01 19:00:00-07:00  1320.498015  1320.471535
2016-07-01 19:30:00-07:00  1320.498015  1320.471535
2016-07-01 20:00:00-07:00  1320.498015  1320.471535
2016-07-01 20:30:00-07:00  1320.498015  1320.471535
2016-07-01 21:00:00-07:00  1320.498015  1320.471535
2016-07-01 21:30:00-07:00  1320.498015  1320.471535
2016-07-01 22:00:00-07:00  1320.498015  1320.471535
2016-07-01 22:30:00-07:00  1320.498015  1320.471535
2016-07-01 23:00:00-07:00  1320.498015  1320.471535
2016-07-01 23:30:00-07:00  1320.498015  1320.471535
2016-07-02 17:00:00-07:00  1320.471535  1320.457747
2016-07-02 17:30:00-07:00  1320.471535  1320.457747
2016-07-02 18:00:00-07:00  1320.471535  1320.457747
2016-07-02 18:30:00-07:00  1320.471535  1320.457747
2016-07-02 19:00:00-07:00  1320.471535  1320.457747
2016-07-02 19:30:00-07:00  1320.471535  1320.457747
2016-07-02 20:00:00-07:00  1320.471535  1320.457747
2016-07-02 20:30:00-07:00  1320.471535  1320.457747
2016-07-02 21:00:00-07:00  1320.471535  1320.457747
2016-07-02 21:30:00-07:00  1320.471535  1320.457747
2016-07-02 22:00:00-07:00  1320.471535  1320.457747
2016-07-02 22:30:00-07:00  1320.471535  1320.457747
2016-07-02 23:00:00-07:00  1320.471535  1320.457747
2016-07-02 23:30:00-07:00  1320.471535  1320.457747
2016-07-03 17:00:00-07:00  1320.457747  1320.456651
2016-07-03 17:30:00-07:00  1320.457747  1320.456651
2016-07-03 18:00:00-07:00  1320.457747  1320.456651
2016-07-03 18:30:00-07:00  1320.457747  1320.456651
2016-07-03 19:00:00-07:00  1320.457747  1320.456651
2016-07-03 19:30:00-07:00  1320.457747  1320.456651
2016-07-03 20:00:00-07:00  1320.457747  1320.456651
2016-07-03 20:30:00-07:00  1320.457747  1320.456651
2016-07-03 21:00:00-07:00  1320.457747  1320.456651
2016-07-03 21:30:00-07:00  1320.457747  1320.456651
2016-07-03 22:00:00-07:00  1320.457747  1320.456651
2016-07-03 22:30:00-07:00  1320.457747  1320.456651
2016-07-03 23:00:00-07:00  1320.457747  1320.456651
2016-07-03 23:30:00-07:00  1320.457747  1320.456651