pvlib / pvlib-python

A set of documented functions for simulating the performance of photovoltaic energy systems.
https://pvlib-python.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Data structure(s) for PV measurement datasets such as I-V curves #469

Closed markcampanelli closed 6 years ago

markcampanelli commented 6 years ago

I'm looking for input on using/creating a "standard" data structure to store PV measurement datasets, such as I-V curves supporting IEC 61853-1. I'm thinking of something flexible/extensible and self-documenting (esp. w.r.t. units). This space also seems to intersect with time-series of I-V curves and maybe PECOS workflows.

For example, I have a collection of I-V-F-T curves, each with possibly varying numbers of points, that are each taken at a "nominal" matrix of effective irradiance F = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.1 (unitless) and temperature T = 15, 25, 50 degC. Sticking to just python and numpy (pandas doesn't seem like the right fit here), I came up with this dict-based structure:

data = {
('0.1', '15 degC'): {'v_V': numpy.array([v_1, v_2, ..., v_M]), 'i_A': numpy.array([i_1, i_2, ..., i_M]), 'f': numpy.array([f_1, f_2, ..., f_M]), 't_degC': numpy.array([t_1, t_2, ..., t_M])},
('0.2', '15 degC'): {'v_V': numpy.array([v_1, v_2, ..., v_N]), 'i_A': numpy.array([i_1, i_2, ..., i_N]), 'f': numpy.array([f_1, f_2, ..., f_N]), 't_degC': numpy.array([t_1, t_2, ..., t_N])},
...
}

In this case, I could retrieve the currents vector for a particular curve using data[('0.2', '15 degC')]['i_A']. I also need to concatenate (in a consistent order) all the currents, voltages, etc. from all the curves together. One could also imagine repeated I-V-F-T curve measurements at each nominal setting (with possibly a different number of points in each repetition).

The ordered-pair keys can also be sorted in various ways using sorted(), as long as the chosen strings don't cause ordering problems. Note that replacing the keys with timestamps would produce time-series I-V-F-T curve data.
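For instance, the concatenation step might look like this (a minimal sketch with two made-up two-point curves; only the `v_V`/`i_A` channels are shown):

```python
import numpy as np

# Two made-up I-V curves keyed by (nominal F, nominal T), as in the
# dict-based structure above.
data = {
    ('0.1', '15 degC'): {'v_V': np.array([0.0, 0.5]), 'i_A': np.array([1.0, 0.9])},
    ('0.2', '15 degC'): {'v_V': np.array([0.0, 0.6]), 'i_A': np.array([2.0, 1.8])},
}

# Concatenate all currents in a consistent (sorted-key) order.
keys = sorted(data)
i_all = np.concatenate([data[k]['i_A'] for k in keys])
```

This works even when curves have different numbers of points, since each channel is concatenated per-curve.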

cwhanse commented 6 years ago

@thunderfish24 I support this effort. FYI - we're planning a more formal taxonomy development as part of Orange Button that has these data within its scope, but that's a year out. Some thoughts about python structure:

adriesse commented 6 years ago

Xarray might work, but I would suggest a simple 2D table (in pandas) with columns such as:

You can then use query to extract subsets and/or reindex/pivot/unstack to rearrange the data for analysis. I tend to see units as meta data.

(The suggestion above is for internal python storage/data structures to facilitate analysis, which the question seems to be focusing on. External/file storage options would ideally not be python-specific.)
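A sketch of that long-format layout (the column names here are invented for illustration, not an agreed convention — one row per measured point, with the nominal conditions as ordinary columns and units kept in metadata):

```python
import pandas as pd

# Hypothetical long-format table of I-V points (made-up values).
df = pd.DataFrame({
    'f_nom': [0.1, 0.1, 0.2, 0.2],   # nominal effective irradiance
    't_nom': [15, 15, 15, 15],       # nominal temperature
    'v':     [0.0, 0.5, 0.0, 0.6],   # measured voltage
    'i':     [1.0, 0.9, 2.0, 1.8],   # measured current
})

# Extract one curve with query; pivot/unstack later if needed.
curve = df.query('f_nom == 0.2 and t_nom == 15')
```

Ragged curves are unproblematic here, since a curve is simply however many rows match its conditions.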

wholmgren commented 6 years ago

netCDF might be a good option here. This may be complicated enough and have enough different use cases that it's worth adopting a standard, well-supported framework instead of developing our own.

In some of our research projects, trying to apply our half-baked data models to the netCDF format revealed gaps in how we were thinking about the problem.

You could wrap up the netCDF data with a custom class, if desired.
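As a rough illustration of the idea (a sketch using xarray, which round-trips to netCDF; the variable and attribute names are invented here, not an established convention):

```python
import numpy as np
import xarray as xr

# One made-up I-V curve as a Dataset with CF-style units attributes.
ds = xr.Dataset(
    {'current': ('point', np.array([1.0, 0.9, 0.0]), {'units': 'A'}),
     'voltage': ('point', np.array([0.0, 0.3, 0.6]), {'units': 'V'})},
    attrs={'effective_irradiance': 0.2, 'temperature_degC': 15},
)
# ds.to_netcdf('curve.nc')  # writes the netCDF file format
```

The units live in metadata rather than in column names, which is the main appeal over the dict-based structure.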

markcampanelli commented 6 years ago

Thanks everyone for the thoughtful responses!

After looking a bit further, I would say that NetCDF has the most potential and the best "alignment" to this particular dataset that I have in mind. (pandas' Dataframe still seems less well aligned to me, but it's probably doable.) With netCDF, the data formatting and storage would be more standardized than what I suggested, and I agree that units are best kept in the metadata and out of the data's indexes. It seems that the standardization part with netCDF involves coming up with "conventions", analogous to the CF conventions and its units, that cover the various PV performance measurement/monitoring cases. (Perhaps one aligns the conventions with the eventual Orange Button taxonomy?) The devil might be in the details, though, so I am going to report back after I try to actually implement netCDF for my particular data at hand.

Regarding some of @cwhanse 's more specific comments:

wholmgren commented 6 years ago

I forgot to mention CF conventions! Developing similar conventions for pv data in netCDF format could be a high impact project.

cwhanse commented 6 years ago

what's a CF convention?

wholmgren commented 6 years ago

http://cfconventions.org/faq.html#what

markcampanelli commented 6 years ago

@mikofski Do you have any insight on how we might integrate, for example, netCDF files with pvfree?

(https://pvfree.alwaysdata.net/ is down ATM, so I couldn't look at the user interface to the existing data.)

dacoex commented 6 years ago

Responding to: https://github.com/pvlib/pvlib-python/issues/469#issuecomment-393633686 For the columns, please try to use existing standards as much as possible. See IEC 61724, page 7 of https://ia801002.us.archive.org/6/items/gov.in.is.iec.61724.1998/is.iec.61724.1998.pdf (an outdated edition, but the 2017 version is much the same). The data format could be, as suggested, CSV/pandas for simple cases, and netCDF or HDF5 for larger datasets.

adriesse commented 6 years ago

Actually this nomenclature is not suitable because of the Greek letters, subscripts and commas. The newer version of this standard (2017) is no better, nor is the variable list on the PVPMC website.

Or maybe I misinterpreted your suggestion?

cwhanse commented 6 years ago

@adriesse I had forgotten about that list at https://pvpmc.sandia.gov/resources-and-events/variable-list/ We compiled that list consistent with the notation on https://pvpmc.sandia.gov and the PVLib for MATLAB toolbox. But it's consistent with notation generally in PV-related literature, e.g., alpha for temperature coefficient for current. Do you see the spelled-out greek letters as an issue for column naming?

adriesse commented 6 years ago

Not a fundamental issue. But there aren't that many of them and they get reused a lot. Typesetting and programming have different needs and constraints for notation.

wholmgren commented 6 years ago

We also keep this related list in our own documentation: http://pvlib-python.readthedocs.io/en/latest/variables_style_rules.html I'd like to see more names fully spelled out, and fewer abbreviations and spelled-out greek letters. I'd vote for that in column/variable naming conventions for a data structure as well.

mikofski commented 6 years ago

‘There are only two hard things in Computer Science: cache invalidation and naming things.’

mikofski commented 6 years ago

Hi @thunderfish24 , sorry all, unfortunately some of my GitHub notifications have been going to spam ☹️

>>> import requests
>>> r = requests.get('https://pvfree.herokuapp.com/api/v1/pvmodule/?format=json&Name__icontains=Canadian%20Solar')
>>> import pprint
>>> pprint.pprint(r.json())
{'meta': {'limit': 20,
          'next': None,
          'offset': 0,
          'previous': None,
          'total_count': 2},
 'objects': [{'A': -3.40641,
              'A0': 0.928385,
              'A1': 0.068093,
              'A2': -0.0157738,
              'A3': 0.0016606,
              'A4': -6.93e-05,
              'Aimp': 0.000181,
              'Aisc': 0.000397,
              'Area': 1.701,
              'B': -0.0842075,
              'B0': 1.0,
              'B1': -0.002438,
              'B2': 0.0003103,
              'B3': -1.246e-05,
              'B4': 2.11e-07,
              'B5': -1.36e-09,
              'Bvmpo': -0.235488,
              'Bvoco': -0.21696,
              'C0': 1.01284,
              'C1': -0.0128398,
              'C2': 0.279317,
              'C3': -7.24463,
              'C4': 0.996446,
              'C5': 0.003554,
              'C6': 1.15535,
              'C7': -0.155353,
              'Cells_in_Series': 96,
              'DTC': 3.0,
              'FD': 1.0,
              'IXO': 4.97599,
              'IXXO': 3.18803,
              'Impo': 4.54629,
              'Isco': 5.09115,
              'Material': 10,
              'Mbvmp': 0.0,
              'Mbvoc': 0.0,
              'N': 1.4032,
              'Name': 'Canadian Solar CS5P-220M [ 2009]',
              'Notes': 'Source: Sandia National Laboratories Updated 9/25/2012 '
                       'Module Database',
              'Parallel_Strings': 1,
              'Vintage': '2009-01-01',
              'Vmpo': 48.3156,
              'Voco': 59.2608,
              'id': 114,
              'is_vintage_estimated': False,
              'resource_uri': '/api/v1/pvmodule/114/'},
             {'A': -3.6024,
              'A0': 0.9371,
              'A1': 0.06262,
              'A2': -0.01667,
              'A3': 0.002168,
              'A4': -0.0001087,
              'Aimp': -0.0001,
              'Aisc': 0.0005,
              'Area': 1.91,
              'B': -0.2106,
              'B0': 1.0,
              'B1': -0.00789,
              'B2': 0.0008656,
              'B3': -3.298e-05,
              'B4': 5.178e-07,
              'B5': -2.918e-09,
              'Bvmpo': -0.1634,
              'Bvoco': -0.1532,
              'C0': 1.0121,
              'C1': -0.0121,
              'C2': -0.171,
              'C3': -9.397451,
              'C4': None,
              'C5': None,
              'C6': None,
              'C7': None,
              'Cells_in_Series': 72,
              'DTC': 3.2,
              'FD': 1.0,
              'IXO': None,
              'IXXO': None,
              'Impo': 8.1359,
              'Isco': 8.6388,
              'Material': 10,
              'Mbvmp': 0.0,
              'Mbvoc': 0.0,
              'N': 1.0025,
              'Name': 'Canadian Solar CS6X-300M [2013]',
              'Notes': 'Source:  CFV Solar Test Lab.  Tested 2013.  Module '
                       '13022-08',
              'Parallel_Strings': 1,
              'Vintage': '2013-01-01',
              'Vmpo': 34.9531,
              'Voco': 43.5918,
              'id': 518,
              'is_vintage_estimated': False,
              'resource_uri': '/api/v1/pvmodule/518/'}]}

and snlinverter models

```python
>>> r = requests.get('https://pvfree.herokuapp.com/api/v1/pvinverter/?format=json&Name__icontains=PVP&Paco__exact=260000')
>>> pprint.pprint(r.json())
{'meta': {'limit': 20, 'next': None, 'offset': 0, 'previous': None, 'total_count': 4},
 'objects': [{'C0': -1.07933e-07, 'C1': 1.88514e-05, 'C2': 0.00151279,
              'C3': -0.000697514, 'Idcmax': 791.29, 'Mppt_high': 480.0,
              'Mppt_low': 295.0, 'Name': 'PV Powered: PVP260KW [480V] 480V [CEC 2018]',
              'Paco': 260000.0, 'Pdco': 269830.0, 'Pnt': 67.0, 'Pso': 1006.34,
              'Vac': 480.0, 'Vdcmax': 480.0, 'Vdco': 341.0,
              'created_on': '2018-05-09', 'id': 3847, 'modified_on': '2018-05-09',
              'resource_uri': '/api/v1/pvinverter/3847/'},
             {'C0': -1.35676e-07, 'C1': 2.54289e-05, 'C2': 0.00206057,
              'C3': -0.000253737, 'Idcmax': 849.99, 'Mppt_high': 480.0,
              'Mppt_low': 265.0, 'Name': 'PV Powered: PVP260KW-LV [480V] 480V [CEC 2018]',
              'Paco': 260000.0, 'Pdco': 271147.0, 'Pnt': 67.0, 'Pso': 1086.2,
              'Vac': 480.0, 'Vdcmax': 480.0, 'Vdco': 319.0,
              'created_on': '2018-05-09', 'id': 3848, 'modified_on': '2018-05-09',
              'resource_uri': '/api/v1/pvinverter/3848/'},
             {'C0': -1.03e-07, 'C1': 2.05e-05, 'C2': 0.00203, 'C3': -0.000443,
              'Idcmax': 925.0, 'Mppt_high': 600.0, 'Mppt_low': 295.0,
              'Name': 'PV Powered: PVP260kW 480V [CEC 2009]', 'Paco': 260000.0,
              'Pdco': 270057.3609, 'Pnt': 67.0, 'Pso': 893.1837948, 'Vac': 480.0,
              'Vdcmax': 600.0, 'Vdco': 343.3983333, 'created_on': '2018-05-09',
              'id': 3849, 'modified_on': '2018-05-09',
              'resource_uri': '/api/v1/pvinverter/3849/'},
             {'C0': -1.33e-07, 'C1': 2.79e-05, 'C2': 0.00273, 'C3': 0.000131,
              'Idcmax': 1030.0, 'Mppt_high': 600.0, 'Mppt_low': 265.0,
              'Name': 'PV Powered: PVP260kW-LV 480V [CEC 2009]', 'Paco': 260000.0,
              'Pdco': 271537.9777, 'Pnt': 67.0, 'Pso': 929.7589628, 'Vac': 480.0,
              'Vdcmax': 600.0, 'Vdco': 322.2183333, 'created_on': '2018-05-09',
              'id': 3850, 'modified_on': '2018-05-09',
              'resource_uri': '/api/v1/pvinverter/3850/'}]}
```
import numpy as np
import pvlib

x = pvlib.pvsystem.singlediode(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
y = pvlib.pvsystem.singlediode(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)

my_dtype = np.dtype([
  ('i_l', float), ('i_0', float), ('r_s', float), ('r_sh', float), ('nNsVth', float),
  ('i_sc', float), ('v_oc', float), ('i_mp', float), ('v_mp', float),
  ('i', float, (1,100)), ('v', float, (1, 100))
])
my_data = np.array([
    (6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
     x['i_sc'], x['v_oc'], x['i_mp'], x['v_mp'], x['i'], x['v']),
    (5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
     y['i_sc'], y['v_oc'], y['i_mp'], y['v_mp'], y['i'], y['v'])
], my_dtype)

my_data['i_l']  # list of all photogenerated currents (`I_L`)
# array([6.1, 5.1])

my_data['i'][0]
# list of cell currents for first record

I'm not endorsing this, just making sure you all are aware of it. But note how you can make the cell current and voltage fields 1x100 since we know we'll set ivcurve_pnts to 100, we know this size. Also we are not likely to ever need to slice across this set of values, so it's okay that they are lumped in a field. Finally, you could just as easily make this entire array M by N dimensions, where, M is the number of temperatures (T), and N is the number of irradiances (E). Then to get the I-V curves for a particular (E, T) combination, just use my_data[m, n] where m is the index of the desired T, and n is the index of the desired E. Then you can do NumPy "fancy" indexing to get the i-v curve of at (E, T)

your_data = np.copy(my_data)
# reshape my_data and your_data from (2,) to (1, 2), and concatenate
# you could also use np.atleast_2d or np.tile probably, ...
# lots of options here, not sure best ...
all_data = np.concatenate([my_data.reshape(1,2), your_data.reshape(1,2)], axis=0)
all_data.shape
# (2, 2)
# now "fancy" indexing to get i-v curve at (E, T)
all_data[1, 1][['i', 'v']]
([[5.09950248e+00, 5.09674358e+00, 5.09398468e+00, 5.09122577e+00, 5.08846685e+00, ...]],
 [[ 0.        ,  0.339375  ,  0.67875   ,  1.01812499,  1.35749999, ...]])

then plot

import matplotlib.pyplot as plt
plt.ion()

v, i = all_data[1, 1][['v', 'i']]
plt.plot(v.flat, i.flat)
plt.grid()
plt.title('i-v curve at (E, T) from NumPy structured arrays')
plt.xlabel('voltage, V')
plt.ylabel('current, I')

[figure_1: i-v curve plotted from the structured array]

You could easily plot families of curves this same way.
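For instance (a self-contained sketch with a toy family of curves — the curve shape and irradiance levels are made up, and the Agg backend is assumed so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

# Made-up family of I-V curves at a few irradiance levels.
v = np.linspace(0, 40, 100)
fig, ax = plt.subplots()
for g in (0.2, 0.6, 1.0):
    i = 6.0 * g * (1 - (v / 40) ** 8)  # toy curve shape, not a diode model
    ax.plot(v, i, label=f'G = {g}')
ax.set_xlabel('voltage, V')
ax.set_ylabel('current, I')
ax.legend()
fig.savefig('iv_family.png')
```

With the M-by-N structured array above, the loop would instead iterate over `all_data[m, n][['v', 'i']]`.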

markcampanelli commented 6 years ago

@adriesse It turns out pandas' multi-indexing dataframe is a "natural" solution (at least for my way of thinking) as well as mapping 1-1 to the underlying spreadsheet template that we are using for data collection. Below is an example with a matrix of fake I-V-F-H curves, each with 10 points. In the case with curves with differing numbers of points, some "trailing" missing values would be NaN and they would need to be accommodated, which could lead to significant wasted memory in some datasets. This also doesn't address the question of how to best store the dataframe.

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['0.1', '0.2', '0.4', '0.6', '0.8', '1.0', '1.1'],
     ['15', '25', '50'],
     ['v_V', 'i_A', 'f', 'h']],
    names=['f_nom', 't_degC_nom', 'channel'])
df = pd.DataFrame(np.random.randn(index.size, 10), index=index)
print(df.loc[(['0.8', '1.0', '1.1'], ['15', '25', '50'], ['v_V', 'i_A']), ::2])

gives

                                 0         2         4         6         8
f_nom t_degC_nom channel                                                  
0.8   15         v_V     -0.424951 -0.246160  0.369397 -0.250131 -2.175697
                 i_A     -0.558297  0.357848 -1.158237  0.471445  1.383800
      25         v_V      0.576494 -0.447756  0.383170  0.380588 -1.548071
                 i_A      0.147228 -0.820177  0.083555 -1.102742  0.917184
      50         v_V      1.417219  0.101926  0.865095 -0.375521  0.323528
                 i_A      1.180871  0.469004  0.301483 -1.834616  1.189444
1.0   15         v_V      0.331482 -1.144688  0.938213  1.029849  0.623912
                 i_A      1.825356  0.337394  1.961099 -1.697143  0.025176
      25         v_V     -0.212704 -1.479759  2.636582 -0.158017  0.262181
                 i_A      0.034549 -0.700572  1.698807 -0.324248  1.862543
      50         v_V      0.792924  1.491777 -0.197562 -0.360991 -0.507311
                 i_A     -0.804397 -0.011431  1.013257 -0.731444 -0.241442
1.1   15         v_V     -1.055600  0.214343  0.814874 -0.262117  0.101457
                 i_A     -1.489756  0.501986 -0.095838 -0.358071  0.593954
      25         v_V     -0.143782  0.241644 -2.829719  0.170969 -0.963931
                 i_A     -0.971334 -0.659448 -1.063498  0.377788  0.197118
      50         v_V      1.077378 -2.065788  1.238328  0.255115  0.196997
                 i_A     -0.746125  0.569585 -0.951298 -0.773195  0.082101

adriesse commented 6 years ago

As I mentioned, you can pivot 2-D data into other forms such as the one you show (see https://pandas.pydata.org/pandas-docs/stable/reshaping.html) to facilitate analysis or visualization.

I sometimes do that, but other times I just do a query to get the subset I want, which I find more straightforward than the multi-index syntax, but is probably less efficient.
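Side by side, the two selection styles look like this (a small sketch on a made-up multi-indexed frame mirroring the one above):

```python
import numpy as np
import pandas as pd

# Small multi-indexed frame like the one above (made-up values).
index = pd.MultiIndex.from_product(
    [['0.8', '1.0'], ['15', '25'], ['v_V', 'i_A']],
    names=['f_nom', 't_degC_nom', 'channel'])
df = pd.DataFrame(np.arange(index.size * 2).reshape(index.size, 2), index=index)

# Multi-index selection ...
a = df.loc[('1.0', '25', 'i_A')]

# ... versus a query on the (reset) index, which reads more plainly
# but first materializes the index levels as columns.
b = df.reset_index().query(
    "f_nom == '1.0' and t_degC_nom == '25' and channel == 'i_A'")
```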

mikofski commented 6 years ago

Sorry if my previous comment was too long and meandering. ☹️

I should add that serializing and deserializing my example with h5py is trivial:

import numpy as np
import pvlib
import h5py

# create some data
x = pvlib.pvsystem.singlediode(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
y = pvlib.pvsystem.singlediode(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)

# set the dtypes to use as a structured array
my_dtype = np.dtype([
  ('i_l', float), ('i_0', float), ('r_s', float), ('r_sh', float), ('nNsVth', float),
  ('i_sc', float), ('v_oc', float), ('i_mp', float), ('v_mp', float),
  ('i', float, (1,100)), ('v', float, (1, 100))
])

# store the data in a structured array; note that the IV curve is a nested array field
my_data = np.array([
    (6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
     x['i_sc'], x['v_oc'], x['i_mp'], x['v_mp'], x['i'], x['v']),
    (5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
     y['i_sc'], y['v_oc'], y['i_mp'], y['v_mp'], y['i'], y['v'])
], my_dtype)

# pretend that this is a grid of IV curves for matrix of (E, T)
your_data = np.copy(my_data)
# reshape my_data and your_data from (2,) to (1, 2),
# and concatenate to make a fake grid
all_data = np.concatenate([my_data.reshape(1,2), your_data.reshape(1,2)], axis=0)

# output to a file
with h5py.File('THIS_IS_A_TEST_FILE.H5', 'w') as f:
    f['data'] = all_data  # key "data" is arbitrary, choose as many groups as you need

quit python and restart

import h5py
import numpy as np

# retrieve the data from file
with h5py.File('THIS_IS_A_TEST_FILE.H5', 'r') as f:
    all_data = np.array(f['data'])

# do some fancy indexing:
all_data[1,1][['i', 'v']]
# ([[5.09950248e+00, 5.09674358e+00, ..., 1.41340468e+00, 7.65749212e-01, 7.99360578e-15]],
#  [[ 0.        ,  0.339375  ,  0.67875   ,  ..., 32.91937481, 33.2587498 , 33.5981248 ]])

use record arrays instead of structured:

all_data_rec = np.rec.array(all_data)  # as record array
all_data_rec.i_l
# array([[6.1, 5.1],
#        [6.1, 5.1]])

AFAICT the only difference between structured and record arrays is the ability to use attributes for column names instead of fields.

cwhanse commented 6 years ago

@mikofski what might change if we want to store multiple IV curves, and the v vector has different lengths?

wholmgren commented 6 years ago

Are we still (or were we ever) discussing a pvlib enhancement? If no, let's at least close the issue if not move it elsewhere.

cwhanse commented 6 years ago

At the moment, the discussion is relevant to the demonstration data for #229 and possibly to whatever we do with #511. I'm OK closing this as an issue, and taking up the discussion when we have a specific implementation to review. I'd rather see a pull request targeting iotools for reading/writing IV curve data for use in pvlib.

markcampanelli commented 6 years ago

@mikofski You have convinced me to take a closer look at numpy's structured/record arrays :). The alternative I choose (pandas vs. numpy) will mostly depend on which "feels" more lightweight and natural in terms of things like complex slicing, concatenation, dealing with I-V curves of different lengths, and handling repeated measurements. Oh, did I mention that I also have normal-incidence QEs at three temperatures for this dataset too?

I don't see any big issues saving either alternative to HDF5, but I do need to further investigate the storage of meta-data such as channel units as well as settle upon the names (and maybe a standards effort would ultimately prefer netCDF with PV-specific "conventions"). Finally, do you know if it makes sense to transfer the HDF5 over the wire for a REST API, or would you anticipate a server-side JSON conversion?
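For the channel-units question, HDF5 attributes seem like a natural home (a sketch with h5py; the dataset name and attribute key are my own invention, and the in-memory `core` driver just avoids touching disk):

```python
import numpy as np
import h5py

# In-memory HDF5 file (no disk write) holding one channel plus its
# units stored as an attribute on the dataset.
f = h5py.File('units_demo.h5', 'w', driver='core', backing_store=False)
dset = f.create_dataset('i_A', data=np.array([1.0, 0.9, 0.0]))
dset.attrs['units'] = 'A'
```

Per-curve metadata (nominal F and T, timestamps) could live as attributes on groups the same way.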

@adriesse Pandas' pivoting is impressive and thanks for bringing that tool to my attention. I'm hoping that the "raw" data structure can be organized (at least for the IEC 61853-1 use case) such that it could be readily "understood" by a human who loads it out of storage and displays the data object for the first time, and it seems like the multi-index setup accomplishes that well.

@wholmgren I will close this issue now, but @cwhanse please reference this use case as the Orange Button initiative gets underway.