@thunderfish24 I support this effort. FYI - we're planning a more formal taxonomy development as part of Orange Button that has these data within its scope, but that's a year out. Some thoughts about python structure:
- instead of 2 numpy.arrays, what about an array of tuples (v_V, I_A)? The voltage and current come in pairs. It may be awkward to work with them as tuple type, though.
- the use of values in the keys (e.g., ('0.1', '15 degC')) seems awkward - what if the effective irradiance isn't exactly 0.1? Also, the effective irradiance value is also in the dict. What about an ordinal index for the primary key, and a dict for a single curve with metadata as single fields and the IV curve in a dict, like `{'Ee' : value, 'Tc' : value, 'IVdata' : {'v_V' : array, 'I_A' : array, etc.}}`? (See the sketch after this list.)
- I'm now inclined to think that a class definition would be great here, because I can foresee that we'll want a number of convenience methods to access and manipulate the data in a set of IV curves. E.g., IVCurve.get_Ee_values would return a list of the nominal effective irradiances for a set of IV curves.
- Notation - I think you are alone in using 'F' for effective irradiance. While I'm certain you have your reasons, I most often see 'E' for this quantity, and 'G' for broadband.
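A minimal sketch of that per-curve dict idea, with an ordinal (list-position) primary key; the values here are made up:

import numpy as np

# one dict per curve: metadata as scalar fields, IV data nested
curves = [
    {'Ee': 0.1, 'Tc': 15.0,
     'IVdata': {'v_V': np.array([0.0, 20.0, 40.0]),
                'i_A': np.array([0.50, 0.49, 0.0])}},
    {'Ee': 0.2, 'Tc': 15.0,
     'IVdata': {'v_V': np.array([0.0, 20.0, 40.0]),
                'i_A': np.array([1.00, 0.98, 0.0])}},
]

# the kind of convenience accessor a class could wrap:
Ee_values = [c['Ee'] for c in curves]  # nominal effective irradiances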
Xarrays might work, but I would suggest a simple 2-D table (in pandas) with columns such as:
You can then use query to extract subsets and/or reindex/pivot/unstack to rearrange the data for analysis. I tend to see units as metadata.
(The suggestion above is for internal python storage/data structures to facilitate analysis, which the question seems to be focusing on. External/file storage options would ideally not be python-specific.)
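A rough sketch of the flat-table idea; the column names here (curve_id, E, T) are illustrative, not from any standard:

import numpy as np
import pandas as pd

# one row per measured IV point; curve_id ties points to a single sweep
df_flat = pd.DataFrame({
    'curve_id': [0, 0, 0, 1, 1, 1],
    'E':   [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],  # effective irradiance (suns)
    'T':   [15.0] * 6,                      # cell temperature (degC)
    'v_V': [0.0, 20.0, 40.0, 0.0, 20.0, 40.0],
    'i_A': [0.50, 0.49, 0.0, 1.00, 0.98, 0.0],
})

# extract subsets with query, then pivot/unstack to rearrange for analysis
df_flat.query('E == 0.2 and T == 15.0')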
netCDF might be a good option here. This may be complicated enough and have enough different use cases that it's worth adopting a standard, well-supported framework instead of developing our own.
In some of our research projects, trying to apply our half-baked data models to the netCDF format revealed gaps in how we were thinking about the problem.
You could wrap up the netCDF data with a custom class, if desired.
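A minimal netCDF4 sketch along these lines; the dimension and variable names are made up, and a wrapper class could hide these calls:

from netCDF4 import Dataset
import numpy as np

# a minimal file: dimensions, variables, and units kept as attributes
with Dataset('iv_curves.nc', 'w') as nc:
    nc.createDimension('curve', 2)     # one entry per IV sweep
    nc.createDimension('point', 100)   # samples along each sweep
    v = nc.createVariable('voltage', 'f8', ('curve', 'point'))
    i = nc.createVariable('current', 'f8', ('curve', 'point'))
    v.units = 'V'   # units live in metadata attributes,
    i.units = 'A'   # not in keys or column names
    v[:] = np.zeros((2, 100))
    i[:] = np.zeros((2, 100))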
Thanks everyone for the thoughtful responses!
After looking a bit further, I would say that netCDF has the most potential and the best "alignment" to this particular dataset that I have in mind. (pandas' DataFrame still seems less well aligned to me, but it's probably doable.) With netCDF, the data formatting and storage would be more standardized than what I suggested, and I agree that units are best kept in the metadata and out of the data's indexes. It seems that the standardization part with netCDF involves coming up with "conventions", analogous to the CF conventions and their units, that cover the various PV performance measurement/monitoring cases. (Perhaps one aligns the conventions with the eventual Orange Button taxonomy?) The devil might be in the details, though, so I am going to report back after I try to actually implement netCDF for my particular data at hand.
Regarding some of @cwhanse 's more specific comments:
I would have to play around with the tuple indexing. I'm hoping that netCDF will allow better "simultaneous slicing" of all four components from a single I-V-F-T curve.
The values used in the keys are the "nominal" F and T values for identifying a particular curve. The actual values (which may vary) during the I-V curve measurement are in the data vectors. This particular case involves "uncorrected" I-V curves taken in a measurement lab with a solar simulator that "flickers", plus possible heating of the device during the I-V measurement. I suppose outdoor data would usually just have a single F and T for each curve, or possibly two values from pre- and post-sweep readings. I suppose that ideally one might like to have accurate timestamps on all the points, which I think would allow one to cover all these different setups more generally in the data structure. The calibration labs I know of don't record such timestamps :(. Indeed, they usually don't actually record separate temperatures at each point in the sweep, which may be justified for fast sweeps where the 3 sig-fig temperature gauge doesn't respond quickly enough.
I agree that a wrapper class looks like an eventuality. However, we should strive to make the data structure as easy to query as possible "in the raw", which I see as a particular strength of pandas.
I think I switched to F because NREL used E for spectral irradiance, and, more importantly, my particular "effective irradiance" is defined as a ratio of short-circuit currents, F = Isc/Isc0, where Isc is at prevailing POA spectral+total irradiance and junction temperature and Isc0 is at reference conditions. This is also the value reported by a reference monitoring device via F = Isc/Isc0 = M*Isc,r/Isc0,r, where the spectral correction factor M changes with the temperature of both devices and there is a linearity assumption w.r.t. current vs. total irradiance (and, of course, a homogeneity assumption). This differs a bit from some of the other "flavors" of effective irradiance.
I forgot to mention CF conventions! Developing similar conventions for pv data in netCDF format could be a high impact project.
what's a CF convention?
@mikofski Do you have any insight on how we might integrate, for example, netCDF files with pvfree? (https://pvfree.alwaysdata.net/ is down ATM, so I couldn't look at the user interface to the existing data.)
Responding to https://github.com/pvlib/pvlib-python/issues/469#issuecomment-393633686: For columns, please try to use the existing standards as much as possible. See IEC 61724, page 7 of https://ia801002.us.archive.org/6/items/gov.in.is.iec.61724.1998/is.iec.61724.1998.pdf (outdated standard, but the 2017 version is +/- the same). The data format could be, as suggested, csv/pandas for simple datasets and netCDF or HDF5 for larger ones.
Actually this nomenclature is not suitable because of the Greek letters, subscripts and commas. The newer version of this standard (2017) is no better, nor is the variable list on the PVPMC website.
Or maybe I misinterpreted your suggestion?
@adriesse I had forgotten about that list at https://pvpmc.sandia.gov/resources-and-events/variable-list/ We compiled that list consistent with the notation on https://pvpmc.sandia.gov and the PVLib for MATLAB toolbox. But it's consistent with notation generally in PV-related literature, e.g., alpha for temperature coefficient for current. Do you see the spelled-out greek letters as an issue for column naming?
Not a fundamental issue. But there aren't that many of them and they get reused a lot. Typesetting and programming have different needs and constraints for notation.
We also keep this related list in our own documentation: http://pvlib-python.readthedocs.io/en/latest/variables_style_rules.html I'd like to see more names fully spelled out, and fewer abbreviations and spelled-out greek letters. I'd vote for that in column/variable naming conventions for a data structure as well.
‘There are only two hard things in Computer Science: cache invalidation and naming things.’
Hi @thunderfish24, sorry all, unfortunately some of my GitHub notifications have been going to spam ☹️
pvfree serves the sapm module parameters via a REST API, e.g.:
>>> import requests
>>> r = requests.get('https://pvfree.herokuapp.com/api/v1/pvmodule/?format=json&Name__icontains=Canadian%20Solar')
>>> import pprint
>>> pprint.pprint(r.json())
{'meta': {'limit': 20,
'next': None,
'offset': 0,
'previous': None,
'total_count': 2},
'objects': [{'A': -3.40641,
'A0': 0.928385,
'A1': 0.068093,
'A2': -0.0157738,
'A3': 0.0016606,
'A4': -6.93e-05,
'Aimp': 0.000181,
'Aisc': 0.000397,
'Area': 1.701,
'B': -0.0842075,
'B0': 1.0,
'B1': -0.002438,
'B2': 0.0003103,
'B3': -1.246e-05,
'B4': 2.11e-07,
'B5': -1.36e-09,
'Bvmpo': -0.235488,
'Bvoco': -0.21696,
'C0': 1.01284,
'C1': -0.0128398,
'C2': 0.279317,
'C3': -7.24463,
'C4': 0.996446,
'C5': 0.003554,
'C6': 1.15535,
'C7': -0.155353,
'Cells_in_Series': 96,
'DTC': 3.0,
'FD': 1.0,
'IXO': 4.97599,
'IXXO': 3.18803,
'Impo': 4.54629,
'Isco': 5.09115,
'Material': 10,
'Mbvmp': 0.0,
'Mbvoc': 0.0,
'N': 1.4032,
'Name': 'Canadian Solar CS5P-220M [ 2009]',
'Notes': 'Source: Sandia National Laboratories Updated 9/25/2012 '
'Module Database',
'Parallel_Strings': 1,
'Vintage': '2009-01-01',
'Vmpo': 48.3156,
'Voco': 59.2608,
'id': 114,
'is_vintage_estimated': False,
'resource_uri': '/api/v1/pvmodule/114/'},
{'A': -3.6024,
'A0': 0.9371,
'A1': 0.06262,
'A2': -0.01667,
'A3': 0.002168,
'A4': -0.0001087,
'Aimp': -0.0001,
'Aisc': 0.0005,
'Area': 1.91,
'B': -0.2106,
'B0': 1.0,
'B1': -0.00789,
'B2': 0.0008656,
'B3': -3.298e-05,
'B4': 5.178e-07,
'B5': -2.918e-09,
'Bvmpo': -0.1634,
'Bvoco': -0.1532,
'C0': 1.0121,
'C1': -0.0121,
'C2': -0.171,
'C3': -9.397451,
'C4': None,
'C5': None,
'C6': None,
'C7': None,
'Cells_in_Series': 72,
'DTC': 3.2,
'FD': 1.0,
'IXO': None,
'IXXO': None,
'Impo': 8.1359,
'Isco': 8.6388,
'Material': 10,
'Mbvmp': 0.0,
'Mbvoc': 0.0,
'N': 1.0025,
'Name': 'Canadian Solar CS6X-300M [2013]',
'Notes': 'Source: CFV Solar Test Lab. Tested 2013. Module '
'13022-08',
'Parallel_Strings': 1,
'Vintage': '2013-01-01',
'Vmpo': 34.9531,
'Voco': 43.5918,
'id': 518,
'is_vintage_estimated': False,
'resource_uri': '/api/v1/pvmodule/518/'}]}
and likewise the snlinverter models.
my personal recommendation would be to use HDF5 and h5py
but netCDF is also good. AFAICT, XArray is an in-memory data structure, not a serialized format, but they recommend storing XArray as netCDF. Another in-memory data structure is Apache Arrow which serializes as Feather. There are probably many more, and many deciding factors that could influence your choice: maturity, performance, etc.
For an example of using netCDF4 with NumPy in Python, see ecmwf_macc_tools.py in my PVSC-44 repo, which is applied to AOD data downloaded from the ECMWF API with the all_ecmwf_data.py script, once you've registered and installed their python API client. See How to retrieve ECMWF Public Datasets. You can see how I use the Atmosphere class in the ECMWF section of the Jupyter notebook.
As far as notation, I'm in favor of syncing the PVPMC and PVLib notation as much as possible and using verbose names, i.e., spell them out, as @wholmgren seems to indicate, but syncing later with OrangeButton and other constituents (CF, Unidata, etc.) will still be an outstanding issue. My advice is to document as much as possible, maybe start with that first, and make the documentation public.
One thing to note about hierarchical formats is that they are optimized to be sliced along certain dimensions, and will slow down in other directions, so understanding how you want to use your data should inform how you store it.
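For instance, with h5py the chunk shape is fixed when a dataset is created, and reads aligned with the chunks are much cheaper than cross-cutting reads (shapes here are illustrative):

import h5py
import numpy as np

with h5py.File('chunked.h5', 'w') as f:
    # each chunk holds one full 100-point curve: reading a whole curve
    # touches one chunk; reading point k of every curve touches them all
    dset = f.create_dataset('iv', shape=(1000, 100), dtype='f8',
                            chunks=(1, 100))
    dset[0, :] = np.linspace(0.0, 40.0, 100)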
re: @cwhanse's comment:

> instead of 2 numpy.arrays, what about an array of tuples (v_V, I_A)? The voltage and current come in pairs. It may be awkward to work with them as tuple type, though.
you can also use NumPy structured arrays with nested fields of arbitrary dimensions.
import numpy as np
import pvlib

# two 100-point IV curves from the single-diode model; positional args are
# photocurrent, saturation current, Rs, Rsh, nNsVth, and ivcurve_pnts
x = pvlib.pvsystem.singlediode(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
y = pvlib.pvsystem.singlediode(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
# structured dtype: scalar diode parameters plus nested (1, 100) IV arrays
my_dtype = np.dtype([
    ('i_l', float), ('i_0', float), ('r_s', float), ('r_sh', float), ('nNsVth', float),
    ('i_sc', float), ('v_oc', float), ('i_mp', float), ('v_mp', float),
    ('i', float, (1, 100)), ('v', float, (1, 100))
])
my_data = np.array([
(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
x['i_sc'], x['v_oc'], x['i_mp'], x['v_mp'], x['i'], x['v']),
(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
y['i_sc'], y['v_oc'], y['i_mp'], y['v_mp'], y['i'], y['v'])
], my_dtype)
my_data['i_l'] # list of all photogenerated currents (`I_L`)
# array([6.1, 5.1])
my_data['i'][0]
# list of cell currents for first record
I'm not endorsing this, just making sure you all are aware of it. But note how you can make the cell current and voltage fields 1x100: since we know we'll set ivcurve_pnts to 100, we know this size. Also, we are not likely to ever need to slice across this set of values, so it's okay that they are lumped in a field. Finally, you could just as easily make this entire array M by N dimensions, where M is the number of temperatures (T) and N is the number of irradiances (E). Then to get the I-V curves for a particular (E, T) combination, just use my_data[m, n], where m is the index of the desired T and n is the index of the desired E. Then you can do NumPy "fancy" indexing to get the i-v curve at (E, T):
your_data = np.copy(my_data)
# reshape my_data and your_data from (2,) to (1, 2), and concatenate;
# you could also use np.atleast_2d or np.tile probably, ...
# lots of options here, not sure which is best ...
all_data = np.concatenate([my_data.reshape(1,2), your_data.reshape(1,2)], axis=0)
all_data.shape
# (2, 2)
# now "fancy" indexing to get i-v curve at (E, T)
all_data[1, 1][['i', 'v']]
([[5.09950248e+00, 5.09674358e+00, 5.09398468e+00, 5.09122577e+00, 5.08846685e+00, ...]],
[[ 0. , 0.339375 , 0.67875 , 1.01812499, 1.35749999, ...]])
then plot
import matplotlib.pyplot as plt
plt.ion()
v, i = all_data[1, 1][['v', 'i']]
plt.plot(v.flat, i.flat)
plt.grid()
plt.title('i-v curve at (E, T) from NumPy structured arrays')
plt.xlabel('voltage, V')
plt.ylabel('current, I')
You could easily plot families of curves this same way.
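For instance, continuing the snippet above, a quick sketch:

# plot every curve in the (E, T) grid from the example above
for m in range(all_data.shape[0]):
    for n in range(all_data.shape[1]):
        v, i = all_data[m, n][['v', 'i']]
        plt.plot(v.flat, i.flat, label='T index %d, E index %d' % (m, n))
plt.legend()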
I like this format better than making individual (i, v) pairs because, as I said before, you would never need to use the data that way; adding that extra dimension would add a lot of unnecessary indices, blow up the size of your record, and make retrieval that much slower. IMO even better would be to keep all of the currents and voltages for a single (E, T) pair in one field, just call it ('i-v', float, (2, 100)) for i-v curves that are always 100 points long, then just document somewhere which is in the 0th index and which is in the 1st index. Or keep i and v separate like I did in my example, but IMO don't break up each (i, v) pair.
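A sketch of that combined-field idea, using 'i_v' rather than 'i-v' so record-array attribute access still works, with the row convention documented in a comment:

import numpy as np

# one field holding both traces; by convention here, row 0 is current [A]
# and row 1 is voltage [V]
iv_dtype = np.dtype([('E', float), ('T', float), ('i_v', float, (2, 100))])
rec = np.zeros(2, dtype=iv_dtype)
i, v = rec[0]['i_v']  # unpack the two documented rows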
Also, you can index into a dictionary using tuples: since they are immutable and therefore hashable, you can use any hashable object as a dictionary key.
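A trivial sketch, matching the keys used in the original post:

# tuples are hashable, so nominal (F, T) pairs work directly as dict keys
data = {('0.2', '15 degC'): {'i_A': [5.0, 4.9, 0.0], 'v_V': [0.0, 20.0, 40.0]}}
data[('0.2', '15 degC')]['i_A']  # -> [5.0, 4.9, 0.0]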
@adriesse It turns out pandas' multi-indexed DataFrame is a "natural" solution (at least for my way of thinking) as well as mapping 1-1 to the underlying spreadsheet template that we are using for data collection. Below is an example with a matrix of fake I-V-F-H curves, each with 10 points. For curves with differing numbers of points, some "trailing" missing values would be NaN and would need to be accommodated, which could waste significant memory in some datasets. This also doesn't address the question of how best to store the dataframe.
import numpy as np  # needed for np.random below
import pandas as pd

index = pd.MultiIndex.from_product([['0.1', '0.2', '0.4', '0.6', '0.8', '1.0', '1.1'],
                                    ['15', '25', '50'],
                                    ['v_V', 'i_A', 'f', 'h']],
                                   names=['f_nom', 't_degC_nom', 'channel'])
df = pd.DataFrame(np.random.randn(index.size, 10), index=index)
print(df.loc[(['0.8', '1.0', '1.1'], ['15', '25', '50'], ['v_V', 'i_A']),::2])
gives
0 2 4 6 8
f_nom t_degC_nom channel
0.8 15 v_V -0.424951 -0.246160 0.369397 -0.250131 -2.175697
i_A -0.558297 0.357848 -1.158237 0.471445 1.383800
25 v_V 0.576494 -0.447756 0.383170 0.380588 -1.548071
i_A 0.147228 -0.820177 0.083555 -1.102742 0.917184
50 v_V 1.417219 0.101926 0.865095 -0.375521 0.323528
i_A 1.180871 0.469004 0.301483 -1.834616 1.189444
1.0 15 v_V 0.331482 -1.144688 0.938213 1.029849 0.623912
i_A 1.825356 0.337394 1.961099 -1.697143 0.025176
25 v_V -0.212704 -1.479759 2.636582 -0.158017 0.262181
i_A 0.034549 -0.700572 1.698807 -0.324248 1.862543
50 v_V 0.792924 1.491777 -0.197562 -0.360991 -0.507311
i_A -0.804397 -0.011431 1.013257 -0.731444 -0.241442
1.1 15 v_V -1.055600 0.214343 0.814874 -0.262117 0.101457
i_A -1.489756 0.501986 -0.095838 -0.358071 0.593954
25 v_V -0.143782 0.241644 -2.829719 0.170969 -0.963931
i_A -0.971334 -0.659448 -1.063498 0.377788 0.197118
50 v_V 1.077378 -2.065788 1.238328 0.255115 0.196997
i_A -0.746125 0.569585 -0.951298 -0.773195 0.082101
As I mentioned, you can pivot 2-D data into other forms such as the one you show (see https://pandas.pydata.org/pandas-docs/stable/reshaping.html) to facilitate analysis or visualization.
I sometimes do that, but other times I just do a query to get the subset I want, which I find more straightforward than the multi-index syntax, though probably less efficient.
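For example, query can reference the named index levels of the MultiIndex frame built above directly (a sketch):

# query on MultiIndex level names, continuing the DataFrame example above
subset = df.query("f_nom == '1.0' and channel in ['v_V', 'i_A']")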
Sorry if my previous comment was too long and meandering. ☹️
I should add that serializing and deserializing my example with h5py is trivial:
import numpy as np
import pvlib
import h5py
# create some data
x = pvlib.pvsystem.singlediode(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
y = pvlib.pvsystem.singlediode(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
# set the dtypes to use as a structured array
my_dtype = np.dtype([
('i_l', float), ('i_0', float), ('r_s', float), ('r_sh', float), ('nNsVth', float),
('i_sc', float), ('v_oc', float), ('i_mp', float), ('v_mp', float),
('i', float, (1,100)), ('v', float, (1, 100))
])
# store the data in a structured array; note that each IV curve is a nested
# (1, 100) field
my_data = np.array([
(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
x['i_sc'], x['v_oc'], x['i_mp'], x['v_mp'], x['i'], x['v']),
(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
y['i_sc'], y['v_oc'], y['i_mp'], y['v_mp'], y['i'], y['v'])
], my_dtype)
# pretend that this is a grid of IV curves for a matrix of (E, T)
your_data = np.copy(my_data)
# reshape my_data and your_data from (2,) to (1, 2),
# and concatenate to make a fake grid
all_data = np.concatenate([my_data.reshape(1,2), your_data.reshape(1,2)], axis=0)
# output to a file
with h5py.File('THIS_IS_A_TEST_FILE.H5', 'w') as f:
f['data'] = all_data # key "data" is arbitrary, choose as many groups as you need
quit Python and restart:
import h5py
import numpy as np
# retrieve the data from file
with h5py.File('THIS_IS_A_TEST_FILE.H5', 'r') as f:
all_data = np.array(f['data'])
# do some fancy indexing:
all_data[1,1][['i', 'v']]
# ([[5.09950248e+00, 5.09674358e+00, ..., 1.41340468e+00, 7.65749212e-01, 7.99360578e-15]],
# [[ 0. , 0.339375 , 0.67875 , ..., 32.91937481, 33.2587498 , 33.5981248 ]])
Or use record arrays instead of structured arrays:
all_data_rec = np.rec.array(all_data) # as record array
all_data_rec.i_l
# array([[6.1, 5.1],
# [6.1, 5.1]])
AFAICT the only difference between structured and record arrays is the ability to use attributes for column names instead of fields.
@mikofski what might change if we want to store multiple IV curves, and the v vector has different lengths?
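(For reference, one possibility, though not settled in this thread, is h5py's variable-length dtype; a sketch assuming h5py >= 2.9:)

import h5py
import numpy as np

# ragged storage: each dataset element is its own 1-D float array
vlen = h5py.vlen_dtype(np.float64)
with h5py.File('ragged.h5', 'w') as f:
    dset = f.create_dataset('v', shape=(2,), dtype=vlen)
    dset[0] = np.linspace(0.0, 40.0, 100)  # 100-point sweep
    dset[1] = np.linspace(0.0, 35.0, 250)  # 250-point sweep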
Are we still (or were we ever) discussing a pvlib enhancement? If no, let's at least close the issue if not move it elsewhere.
At the moment, the discussion is relevant to the demonstration data for #229 and possibly to whatever we do with #511. I'm OK closing this as an issue, and taking up the discussion when we have a specific implementation to review. I'd rather see a pull request targeting iotools for reading/writing IV curve data for use in pvlib.
@mikofski You have convinced me to take a closer look at numpy's structured/record arrays :). The alternative I choose (pandas vs. numpy) will mostly depend on which "feels" more lightweight and natural in terms of things like complex slicing, concatenation, dealing with I-V curves of different lengths, and handling repeated measurements. Oh, did I mention that I also have normal-incidence QEs at three temperatures for this dataset too?
I don't see any big issues saving either alternative to HDF5, but I do need to further investigate the storage of metadata such as channel units, as well as settle upon the names (and maybe a standards effort would ultimately prefer netCDF with PV-specific "conventions"). Finally, do you know if it makes sense to transfer the HDF5 over the wire for a REST API, or would you anticipate a server-side JSON conversion?
@adriesse Pandas' pivoting is impressive and thanks for bringing that tool to my attention. I'm hoping that the "raw" data structure can be organized (at least for the IEC 61853-1 use case) such that it could be readily "understood" by a human who loads it out of storage and displays the data object for the first time, and it seems like the multi-index setup accomplishes that well.
@wholmgren I will close this issue now, but @cwhanse please reference this use case as the Orange Button initiative gets underway.
I'm looking for input on using/creating a "standard" data structure to store PV measurement datasets, such as I-V curves supporting IEC 61853-1. I'm thinking of something flexible/extensible and self-documenting (esp. w.r.t. units). This space also seems to intersect with time-series of I-V curves and maybe PECOS workflows.
For example, I have a collection of I-V-F-T curves, each with possibly varying numbers of points, that are each taken at a "nominal" matrix of effective irradiance F = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.1 (unitless) and temperature T = 15, 25, 50 degC. Sticking to just python and numpy (pandas doesn't seem like the right fit here), I came up with this dict-based structure:
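import numpy as np

# (reconstructed sketch; the keys and the 'i_A'/'v_V' fields follow the
#  access pattern described below, and the per-point 'f'/'t_degC' channels
#  are assumptions)
data = {
    ('0.1', '15 degC'): {'v_V': np.array([0.0, 20.1, 40.2]),
                         'i_A': np.array([0.51, 0.50, 0.00]),
                         'f': np.array([0.101, 0.100, 0.099]),
                         't_degC': np.array([15.0, 15.1, 15.1])},
    ('0.2', '15 degC'): {'v_V': np.array([0.0, 20.3, 40.6]),
                         'i_A': np.array([1.02, 1.00, 0.00]),
                         'f': np.array([0.201, 0.199, 0.200]),
                         't_degC': np.array([15.1, 15.0, 15.2])},
    # ... one entry per nominal (F, T) pair in the matrix
}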
In this case, I could retrieve the currents vector for a particular curve using data[('0.2', '15 degC')]['i_A']. I also need to concatenate (in a consistent order) all the currents, voltages, etc. from all the curves together. One could also imagine repeated I-V-F-T curve measurements at each nominal setting (with possibly a different number of points in each repetition). The ordered-pair keys can also be sorted in various ways using sorted(), as long as the chosen strings don't cause ordering problems. Note that replacing the keys with timestamps would produce time-series I-V-F-T curve data.
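For instance, continuing the sketch above, the concatenation might look like:

# concatenate all currents in a consistent (sorted-key) order
all_i_A = np.concatenate([data[key]['i_A'] for key in sorted(data)])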