pvlib / pvanalytics

Quality control, filtering, feature labeling, and other tools for working with data from photovoltaic energy systems.
https://pvanalytics.readthedocs.io
MIT License

Column Names and Translation Dictionaries #68

Open bt- opened 4 years ago

bt- commented 4 years ago

@cwhanse and @wfvining, I'm considering whether to pull some of pvcaptest's functionality out into a separate package, and I'd like your feedback on whether pvanalytics might be a good home for it or whether I should create a new package.

There are two closely related features that I'm considering pulling out:

A substantial amount of pvcaptest functionality depends on having a translation dictionary (CapData.column_groups). This approach was originally inspired by the Pecos package. Pecos enables using the translation dictionary concept, but doesn't generate them.

The more performance engineering work I do, especially on tests with longer time frames, the more I think it would be valuable to use Pecos. To facilitate this, I think it would make sense to move automatic translation dictionary generation out of pvcaptest into a more general purpose package (pvanalytics?) that can output a translation dictionary that can be used in both pvcaptest and pecos.

The pvcaptest code that generates translation dictionaries is contained in the translation dictionaries, the group_columns function, and the __series_type function. This algorithm works surprisingly well given how rudimentary it is, but it could definitely be improved.

I started on the tools to rename columns because of how much variety there is in column names coming from a wide range of DAS/SCADA vendors and projects. I think renaming has to be a first step to get reliable results from the algorithm that automatically generates the translation dictionary.

Look forward to hearing your thoughts!

wfvining commented 4 years ago

I think this would be a very good addition to PVAnalytics! It seems almost indispensable for any kind of automated analysis.

I like the group_columns() function. In the PVAnalytics style it would probably need to take either a DataFrame or a list of names and return a dict mapping the input names to 'categorical names'. That is pretty much what the function is already doing. What do you think of a pvanalytics.quality.names module for this and other related functionality (e.g. inferring units from raw column names)?
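As a rough illustration of that interface, here is a minimal sketch that accepts either a DataFrame-like object or a list of names and returns a dict from each input name to a category label. The function name `categorize_names`, the `KEYWORDS` table, and the category labels are all hypothetical; this is a simple substring-matching stand-in, not the pvcaptest algorithm or an existing PVAnalytics function.

```python
# Hypothetical keyword table: category label -> substrings that suggest it.
KEYWORDS = {
    "irr_poa": ("poa", "plane of array"),
    "irr_ghi": ("ghi", "global horizontal"),
    "temp_amb": ("ambient", "tamb"),
    "real_pwr": ("power", "kw"),
}


def categorize_names(data, keywords=KEYWORDS):
    """Map each column name to a category label, or None if nothing matches.

    `data` may be a DataFrame-like object (anything with a `.columns`
    attribute) or a plain iterable of column names.
    """
    names = getattr(data, "columns", data)
    result = {}
    for name in names:
        lowered = str(name).lower()
        # Take the first category whose keywords appear in the name.
        result[name] = next(
            (cat for cat, keys in keywords.items()
             if any(key in lowered for key in keys)),
            None,
        )
    return result


mapping = categorize_names(
    ["POA Irradiance 1", "Ambient Temp", "Meter kW", "Wind Speed"]
)
```

Names that match no keyword map to `None`, which leaves it to the caller to decide how unrecognized columns are handled.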

cwhanse commented 4 years ago

@bt- is the scope to host translation tools and also a library of known translation dicts?

bt- commented 4 years ago

@cwhanse, I am thinking primarily hosting tools to create "translation dictionaries", where the translation dictionary is the mapping from measurement category id to groups of column names.
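To make the data structure concrete, here is a small sketch of a translation dictionary in the sense used here: a mapping from a measurement category id to the group of column names in that category. The helper name `build_translation_dict`, the column names, and the category ids are hypothetical examples, not taken from pvcaptest.

```python
from collections import defaultdict


def build_translation_dict(name_to_category):
    """Invert {column name: category id} into {category id: [column names]}.

    Columns with a category of None (unrecognized) are left out.
    """
    groups = defaultdict(list)
    for name, category in name_to_category.items():
        if category is not None:
            groups[category].append(name)
    return dict(groups)


# Hypothetical per-column categories, e.g. produced by a grouping step.
name_to_category = {
    "POA Irradiance 1": "irr_poa",
    "POA Irradiance 2": "irr_poa",
    "Ambient Temp": "temp_amb",
    "Notes": None,
}

translation = build_translation_dict(name_to_category)
```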

But, I do think there should be a library of dictionaries to facilitate renaming columns. As an example, this would be helpful for renaming data from AlsoEnergy, who seem to consistently use sun to identify POA irradiance, to something more like poa irradiance. A basic version of this type of dictionary exists here. Based on my experience, it will be more effective to rename columns first and then try to group them.
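A minimal sketch of the renaming step, assuming a vendor-specific rename dictionary is applied as whole-word, case-insensitive replacements. The `ALSOENERGY_RENAMES` table, the `rename_columns` helper, and the sample column names are hypothetical; only the sun to poa irradiance pairing comes from the example above.

```python
import re

# Hypothetical vendor rename dictionary: vendor term -> preferred term.
ALSOENERGY_RENAMES = {"sun": "poa irradiance"}


def rename_columns(names, renames):
    """Return {old name: new name} after applying whole-word replacements."""
    out = {}
    for name in names:
        new_name = name
        for old, replacement in renames.items():
            # \b keeps "sun" from matching inside words such as "Sunrise".
            pattern = rf"\b{re.escape(old)}\b"
            new_name = re.sub(pattern, replacement, new_name, flags=re.IGNORECASE)
        out[name] = new_name
    return out


renamed = rename_columns(
    ["Weather Station Sun", "Sunrise Flag", "Inverter 1 kW"],
    ALSOENERGY_RENAMES,
)
```

The resulting dict has the shape that pandas `DataFrame.rename(columns=...)` accepts, so the grouping step could then run on the renamed columns.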

@wfvining, when I review the library overview the intuitive location to me is under system. I think the end point of this work would be the ability to automatically extrapolate system characteristics (type and quantity of sensors and equipment) as much as possible. But, I'll defer to your understanding of how the library is structured. It would be nice to have a name for the module that conveys that the grouping functionality is in it, but I haven't thought of anything better than names.

wfvining commented 4 years ago

I was thinking about this in terms of quality control on the column names, not so much about identifying which sensors/equipment exist. I could see it going in system when you put it like that. In that case system.names doesn't make much sense to me, maybe system.sensors?

cwhanse commented 4 years ago

Maybe an io or iotools module? This feature is motivated by getting data into shape for the pvanalytics functions.

wfvining commented 4 years ago

When I read io I think of functions for interacting with some external resource (a database, a file, or something on the network), as opposed to just identifying or manipulating column names for data that is already in memory.

bt- commented 4 years ago

I'm having trouble thinking of a good name that encompasses the renaming and the grouping functionalities without falling back to a name like util.

What about one of these options: system.utils, system.data_utils, or system.prep_data?

Or, I could see io being a good location as well, because I envision exporting the renamed data and the translation dictionary being helpful if you wanted to use them in Pecos in one workflow/notebook and then use the same translation dictionary again in pvcaptest or another workflow.
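One way the export could work is a plain JSON round trip, sketched below under the assumption that the translation dictionary is a dict of category ids to lists of column names. The file name and dictionary contents are hypothetical, and neither Pecos nor pvcaptest is called here.

```python
import json
import os
import tempfile

# Hypothetical translation dictionary produced in one workflow.
translation = {
    "irr_poa": ["POA Irradiance 1", "POA Irradiance 2"],
    "temp_amb": ["Ambient Temp"],
}

# Write it to disk so another notebook or tool can pick it up.
path = os.path.join(tempfile.mkdtemp(), "translation.json")
with open(path, "w") as f:
    json.dump(translation, f, indent=2)

# Later, in a different workflow, read the same dictionary back.
with open(path) as f:
    loaded = json.load(f)
```

JSON keeps the file readable and editable by hand, which may matter if the automatically generated groups need manual correction.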

Maybe: io.naming and io.grouping ?

wfvining commented 4 years ago

We already have a pvanalytics.util module that could work. It doesn't currently have any public API functions, but I don't see why it couldn't.