nickmckay / LiPD-utilities

Input/output and manipulation utilities for LiPD files in Matlab, R and Python
http://nickmckay.github.io/LiPD-utilities/
GNU General Public License v2.0

Python: possible to use nested dictionary keys in filterTs() criteria? #45

Closed leneklock closed 5 years ago

leneklock commented 5 years ago

Hi, I am using the LiPD utilities for Python and I would like to filter a data set by the temporal resolution of the records. But 'paleoData_hasResolution' is itself a dictionary, and using the filterTs() command like this:

highres=lipd.filterTs(alldata,"paleoData_hasResolution['hasMedianValue']<5")

returns "Invalid input expression". Is it possible to use nested dictionary keys (not sure if that's the right term) in the filterTs() command? I can probably find a way around that problem, but it would be so handy to use the filterTs for this.

Thank you! Marlene

fzhu2e commented 5 years ago

Hi Marlene,

Thanks for the feedback!

I am not the developer of LiPD, but according to the source code, the regular expression that filterTs() uses to parse the expression is re.compile(r"((\w+_?)\s*(is|in|greater than|equals|equal|less than|<=|==|>=|=|>|<){1}[\"\s\']*([\"\s\w\d]+|-?\d+.?\d*)[\"\s&\']*)"), which can only parse expressions of the form "key operator value", where the key consists of word characters only. Brackets and quotes inside the key do not fit that pattern, so I believe nested dict keys are not supported here.
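To see why the nested expression is rejected, the pattern quoted above can be tried directly. This is only a sketch: the pattern is copied from the comment, but how filterTs() applies it internally is an assumption, and `parse` is a hypothetical helper, not part of the lipd package.

```python
import re

# Pattern copied from the comment above. Whether filterTs() uses search(),
# match() or something else internally is an assumption of this sketch.
FILTER_RE = re.compile(
    r"((\w+_?)\s*(is|in|greater than|equals|equal|less than|<=|==|>=|=|>|<){1}"
    r"[\"\s\']*([\"\s\w\d]+|-?\d+.?\d*)[\"\s&\']*)"
)


def parse(expression):
    """Hypothetical helper: return (key, operator, value), or None if no match."""
    m = FILTER_RE.search(expression)
    if m is None:
        return None
    return m.group(2), m.group(3), m.group(4)


# A flat key made of word characters fits the "key operator value" shape.
flat = parse("paleoData_hasResolution_hasMedianValue < 5")
print(flat)    # ('paleoData_hasResolution_hasMedianValue', '<', '5')

# The bracket and quote characters sit between the key and the operator,
# so the pattern finds nothing to match.
nested = parse("paleoData_hasResolution['hasMedianValue']<5")
print(nested)  # None
```

The key group `(\w+_?)` accepts only letters, digits and underscores, which is why `['hasMedianValue']` cannot be part of a key.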

To filter a list of time series, you can write a for loop like the following:

ts_matches = []

for ts_data in ts_list:
    # here you can write whatever complicated conditions you need
    resolution = ts_data.get('paleoData_hasResolution', {})
    if 'hasMedianValue' in resolution and float(resolution['hasMedianValue']) < 5:
        ts_matches.append(ts_data)

print(f'Found {len(ts_matches)} matches')

Hope that helps!

Best, Feng

CommonClimate commented 5 years ago

Thanks @fzhu2e for this solution, which is unfortunately not very user-friendly. @chrismheiser, can filterTs() be amended to more naturally handle multiple queries? It is an extremely common use case. Deborah suggests flattening the nested dictionaries for filtering purposes, though you might well have other ideas.

Basically, if ts gathers all the timeseries objects, it would look like this:

selection = lipd.filterTs(ts, property1 = x1, property2 = x2, property3 = x3, ...)
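The flattening suggestion combined with keyword-style queries might look roughly like the sketch below. The names `flatten_ts` and `filter_by_properties` are hypothetical and not part of the lipd package, the sample records are made up, and only equality tests are shown.

```python
def flatten_ts(entry, sep="_"):
    """Return a copy of `entry` with nested dicts collapsed into joined keys."""
    flat = {}
    for key, value in entry.items():
        if isinstance(value, dict):
            # Recurse, then prefix each child key with the parent key.
            for sub_key, sub_value in flatten_ts(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def filter_by_properties(ts_list, **properties):
    """Keep the entries whose flattened keys equal every requested value."""
    matches = []
    for entry in ts_list:
        flat = flatten_ts(entry)
        if all(flat.get(key) == value for key, value in properties.items()):
            matches.append(entry)  # the original, unflattened entry is kept
    return matches


# Made-up sample records, for illustration only.
ts_list = [
    {"dataSetName": "A", "archiveType": "coral",
     "paleoData_hasResolution": {"hasMedianValue": 1.0}},
    {"dataSetName": "B", "archiveType": "lake sediment",
     "paleoData_hasResolution": {"hasMedianValue": 12.0}},
]

selection = filter_by_properties(
    ts_list,
    archiveType="coral",
    paleoData_hasResolution_hasMedianValue=1.0,
)
print([entry["dataSetName"] for entry in selection])  # ['A']
```

Keyword arguments express equality naturally; inequalities such as "less than 5" would need a different calling convention, for example expression strings or callables.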

chrismheiser commented 5 years ago

@fzhu2e that is exactly how it works.

@CommonClimate I could do either method. I sent an e-mail with explanations.

fzhu2e commented 5 years ago

Hi @chrismheiser , I just wrote a prototype:

def filterTs(ts_list, *conditions):
    ''' Filter timeseries based on conditions

    Args:
        ts_list (list of LiPD timeseries): the list of timeseries to be filtered
        *conditions (str): one or more conditions to filter on; each must use "ts" as a placeholder,
            for example: "float(ts['paleoData_hasResolution']['hasMedianValue']) < 5"

    Returns:
        ts_matches (list of LiPD timeseries): the timeseries that satisfy every condition
    '''
    ts_matches = []

    for ts in ts_list:
        matches = True
        for cond in conditions:
            try:
                # "ts" inside cond refers to the loop variable; eval is unsafe on untrusted input
                if not eval(cond):
                    matches = False
                    break
            except (KeyError, TypeError, ValueError):
                # a condition that cannot be evaluated (e.g. a missing key) counts as no match
                matches = False
                break
        if matches:
            ts_matches.append(ts)

    print(f'Found {len(ts_matches)} matches')

    return ts_matches

The demo notebook: https://nbviewer.jupyter.org/gist/fzhu2e/d60b08f6a7a161b84284508504db4012

It uses eval, which is not safe because it executes whatever string it is given. It would be nice if we could somehow translate a string into a condition and evaluate it safely.
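One way to sidestep eval entirely is to accept callables instead of strings. This is a minimal sketch under that assumption; `filter_ts_by_predicates` is a hypothetical name, not a lipd function, and the sample records are made up.

```python
def filter_ts_by_predicates(ts_list, *predicates):
    """Keep the entries for which every predicate returns a truthy value."""
    matches = []
    for ts in ts_list:
        try:
            keep = all(predicate(ts) for predicate in predicates)
        except (KeyError, TypeError, ValueError):
            # A predicate that cannot be evaluated (e.g. missing key) means no match.
            keep = False
        if keep:
            matches.append(ts)
    return matches


# Made-up sample records, for illustration only.
ts_list = [
    {"dataSetName": "A", "paleoData_hasResolution": {"hasMedianValue": "1.5"}},
    {"dataSetName": "B", "paleoData_hasResolution": {"hasMedianValue": "20"}},
    {"dataSetName": "C"},  # no resolution metadata at all
]

highres = filter_ts_by_predicates(
    ts_list,
    lambda ts: float(ts["paleoData_hasResolution"]["hasMedianValue"]) < 5,
)
print([ts["dataSetName"] for ts in highres])  # ['A']
```

The trade-off is that users write a lambda rather than a plain string, but no user-supplied text is ever executed as code.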

fzhu2e commented 5 years ago

@chrismheiser Another way I was thinking of: we ask users to write the key for a nested dictionary with the levels joined by dots, e.g. paleoData_hasResolution.hasMedianValue. We then split it into a list of strings and use that list to walk down the dictionary to the target value:

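key = 'paleoData_hasResolution.hasMedianValue'
strings = key.split('.')
print(strings)

for ts in ts_list[:1]:
    # rebinding ts_tmp never modifies ts, so no copy is needed
    ts_tmp = ts
    for string in strings:
        try:
            ts_tmp = ts_tmp[string]  # go down one level per key
        except (KeyError, TypeError):
            # key missing at this level: stop with no result
            ts_tmp = None
            break

    target_value = ts_tmp
    print(target_value)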

This way we don't have to do anything on the timeseries dictionary itself.
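The dotted-key idea above can be combined with a small operator table to evaluate a "key operator value" string without eval. This is a sketch only: `get_nested`, `condition_holds` and `OPS` are hypothetical names, the three parts of the expression are assumed to be separated by whitespace, and only numeric comparisons are handled.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def get_nested(entry, dotted_key):
    """Follow a dotted key through nested dicts; return None if any level is missing."""
    current = entry
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def condition_holds(entry, expression):
    """Evaluate a whitespace-separated 'dotted.key operator number' expression."""
    parts = expression.split()
    if len(parts) != 3 or parts[1] not in OPS:
        raise ValueError(f"unsupported expression: {expression!r}")
    key, op, raw_value = parts
    found = get_nested(entry, key)
    if found is None:
        return False
    try:
        return OPS[op](float(found), float(raw_value))
    except (TypeError, ValueError):
        return False  # non-numeric values never match in this sketch


# Made-up sample record, for illustration only.
entry = {"paleoData_hasResolution": {"hasMedianValue": "3.2"}}
print(condition_holds(entry, "paleoData_hasResolution.hasMedianValue < 5"))   # True
print(condition_holds(entry, "paleoData_hasResolution.hasMedianValue >= 5"))  # False
```

Because the key is only ever used for dictionary lookups and the value is only passed to float(), nothing in the expression is executed.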

chrismheiser commented 5 years ago

I like the second idea much better. You can provide one filter, or a list of filters. I'd like to get rid of regexes if possible and split the string on underscores or periods instead (one or the other, not a mix of both). We don't need to worry about preserving the time series because it copies the matched entries over to a new time series.

leneklock commented 5 years ago

@fzhu2e thank you for the workaround with the loop! That's more or less how I solved it in the end, and it works fine.

chrismheiser commented 5 years ago

Quote from the e-mail sent about the issue:

The LiPD python package 0.2.6.5 now has the nested data update. This works for any level of nesting within the column level for extracting and collapsing a time series. It works the way we discussed earlier. Each underscore denotes another level of nesting. “paleoData_hasResolution_hasMedianValue” etc. QueryTs and FilterTs.

I’ll work on updating R to work the same way and possibly expand it to root data too if needed.

I didn't run into any errors in my tests, but post here if you have issues.
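Because time series keys such as paleoData_hasResolution already contain underscores, an underscore-based lookup has to decide which underscores mark nesting. The sketch below shows one possible way to resolve that, trying the longest existing prefix first. It is an illustration only and not the package's actual implementation; `resolve_underscore_key` is a hypothetical name and the sample record is made up.

```python
def resolve_underscore_key(entry, key, sep="_"):
    """Look up `key`, treating each separator as a possible nesting level.

    Raises KeyError if no combination of prefixes leads to a value.
    """
    if key in entry:
        return entry[key]  # the key exists as-is at this level
    parts = key.split(sep)
    # Try the longest prefix first, then progressively shorter ones.
    for i in range(len(parts) - 1, 0, -1):
        prefix = sep.join(parts[:i])
        child = entry.get(prefix)
        if isinstance(child, dict):
            rest = sep.join(parts[i:])
            try:
                return resolve_underscore_key(child, rest, sep)
            except KeyError:
                continue  # this split did not work; try a shorter prefix
    raise KeyError(key)


# Made-up sample record, for illustration only.
entry = {
    "paleoData_variableName": "d18O",
    "paleoData_hasResolution": {"hasMedianValue": 3.2, "units": "yr"},
}

print(resolve_underscore_key(entry, "paleoData_variableName"))                  # d18O
print(resolve_underscore_key(entry, "paleoData_hasResolution_hasMedianValue"))  # 3.2
```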