has2k1 / plydata

A grammar for data manipulation in Python
https://plydata.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Filtering using boolean indexes? #20

Open georgemarrows opened 4 years ago

georgemarrows commented 4 years ago

Thank you very much for developing and maintaining plydata. It makes pandas usable for me.

In analysing data from some logs, I wanted to filter to the rows that matched a regex. I ended up using pandas' boolean indexes:

df = df[df.logMessage.map(lambda x: bool(re.search('"success": true', x)))]

Is there a more plydata way to do that? I would have expected a filter verb or an overload of query.

has2k1 commented 4 years ago

You can do it with query, but not in one expression:

idx = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
df = df >> query('@idx')

filter is a Python built-in, so we cannot use it as a verb name; that is why there is query. Since filter in dplyr is very similar to the pandas query method, I decided to just use that without changing the API. This is a limitation.
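
For illustration (a minimal sketch, not from this thread), the plydata verb and the pandas method accept the same expression syntax:

import pandas as pd
from plydata import query

df = pd.DataFrame({'x': [0, 1, 2, 3]})

df >> query('x > 1')  # plydata verb
df.query('x > 1')     # the pandas method it passes the task to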

You could also do it using the call helper verb, but it is not too pretty.

df = df >> call('__getitem__', df.logMessage.map(lambda x: bool(re.search('"success": true', x))))

Maybe the solution is to add a switch to query so that it can accept a boolean array. That would change the API, so I need to think about it some more.
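
For concreteness, a hypothetical sketch of such a switch; query2 is an illustrative name, not part of plydata's API:

import pandas as pd

def query2(df, expr_or_mask):
    # Hypothetical: accept either an expression string or a boolean array.
    if isinstance(expr_or_mask, str):
        return df.query(expr_or_mask)
    return df[expr_or_mask]  # a boolean array/Series indexes rows directly

# Usage, mirroring the example above:
#   mask = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
#   query2(df, mask)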

antonio-yu commented 4 years ago

Are there more convenient methods for using regex to filter rows, i.e. in the query function? In pandas, regular expressions seem to be limited in DataFrame.query. For example, df.query('col.str.contains("a")') doesn't work. In plydata it is the same situation: df >> query('col.str.contains("a")') doesn't work.

Meanwhile in R, df %>% filter(str_detect(col, 'a')) is very useful. Regex is frequently needed when filtering rows, so I wonder if there are some methods.

has2k1 commented 4 years ago

There isn't a way to filter using a regex.

df.query('col.str.contains("a")') doesn't work.

Yes, it does not work. query is limited and will stay that way, since it just passes the task on to pandas' DataFrame.query. But there is room for an alternative, more general method that would permit a specification like 'col.str.contains("a")' and so on.
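
For reference, plain pandas can evaluate str methods in query when the slower python engine is requested explicitly; this is a pandas feature, not something plydata's query exposes:

import pandas as pd

df = pd.DataFrame({'col': ['apple', 'berry', 'grape']})

# The default numexpr engine rejects .str methods; the python engine allows them.
df.query('col.str.contains("a")', engine='python')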

antonio-yu commented 4 years ago

Thanks for the reply. Regex is useful for filtering our Chinese words by rows; pandas can handle that, so it's not a big problem. Anyway, thanks a lot.

antonio-yu commented 4 years ago

Regex is useful for filtering our Chinese words by rows; pandas can handle that, so it's not a big problem.

The query function is just not that convenient, since I'm used to filter in R. I wish there were an alternative method that could replace the built-in query.

Just a personal wish, haha.

has2k1 commented 3 years ago

@antonio-yu, can you come up with a short specification/example of how you would expect regex filtering to work? Then we can start from there.

antonio-yu commented 3 years ago

@has2k1, hi Hassan. Here is an example:

import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']})

   x      y
0  0   zero
1  1    one
2  2    two
3  3  three

First, I want to select the rows in which y contains the letter 'o':

   x     y
0  0  zero
1  1   one
2  2   two

Here are some different ways to do it using regular expressions.

df1 <- df %>% filter(str_detect(y, 'o'))  # R

In pandas and plydata, however, the syntax df1 = df >> query('y.str.contains("o")') doesn't work, so I have to choose another way: df1 = df[df.y.str.contains('o')].

I also tried the similar package dplython, where the regrex works in a pipe operation: df1 = df >> sift(X.y.str.contains('o')).

Normally, I need to filter rows by regrex and then do all the further operations, like grouping and aggregation, in a chainable way: df %>% filter(str_detect(y, 'o')) %>% group_by(y) %>% summarise(num = n())  # R style

In pandas and plydata, the regrex doesn't work in the query function, df >> query('y.str.contains("o")'), so I have to stop the pipe operation and define a new variable, df1 = df[df.y.str.contains('o')], then continue with the further operations: df1['num'] = df1.groupby('y')['x'].transform('count').

The key point is that if one link breaks, the pipe operation stops. Regrex is always necessary when selecting rows and columns, especially in our Chinese sentences. I wish query in plydata could work the same way as filter in R, so that regrex, comparison operations and other boolean values can be parameters.
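
For illustration, one way to keep everything in a single chain today is the call helper shown earlier in this thread (a sketch; summarize(num='n()') assumes plydata's n() counting helper):

from plydata import call, group_by, summarize

result = (
    df
    >> call('__getitem__', df.y.str.contains('o'))  # boolean-mask filter
    >> group_by('y')
    >> summarize(num='n()')                         # count rows per group
)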

Hope that was clear 😄. Best wishes!

has2k1 commented 3 years ago

First, I want to select the rows in which y contains the letter 'o'

There are the query_all, query_at and query_if helpers, but I admit they are not easy to think of.

import pandas as pd

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']
})

df >> query_at('y', any_vars='{_}.str.contains("o")')

"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

# However in this case since we are querying a single column we do not need
# to use the '{_}' placeholder for the column name.
df >> query_at('y', any_vars='y.str.contains("o")')
"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

Normally, I need to filter rows by regrex and then do all the further operations

I got confused by what you meant by "regrex"; I thought you meant regular expression.

has2k1 commented 3 years ago

I think it merits a second function. Now, what to call it: sift, sieve, query2, ...?

antonio-yu commented 3 years ago

Sorry, I wrote 'regrex' wrongly; I meant 'regex'. 😂

The functions query_all, query_at, query_if are equivalent to the functions filter_all, filter_at, filter_if in R, right? They have the same logic. But in plydata these three functions always need a column selection and an any_vars= or all_vars= argument, which is sometimes really not easy to think of.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

df >> query_at('beta', any_vars='beta.isin(["b"])')

'''
  alpha beta theta  x  y     z
0     a    b     c  1  6   7.0
2     a    b     e  3  4  11.0
'''

If there were a new function that allowed me to write the code df >> query_at('beta', any_vars='beta.isin(["b"])') in the following simple way, df >> new_function('beta.isin(["b"])'), just like the filter syntax in R, df %>% filter(beta %in% c('b')), that would be easier and cooler. So, I think it just lacks a mirror function with the same logic as R's filter.

I think it merits a second function. Now, what to call it: sift, sieve, query2, ...?

query itself stays the same as pandas' DataFrame.query. So I think filter_by could be a good name: query_at equals filter_at, query_if equals filter_if, query_all equals filter_all, filter_by equals filter. This can be useful for users who move from R to Python.
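
To make the idea concrete, here is a rough standalone sketch of what a filter_by could do, approximated with pandas' eval; filter_by is hypothetical, not an existing plydata verb:

import pandas as pd

def filter_by(df, expr):
    # Hypothetical: evaluate a boolean expression and keep the matching rows.
    mask = df.eval(expr, engine='python')  # boolean Series
    return df[mask]

# Usage with the dataframe above:
#   filter_by(df, 'beta.isin(["b"])')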

antonio-yu commented 3 years ago

Hi @has2k1, how is everything going? Another question, about regex in select.


# Select columns by regex.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

# Select columns that end with 'a'; it works well.
df >> select(endswith='a')

  alpha beta theta
0     a    b     c
1     a    a     d
2     a    b     e
3     b    r     c
4     b    u     d
5     b    q     e

# But how to select columns that don't end with 'a'?

df >> select(endswith != 'a'), df >> select(endswith='-a'), df >> select(~(endswith='a')) ?

has2k1 commented 3 years ago

But how to select columns that don't end with 'a'?

You can use a regular expression:

df >> select(matches='[^a]$')
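
For reference, the same column selection in plain pandas, using the df from the previous comment:

# Keep only the columns whose names do not end with 'a'.
df.loc[:, ~df.columns.str.endswith('a')]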