Open georgemarrows opened 4 years ago

Thank you very much for developing and maintaining plydata. It makes pandas usable for me.

In analysing data from some logs, I wanted to filter to rows which matched a regex. I ended up using pandas' boolean indexes (sketched below). Is there a more plydata way to do that? I would have expected a filter verb or an overload of query.
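A minimal sketch of that boolean-index approach, assuming the logMessage column and the pattern that appear in the reply below:

import re

# Build a boolean mask by testing each log message against the regex,
# then index the DataFrame with it (df is assumed to already exist).
mask = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
matched = df[mask]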
You can do it with query, but not in one expression:

idx = df.logMessage.map(lambda x: bool(re.search('"success": true', x)))
df = df >> query('@idx')
filter is a Python built-in, so we cannot use it as a verb; that is why we have query. Since filter in dplyr is very similar to the pandas query method, I decided to just use that without changing the API. This is a limitation.
You could also do it using the call helper verb, but it is not too pretty:

df = df >> call('__getitem__', df.logMessage.map(lambda x: bool(re.search('"success": true', x))))
Maybe the solution is to add a switch to query so that it can accept a boolean array. That would change the API, so I need to think about it some more.
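A hypothetical sketch of that switch (query_bool is an assumed name, not plydata API): accept either an expression string or a precomputed boolean array.

import numpy as np
import pandas as pd

def query_bool(df, expr_or_mask):
    # A boolean Series/array selects rows directly ...
    if isinstance(expr_or_mask, (pd.Series, np.ndarray)):
        return df[expr_or_mask]
    # ... otherwise defer to pandas DataFrame.query, as plydata's query does.
    return df.query(expr_or_mask)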
Are there more convenient methods for using regex when filtering rows, i.e. in the query function?

In pandas, regular expressions seem to be limited in DataFrame.query. For example, df.query('col.str.contains("a")') doesn't work. In plydata it is the same situation: df >> query('col.str.contains("a")') doesn't work. Meanwhile in R, df %>% filter(str_detect(col, 'a')) is very useful.

Regex is always needed when filtering rows, so I wonder if there are some methods.
There isn't a way to filter using a regex.

df.query('col.str.contains("a")') doesn't work.

Yes, it does not work. query is limited and will stay that way, since it just passes the task on to pandas DataFrame.query. But there is room for an alternative, more general method that would permit a specification like 'col.str.contains("a")' and so on.
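For reference, plain pandas can already evaluate such an expression when told to use the slower python engine (a pandas note, not plydata API):

# Method calls like .str.contains work in DataFrame.query with engine='python'.
df.query('col.str.contains("a")', engine='python')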
Thanks for the reply. Regex is useful for filtering rows of our Chinese text; pandas can handle that, so it is not a big problem. Anyway, thanks a lot.

The query function is just not that convenient, since I am used to filter in R. I wish there were an alternative method that could replace the built-in query. Just a personal wish, hhh.
@antonio-yu, can you come up with a short specification/example of how you would expect regex filtering to work? Then we can start from there.
@has2k1, hi Hassan. Here is an example:

df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']})

   x      y
0  0   zero
1  1    one
2  2    two
3  3  three
First, I want to select the rows in which y contains the letter 'o':

   x     y
0  0  zero
1  1   one
2  2   two
There are different ways to do it using regular expressions.

In R:

df1 <- df %>% filter(str_detect(y, 'o'))

In pandas and plydata, the syntax df1 = df >> query('y.str.contains("o")') doesn't work, so I have to choose another way: df1 = df[df.y.str.contains('o')].

I also tried the similar package dplython, where regex works in a pipe operation:

df1 = df >> sift(X.y.str.contains('o'))
Normally, I need to filter rows by regrex and then do all the further operations, like grouping and aggregation, in a chainable way:

df %>% filter(str_detect(y, 'o')) %>% group_by(y) %>% summarise(num = n()) # R style

In pandas and plydata, the regrex doesn't work in query (df >> query('y.str.contains("o")')), so I have to stop the pipe operation and define a new variable, df1 = df[df.y.str.contains('o')], then continue the further operations: df1['num'] = df1.groupby('y').transform('count').

The key point is that if one link breaks, the pipe operation stops. Regrex is always necessary when selecting rows and columns, especially with our Chinese sentences. I wish query in plydata could work the same way as filter in R, so that regrex, comparison operations and other boolean values could be parameters.

Hope that I was clear 😄. Best wishes!
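For what it's worth, plain pandas can keep such a chain unbroken by passing a callable to .loc (a pandas-only sketch, using the small df from the example above):

# Regex filter, then group and count, without leaving the chain.
(df
 .loc[lambda d: d.y.str.contains('o')]
 .groupby('y')
 .size()
 .reset_index(name='num'))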
First, I want to select the rows in which y contains the letter 'o'
There are the query_all, query_at and query_if helpers, but I admit they are not easy to think of.
df = pd.DataFrame({
    'x': [0, 1, 2, 3],
    'y': ['zero', 'one', 'two', 'three']
})

df >> query_at('y', any_vars='{_}.str.contains("o")')
"""
   x     y
0  0  zero
1  1   one
2  2   two
"""

# However, in this case, since we are querying a single column, we do not need
# to use the '{_}' placeholder for the column name.
df >> query_at('y', any_vars='y.str.contains("o")')
"""
   x     y
0  0  zero
1  1   one
2  2   two
"""
Normally, I need to filter rows by regrex and then do all the further operations

I got confused by what you meant by "regrex"; I thought you meant regular expressions.
I think it merits a second function. Now, what to call it: sift, sieve, query2, ...?
Sorry, I wrote 'regrex' wrongly, I meant 'regex'. 😂

The functions query_all, query_at and query_if are equivalent to filter_all, filter_at and filter_if in R, right? They have the same logic. But in plydata these three functions always need some selected columns plus an any_vars= or all_vars= argument, which is sometimes really not easy to think of.
df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

df >> query_at('beta', any_vars='beta.isin(["b"])')
'''
  alpha beta theta  x  y     z
0     a    b     c  1  6   7.0
2     a    b     e  3  4  11.0
'''
If there were a new function that allowed me to write df >> query_at('beta', any_vars='beta.isin(["b"])') in the simpler form df >> new_function('beta.isin(["b"])'), just like the filter syntax in R, df %>% filter(beta %in% c('b')), that would be easier and cooler.

So, I think plydata just lacks a mirror function with the same logic as R's filter.
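Until such a verb exists, one workaround that stays inside the pipe is the call helper shown earlier in this thread (a sketch; it assumes call forwards the method name and arguments to the DataFrame, as in the __getitem__ example above):

# Run the pandas query method through plydata's call helper,
# keeping the membership filter inside the pipe.
df >> call('query', 'beta in ["b"]')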
I think it merits a second function. Now, what to call it: sift, sieve, query2, ...?
query itself stays the same as pandas DataFrame.query. So I think filter_by would be a good name: query_at equals filter_at, query_if equals filter_if, query_all equals filter_all, and filter_by equals filter. This mapping can be useful for users who are converting from R to Python.
Hi @has2k1, how is everything going?

Another question, about regex in select.

# Select columns by regex.
df = pd.DataFrame({
    'alpha': list('aaabbb'),
    'beta': list('babruq'),
    'theta': list('cdecde'),
    'x': [1, 2, 3, 4, 5, 6],
    'y': [6, 5, 4, 3, 2, 1],
    'z': [7, 9, 11, 8, 10, np.nan]
})

# Select columns that end with 'a'; this works well.
df >> select(endswith='a')

  alpha beta theta
0     a    b     c
1     a    a     d
2     a    b     e
3     b    r     c
4     b    u     d
5     b    q     e

# But how to select columns that don't end with 'a'?
df >> select(endswith != 'a'), df >> select(endswith='-a'), or df >> select(~(endswith='a'))?
But how to select columns that don't end with 'a'?

You can use a regular expression:

df >> select(matches='[^a]$')
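Assuming the df from the example above, that keeps exactly the columns whose names do not end in 'a':

df >> select(matches='[^a]$')
"""
   x  y     z
0  1  6   7.0
1  2  5   9.0
2  3  4  11.0
3  4  3   8.0
4  5  2  10.0
5  6  1   NaN
"""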