vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks source link

[FEATURE-REQUEST] Opposite String Startswith Search in VAEX Dataframe #2098

Open khanfarhan10 opened 2 years ago

khanfarhan10 commented 2 years ago

Description Say we have a string to match in a database,

We can accomplish that by a simple select and evaluate in VAEX:

df_vaex.select(df_vaex["name"].str.startswith(search_string))

However, this searches for search_string in Database Entries rather than Database Entries in search_string.

Can this be performed using Vaex?

Is your feature request related to a problem? Please describe.

search_string1 = "ASTHA MAT" 
search_string2 = "ASTHA MATERIALS INDIA" 

df_vaex.select(df_vaex["name"].str.startswith(search_string1))
df_vaex.evaluate(df_vaex["name"], selection=True)
# ASTHA MATERIALS

df_vaex.select(search_string2.startswith(df_vaex["name"]))
# TypeError: startswith first arg must be str or a tuple of str, not Expression

Additional context Would be great to have a reverse search technology in a vectorized fashion for quick searching as in pandas.Series.isin!

khanfarhan10 commented 2 years ago

@JovanVeljanoski any idea on existing approaches to solve this problem?

JovanVeljanoski commented 2 years ago

Vaex does have isin ... df.x.isin(my_list) for example.

khanfarhan10 commented 2 years ago

Hmmmph, it's strange I didn't notice that. However, can't find any documentation for it nor can find it by searching on the docs.

Nevertheless, Thanks for the quick response!

JovanVeljanoski commented 2 years ago

@khanfarhan10

Here it is: https://vaex.io/docs/api.html#vaex.expression.Expression.isin

khanfarhan10 commented 2 years ago

Hey, reopening this as the isin string operation does not achieve the desired results!

I need a similar function like startswith but that operates more like the startswith operator in a swapped way.

Lets say e is an element belonging to the dataframe's column,

What I wish to perform is :

search_string.startswith(e) rather than performing e.startswith(search_string)

for each element e in the df column

khanfarhan10 commented 2 years ago

For a minimalistic reproducible example :

dict_data = dict(name=["ASTHA MATERIALS" , "LOREM IPSUM" ], locationID=[5454,6767]) # with other cols as well

dict_data
# {'name': ['ASTHA MATERIALS', 'LOREM IPSUM'], 'locationID': [5454, 6767]}
df_vaex = vaex.from_dict(dict_data)
df_vaex
# name locationID
0 ASTHA MATERIALS 5454
1 LOREM IPSUM 6767
search_string = "ASTHA MAT"
df_vaex.select(df_vaex["name"].str.startswith(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f98906910>
# [
#  "ASTHA MATERIALS"
# ]
search_string = "ASTHA MATERIALS INDIA"
df_vaex.select(df_vaex["name"].str.startswith(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d75520>
# []

This is the search I need.

khanfarhan10 commented 2 years ago

Stuff I tried (and failed) :

df_vaex["name"].isin([search_string])
df_vaex["name"].str.find(search_string)>-1
df_vaex["name"].str.contains(search_string)
JovanVeljanoski commented 2 years ago

How would you do it with pandas?

khanfarhan10 commented 2 years ago

I can't find a workaround in pandas too.

Apparently, is this one of those sad cases that cannot be done without a extremely slow apply/lambda function?

maartenbreddels commented 2 years ago

looks that way.. although our apply should be parallel (multiprocessing), but it's a good point, I'll see if we can have the reverse without too many changes.

khanfarhan10 commented 2 years ago
search_string = "ASTHA MATERIALS INDIA"
# Try lambda and apply stuff
df_vaex.select(df_vaex.apply(lambda element: search_string.startswith(element), [df_vaex["name"]]))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d33c20>
# [
#   "ASTHA MATERIALS"
# ]

Working with this for now!

khanfarhan10 commented 2 years ago

Also notable to mention here that regexes created dynamically might be of some (but not great) help!

search_string = r"ASTHA MATERIALS\s?(INDIA)?"
df_vaex.select(df_vaex["name"].str.contains(search_string))
df_vaex.evaluate(df_vaex["name"], selection=True)
# <pyarrow.lib.StringArray object at 0x7f3f99d33f30>
# [
#   "ASTHA MATERIALS"
# ]
Ben-Epstein commented 2 years ago

@khanfarhan10 I would think regex is your best bet. If you want something a bit more specific you could do a registered function

import vaex
import pyarrow as pa

dict_data = dict(name=["ASTHA MATERIALS" , "LOREM IPSUM" ], locationID=[5454,6767]) # with other cols as well

df = vaex.from_dict(dict_data)
search_string = "ASTHA MATERIALS INDIA"

@vaex.register_function()
def str_contains_col(col_vals, str_search):
    return pa.array([str_search.startswith(v) for v in col_vals.to_pylist()])

df.func.str_contains_col(df["name"], search_string)
khanfarhan10 commented 2 years ago

That works like a rocket @Ben-Epstein !

Feel free to close/add documentation/samples. @maartenbreddels @JovanVeljanoski @djsutherland

khanfarhan10 commented 2 years ago

Just for comment this is still 10X slower than the same pandas equivalent, but yeah, does the job.

@vaex.register_function()
def str_contains_col(col_vals, str_search):
    # return pa.array([str_search.find(v) != -1 for v in col_vals.to_pylist()])
    return pa.array([v in str_search for v in col_vals.to_pylist()])
JovanVeljanoski commented 2 years ago

Why don't you use pandas @khanfarhan10 ?

khanfarhan10 commented 2 years ago

Yeah, can use pandas but had migrated to using VAEX as I was consistenly using it for projects. Apparently I believe there is no way to speed this up using Vaex.

JovanVeljanoski commented 2 years ago

Is it possible to provide some (fake) data to replicate this on our end. This might never be super fast, but I don't see a reason why it should be much slower than pandas. Worst case should be just as fast ( or just as slow).

I could be missing something tho.

ukanchan commented 1 year ago

Is it possible to use case in contains function in Vaex?

e.g. df.x.str.contains("Abc", case=False) // df is vaex dataframe

I want to get all records which has substring in particular column but without restricted to case sensitive .

maartenbreddels commented 1 year ago

Could you open a new issue for that, I don't think it's related to this issue right?

maartenbreddels commented 1 year ago

In case you don't, and I forget, this seemed to work:

image

(?i) makes it case insensitive

ukanchan commented 1 year ago

Thank you !!

ukanchan commented 1 year ago

How we can use ignore case in startswith and endswith in Vaex? I tried with regular expression like: startswith("(?i)"+val, regex = True). But it didn't work.

Hoping to hear back from the team. Thanks.