vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[Q] Replacing and then counting substrings within a column of sentences #1035

Open jcalifornia opened 3 years ago

jcalifornia commented 3 years ago

Hi there, I was wondering if there is a preferred way of 1) mapping words within phrases to other words defined in a potentially large dictionary, and 2) count-encoding the mapped words.

So, for instance, if I have the following within a data column:

"Apple banana dog cat purple panda potato"
"Monkey green dog cat"

and the following mapping (matched case-insensitively):

{ 'apple': 'fruit', 'banana': 'fruit', 'potato': 'vegetable', 'dog': 'animal', 'cat': 'animal', 'panda': 'animal', 'monkey': 'animal', 'purple': 'color', 'green': 'color'}

I would want the following as a result of 1):

"fruit fruit animal animal color animal vegetable"
"animal color animal animal"

Actually, in my application, the words would be separated by commas and not spaces.

Then, if I wanted to count-encode, I would have the following as a result of 2):

fruit  vegetable  animal  color
2      1          3       1
0      0          3       1

Is there an elegant way of doing this, or should I iterate through the set of values in the dictionary in combination with https://vaex.io/docs/api.html#vaex.expression.StringOperations.replace to accomplish this? Thanks
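
A minimal sketch of the repeated-replace idea described above, written in plain Python rather than against the vaex API. The helper name `replace_words` is hypothetical, and the `'monkey'` entry is an assumption inferred from the expected output, which maps "Monkey" to "animal". Whole-word, case-insensitive matching is also an assumption, chosen so that "Apple" matches `'apple'` and "cat" does not match inside longer words.

```python
import re

# Mapping from the question, with quotes added around every value.
# 'monkey' is an assumption inferred from the expected output.
MAPPING = {
    "apple": "fruit",
    "banana": "fruit",
    "potato": "vegetable",
    "dog": "animal",
    "cat": "animal",
    "panda": "animal",
    "monkey": "animal",
    "purple": "color",
    "green": "color",
}


def replace_words(sentence, mapping):
    """Run one whole-word, case-insensitive replacement per dictionary key."""
    for word, category in mapping.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        sentence = re.sub(pattern, category, sentence, flags=re.IGNORECASE)
    return sentence


sentences = [
    "Apple banana dog cat purple panda potato",
    "Monkey green dog cat",
]
mapped = [replace_words(s, MAPPING) for s in sentences]
```

Each key costs one full pass over the text, so the work grows with the size of the dictionary. The sketch also assumes that no key appears among the replacement values; otherwise a later pass could rewrite an earlier result.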

maartenbreddels commented 3 years ago

Hi Josh,

Interesting question. The short-term solution is indeed to call replace many times. I do feel, however, that there should be a better and faster way to do this. Would you mind tokenizing/splitting the strings first? Would that work?

cheers,

Maarten
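
The tokenize-first idea suggested above can be sketched in plain Python; this is not vaex code, only an illustration of the logic. The helper names `map_tokens` and `count_encode` are hypothetical, the `'monkey'` entry is an assumption inferred from the expected output in the question, and lowercasing the tokens is an assumption made so that "Apple" matches `'apple'`. Unknown tokens are passed through unchanged.

```python
from collections import Counter

# Mapping from the question; 'monkey' is an assumption inferred from
# the expected output, which maps "Monkey" to "animal".
MAPPING = {
    "apple": "fruit",
    "banana": "fruit",
    "potato": "vegetable",
    "dog": "animal",
    "cat": "animal",
    "panda": "animal",
    "monkey": "animal",
    "purple": "color",
    "green": "color",
}

# Column order of the count-encoded result shown in the question.
CATEGORIES = ["fruit", "vegetable", "animal", "color"]


def map_tokens(sentence, mapping, sep=" "):
    """Split on sep, normalize case, and look each token up once."""
    tokens = [t.strip().lower() for t in sentence.split(sep) if t.strip()]
    return [mapping.get(t, t) for t in tokens]


def count_encode(sentence, mapping, categories, sep=" "):
    """Count how often each category occurs among the mapped tokens."""
    counts = Counter(map_tokens(sentence, mapping, sep))
    return [counts[c] for c in categories]


row1 = count_encode("Apple banana dog cat purple panda potato", MAPPING, CATEGORIES)
# The question notes that the real data is comma-separated.
row2 = count_encode("Monkey,green,dog,cat", MAPPING, CATEGORIES, sep=",")
```

Compared with repeated replacement, each token needs a single dictionary lookup, so the cost no longer depends on the number of entries in the mapping.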