vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] Efficient Join on Index with Huge DataFrame #2262

Closed. SohamTamba closed this issue 1 year ago.

SohamTamba commented 1 year ago

Description

I have a huge DataFrame X and a medium-sized DataFrame m. m contains a column X_idx that holds the row number of X I want to merge on. Note that X_idx can contain duplicates.

Example:

import vaex

len_X = 100 * 1000 * 1000
X = vaex.from_arrays(
    a=['a_' + str(i) for i in range(len_X)],
    b=['b_' + str(i) for i in range(len_X)],
)
m = vaex.from_arrays(
    X_idx=[51, 73, 51],
    name=['fifty one', 'seventy three', 'fifty one (1)'],
)

Desired output of the merge:

X_idx | name            | a    | b
------|-----------------|------|------
51    | 'fifty one'     | a_51 | b_51
73    | 'seventy three' | a_73 | b_73
51    | 'fifty one (1)' | a_51 | b_51

A simple way to do this is:

import numpy as np

X['X_idx'] = np.arange(len(X))  # explicit row-number column to join on
m.join(
    X, how='inner', on='X_idx', allow_duplication=True, inplace=True
)

But there should be a more efficient way to do this, since I am joining on the index column.

I tried:

m.join(
    X, how='inner', left_on='X_idx', right_on='#', allow_duplication=True, inplace=True
)

but I get the error: ValueError: expected one expression, got 0.

Is your feature request related to a problem? Please describe.

I am trying to use FAISS for nearest neighbor search, which is a common task in semantic search (docs: https://github.com/facebookresearch/faiss/wiki/Getting-started). FAISS returns the integer index of the row of the nearest neighbor; I then need to retrieve that row of the DataFrame from the index.

The embedding DataFrame is huge, so I need to optimize runtime and memory usage.
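To make the intended workflow concrete, here is a minimal sketch; the random embeddings, the dimensionality d, the row_id column, and the use of X.take to fetch the matching rows are illustrative assumptions, not part of the actual setup:

import numpy as np
import faiss  # pip install faiss-cpu
import vaex

# Toy stand-ins for the real embedding matrix and DataFrame (assumption).
d = 64
embeddings = np.random.random((100_000, d)).astype('float32')
X = vaex.from_arrays(row_id=np.arange(len(embeddings)))

# Exact in-RAM index, as in the FAISS "Getting started" guide.
index = faiss.IndexFlatL2(d)
index.add(embeddings)

# search() returns distances and the integer row numbers of the nearest neighbors.
queries = np.random.random((3, d)).astype('float32')
distances, neighbor_idx = index.search(queries, 5)

# Pull those rows back out of the vaex DataFrame by position
# (assumes DataFrame.take accepts an array of row numbers, possibly repeated).
neighbors = X.take(neighbor_idx.ravel())
print(neighbors)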

Additional context

I'm new to VAEX.

JovanVeljanoski commented 1 year ago

Hey,

So in your join operation, the right_on argument is '#', but what is that? You should pass the column of the right DataFrame that you are joining on.

So, following your example, if I just do this:

m.join(
    X, how='inner', left_on='X_idx', right_on='X_idx', allow_duplication=True, inplace=True
)

Things work as expected.

JovanVeljanoski commented 1 year ago

Hey, I just had a quick look at the FAISS wiki and saw this:

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python (versions 2 and 3). Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Facebook AI Research with help from external contributors.

Especially this part:

up to ones that possibly do not fit in RAM.

If you (or someone) can provide an example of this, maybe we can include a wrapper in vaex-ml so you can use it easily with vaex. We did something similar with annoy in the past. Just an idea at this point.

SohamTamba commented 1 year ago

Hi @JovanVeljanoski, to clarify: my question is how to accomplish the above result more efficiently (i.e. reduce the runtime).

If I were using pandas instead of vaex, I would optimize the DataFrame join by using the default index:

return m.merge(
    X,
    how='inner',
    left_on='X_idx', right_index=True
)

Here I am merging on the default index of X, which is numbered 0 to len(X) - 1.

If pandas did not have such a default index, I would create my own index to optimize the table join:

X['X_idx'] = np.arange(len(X))
X = X.set_index('X_idx')  # turn X['X_idx'] into the index to make the join more efficient
return m.merge(
    X,
    how='inner',
    left_on='X_idx', right_index=True
)

Reference: https://www.statology.org/pandas-merge-on-index/

# refers to the default index column of VAEX.

print(X) returns the following if loaded as a vaex.DataFrameLocal:

#   | X_idx | a   | b
----|-------|-----|-----
0   | 0     | a_0 | b_0
1   | 1     | a_1 | b_1
2   | 2     | a_2 | b_2
... | ...   | ... | ...

I am assuming that the DataFrame X is indexed on #, so searching for values in the # column is much more efficient than searching in X_idx. Please let me know if that is not true.

JovanVeljanoski commented 1 year ago

Vaex has no concept of an index the way pandas does. Any column can be used as a join key, and there is no performance difference between them. I know of a trick to make the join more memory efficient at the expense of speed, but in terms of runtime, this is the out-of-the-box performance.

SohamTamba commented 1 year ago

Got it: indexing cannot be used to improve the join's efficiency. Thanks @JovanVeljanoski!
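A join-free alternative for this exact lookup pattern, sketched under the assumption (not confirmed in this thread) that vaex's DataFrame.take accepts repeated row numbers and returns the rows in the requested order:

import vaex

# Small stand-ins for the huge X and the medium m from the original example.
X = vaex.from_arrays(a=['a_' + str(i) for i in range(100)],
                     b=['b_' + str(i) for i in range(100)])
m = vaex.from_arrays(X_idx=[51, 73, 51],
                     name=['fifty one', 'seventy three', 'fifty one (1)'])

idx = m.evaluate('X_idx')                    # plain array of row numbers, duplicates included
rows = X.take(idx)                           # positional selection of those rows of X
rows.add_column('name', m.evaluate('name'))  # rows is ordered like m, so the columns line up
print(rows)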