Closed SohamTamba closed 1 year ago
Hey,
so in your join
operation, the right_on
argument is #
but what is that? You should put the column of the right
dataframe that you are joining on.
So following your example.. if I just do this:
m.join(
X, how='inner', left_on='X_idx', right_on='X_idx', allow_duplicatios=True, inplace=True
)
Things work as expected.
Hey @JovanVeljanoski i just had a quick look at the wiki and saw this
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python (versions 2 and 3). Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Facebook AI Research with help from external contributors.
Especially this part:
up to ones that possibly do not fit in RAM.
If you (or someone) can provide an example of this, maybe we can include a wrapper in vaex-ml
so you can use it easily with vaex. We did something similar with annoy
in the past. Just an idea at this point.
Hi @JovanVeljanoski, To clarify, my question is how do I accomplish the above result more efficiently (reduce runtime)?
If I were using Pandas
instead of VAEX, I would optimize the dataframe join using the default index
return m.join(
X,
how='inner',
right_index = True, left_on='X_idx'
)
here I am merging on the default index of X which is numbered 0 to len(X)-1.
If Pandas
did not have such a default index then I would create my own index to optimize the table join:
X['X_idx'] = np.arange(len(X))
X = X.set_index('X_idx') # Turn X['X_idx'] into an index to make join more efficient
return m.join(
X,
how='inner',
right_index = True, left_on='X_idx'
)
Reference: https://www.statology.org/pandas-merge-on-index/
#
refers to the default index column of VAEX.
print(X) returns the following if loaded as vaex.DataFrameLocal
_______________________________
# | X_idx | a | b
---|----|-----
0 | 0 | a_0 | b_0
1 | 1 | a_1 | b_1
2 | 2 | a_2 | b_2
.
.
.
.
.
.
.
--------------
I am making the assumption that the DataFrame X is indexed on #
so searching for values in the column #
is much more efficient than searching in X_idx
. Please let me know if that is not true.
Vaex has no concept of index as pandas does. Any column can be used as a join key, and there is no performance impact. I know of a trick to make join more memory efficient at the expense of speed. But for performance, this is the "out of the box" performance..
Got it: Indexing cannot be used to improve efficiency Thanks @JovanVeljanoski
Description
I have a huge DataFrame
X
and a medium DataFramem
.m
contains a columnX_idx
that corresponds to the row number ofX
I want to merge on. Note that X_idx can have duplicates.Example:
Output:
A simple way to do this is
but there should be a more efficient way to do this since I am joining on the index column.
I tried:
but I get the error
ValueError: "expected one expression, got 0"
Is your feature request related to a problem? Please describe.
I am trying to use FAISS for Nearest Neighbor Search. This is a common task in Semantic Search Docs: https://github.com/facebookresearch/faiss/wiki/Getting-started FAISS provides the integer index of the row of the nearest neighbor. I then need to retrieve the row from the index.
The embedding DataFrame is huge so I need to optimize runtime & memory usage.
Additional context
I'm new to VAEX.