pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.58k stars 17.9k forks source link

ENH: vlookup or similar #54005

Closed Cokesalokes closed 1 year ago

Cokesalokes commented 1 year ago

Feature Type

Problem Description

I need to join 2 or more DFs (DF1 left, and DF2 right), but I only need to pull some of the information from DF2 whose column (let's say col_3) matches a column in DF1 (let's say col_B).

What we have now is to use a join or merge DF_result= pd.merge(left=DF1, right=DF2, how='left', left_on='col_B', right_on='col_3')

But if there are more rows in DF2, that will make the number of rows in DF_result explode; whereas it should have the same number of rows as DF1.

This is the vlookup functionality from excel.

Feature Description

If I knew how to do this, I would.

Alternative Solutions

put everything into excel, and do a vlookup, but that only works if you have less than 200,000 rows, I am using CSVs with almost amillion....so excel keeps breaking.

Additional Context

Thanks for all your help, and sorry if this is already in the works. I know a lot of people have been looking for workarounds and such, but such a great feature for many.

rhshadrach commented 1 year ago

Thanks for the request. Can you provide a small example of input / output.

But if there are more rows in DF2, that will make the number of rows in DF_result explode; whereas it should have the same number of rows as DF1.

This isn't quite true - the result could be empty if there are no matches regardless of the number of rows in each DataFrame. So it's not quite clear what you're asking for. In particular, when there are multiple matches for one row in the left DataFrame (DF1), what should pandas do?

phofl commented 1 year ago

I would just drop duplicates beforehand if you only want to match with the first entry on the right side. You could also use isin to do a filter, we need a reproducible example to help here

Cokesalokes commented 1 year ago

I cannot seem to reproduce the issue on small scale, but I can show the in/out (obviously cannot share confidential data) Left_merge_issue

phofl commented 1 year ago

Yeah your merge column has duplicates, you should remove them if you don't want them. Closing, please ping to reopen if you can provide a reproducer and drop_duplicates is not sufficient

Cokesalokes commented 1 year ago

Yeah, apparently somewhere in lines 1,050 something.....Apologies

facepalm emogi