haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

[Feature proposal] Dataframe merge by ID #690

Closed adamsar closed 1 month ago

adamsar commented 2 years ago

I've got a few different dataframes that I'd like to merge when calculating a regression. Right now I do so by converting each one to a matrix of doubles, aligning the rows by id, and then rebuilding a dataframe. Spark and pandas both have utility methods that merge dataframes with a by option to specify which column is used to match the data.

Describe the solution you'd like

Extend the merge method with a simple by option to specify the key to merge on, add a mergeWith method, or accept a MergeOptions parameter that carries information such as by (the key to join on) and mergeType (inner vs. outer join, left vs. right join).

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
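To make the requested behavior concrete, here is a minimal plain-Java sketch of a join keyed by id. It does not use Smile's API: KeyedJoin, JoinType, and join are hypothetical names made up for illustration, each table is modeled as a map from id to a row of doubles, and ids are assumed to be unique within each table.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Smile: joins two small tables on an id key. */
public class KeyedJoin {
    /** Hypothetical join types, loosely mirroring the proposed mergeType option. */
    public enum JoinType { INNER, LEFT }

    /**
     * Joins rows keyed by id. Each map goes from id to that row's feature values.
     * Assumes ids are unique per table and every right row holds rightWidth values.
     */
    public static Map<String, double[]> join(Map<String, double[]> left,
                                             Map<String, double[]> right,
                                             int rightWidth,
                                             JoinType type) {
        Map<String, double[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : left.entrySet()) {
            double[] l = e.getValue();
            double[] r = right.get(e.getKey());
            if (r == null) {
                if (type == JoinType.INNER) continue; // INNER: drop unmatched left rows
                r = new double[rightWidth];
                Arrays.fill(r, Double.NaN);           // LEFT: pad with missing values
            }
            // Concatenate the left and right feature values into one row.
            double[] row = Arrays.copyOf(l, l.length + r.length);
            System.arraycopy(r, 0, row, l.length, r.length);
            out.put(e.getKey(), row);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, double[]> a = new LinkedHashMap<>();
        a.put("u1", new double[] {1.0});
        a.put("u2", new double[] {2.0});
        Map<String, double[]> b = new LinkedHashMap<>();
        b.put("u2", new double[] {20.0, 200.0});
        b.put("u3", new double[] {30.0, 300.0});

        System.out.println(join(a, b, 2, JoinType.INNER).keySet());
        System.out.println(join(a, b, 2, JoinType.LEFT).keySet());
    }
}
```

An inner join keeps only ids present in both tables, while a left join keeps every left row and fills the missing right-hand values with NaN. Right and full outer joins would follow the same pattern.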

haifengl commented 2 years ago

Are you interested in a join or a simple merge? With the existing API, you can merge two or more data frames, provided that their rows are in the same order.
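Because a positional merge only lines up correctly when both data frames list their rows in the same order, a small guard on the id columns can catch misalignment beforehand. The sketch below is plain Java rather than Smile code: RowOrderCheck and firstMismatch are hypothetical names, and the ids are assumed to have been extracted into lists already.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Smile: checks that two id columns line up row by row. */
public class RowOrderCheck {
    /**
     * Returns the index of the first row whose ids differ, or -1 if the two
     * columns are identical. A length mismatch is reported at the shorter length.
     */
    public static int firstMismatch(List<String> leftIds, List<String> rightIds) {
        int n = Math.min(leftIds.size(), rightIds.size());
        for (int i = 0; i < n; i++) {
            if (!leftIds.get(i).equals(rightIds.get(i))) return i;
        }
        return leftIds.size() == rightIds.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("u1", "u2", "u3");
        List<String> right = Arrays.asList("u1", "u3", "u2");
        System.out.println(firstMismatch(left, right));
    }
}
```

A result other than -1 means a row-order merge would pair values from different ids, so the rows need to be reordered or joined by key first.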

adamsar commented 2 years ago

More of a join. I've got a lot of dataframes, including some I receive from other departments, and it's sometimes painful to get these into a cohesive, single dataframe that contains the feature set I need.

As an edit: this functionality is exactly what I'd like: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

haifengl commented 3 months ago

We added smile.data.SQL for database management, which supports joins. The query/join result is returned as a DataFrame. See SQLTest for examples.