ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.28k stars 595 forks source link

bug: ambiguous column name when joining multiple tables #5953

Closed brentgunderson closed 1 year ago

brentgunderson commented 1 year ago

What happened?

Below example is just to reproduce an error I seem to get when using data that I can't share. Using python 3.10.4, Windows-10-10.0.19044-SP0

import ibis

prod = "my-production-project"
ibq = ibis.bigquery.connect(project_id=prod)

schedules = ibq.table("bigquery-public-data.baseball.schedules")
games_post_wide = ibq.table("bigquery-public-data.baseball.games_post_wide")
games_wide = ibq.table("bigquery-public-data.baseball.games_wide")

expression = (
    schedules.join(games_post_wide, schedules.gameId == games_post_wide.gameId)
    .join(games_wide, schedules.gameId == games_wide.gameId)
    .select(schedules.gameNumber)
    .distinct()
)

ibis.show_sql(expression)

df = expression.execute()

Error seems to be: “BadRequest: 400 Column name gameId is ambiguous at [71:8]”

Not sure if this is a bug, or if I'm just structuring the joins incorrectly.

What version of ibis are you using?

5.0.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

BadRequest: 400 Column name gameId is ambiguous at [71:8]

Code of Conduct

cpcloud commented 1 year ago

Hi @brentgunderson 👋🏻 !

This looks like a bug in the SQL we're generating.

For now, you can work around it by writing

.join(games_wide, "gameId")

instead of using the equality predicate.

We're cutting 5.1 next Tuesday, so a fix for this might make it in but it'll definitely be fixed in 6.0!

gforsyth commented 1 year ago

Hey @brentgunderson -- I believe this has been fixed by our recent work on join column renaming. Although the code as written above will no longer work, because it will result in column naming collisions (since our default is to rename duplicate columns to {name}_right).

This works:

[ins] In [1]: import ibis

[ins] In [2]: from ibis import _

[ins] In [3]: con = ibis.bigquery.connect()

[ins] In [4]: schedules = con.table("bigquery-public-data.baseball.schedules")

[ins] In [5]: games_wide = con.table("bigquery-public-data.baseball.games_wide")

[ins] In [6]: games_post_wide = con.table("bigquery-public-data.baseball.games_post_wide")

[ins] In [7]: expr = (
         ...:     schedules.join(
         ...:         games_post_wide, schedules.gameId == games_post_wide.gameId, rname="{name}_gpw"
         ...:     )
         ...:     .join(games_wide, _.gameId == games_wide.gameId, rname="{name}_gw")
         ...:     .select(_.gameNumber)
         ...:     .distinct()
         ...: )

[ins] In [8]: expr.execute(limit=10)
Out[8]: 
Empty DataFrame
Columns: [gameNumber]
Index: []

Note the result is empty because there is no overlap in gameId between games_wide and games_post_wide

Without passing arguments, you'll get an integrity error due to the name collisions:

[ins] In [10]: schedules.join(games_post_wide, schedules.gameId == games_post_wide.gameId).join(
          ...:     games_wide, _.gameId == games_wide.gameId
          ...: )
---------------------------------------------------------------------------
IntegrityError                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 schedules.join(games_post_wide, schedules.gameId == games_post_wide.gameId).join(
      2     games_wide, _.gameId == games_wide.gameId
      3 )

File ~/github.com/ibis-project/ibis/ibis/expr/types/relations.py:2480, in Table.join(left, right, predicates, how, lname, rname)
   2477 if how in ("semi", "anti"):
   2478     return expr
-> 2480 return ops.relations._dedup_join_columns(expr, lname=lname, rname=rname)

File ~/github.com/ibis-project/ibis/ibis/expr/operations/relations.py:627, in _dedup_join_columns(expr, lname, rname)
    618 # Rename columns in the right table that overlap, dropping any columns that
    619 # are known to be equal to those in the left table
    620 right_projections = [
    621     right[column].name(rname.format(name=column) if rname else column)
    622     if column in overlap
   (...)
    625     if column not in equal
    626 ]
--> 627 return expr.select(left_projections + right_projections)

File ~/github.com/ibis-project/ibis/ibis/expr/types/relations.py:1735, in Table.select(self, *exprs, **named_exprs)
   1730 if not exprs:
   1731     raise com.IbisTypeError(
   1732         "You must select at least one column for a valid projection"
   1733     )
-> 1735 op = an.Projector(self, exprs).get_result()
   1737 return op.to_expr()

File ~/github.com/ibis-project/ibis/ibis/expr/analysis.py:542, in Projector.get_result(self)
    539     if fused_op is not None:
    540         return fused_op
--> 542 return ops.Selection(self.parent, self.clean_exprs)

File ~/github.com/ibis-project/ibis/ibis/common/grounds.py:25, in BaseMeta.__call__(cls, *args, **kwargs)
     24 def __call__(cls, *args, **kwargs) -> Base:
---> 25     return cls.__create__(*args, **kwargs)

File ~/github.com/ibis-project/ibis/ibis/common/grounds.py:96, in Annotable.__create__(cls, *args, **kwargs)
     92 @classmethod
     93 def __create__(cls, *args, **kwargs) -> Annotable:
     94     # construct the instance by passing the validated keyword arguments
     95     kwargs = cls.__signature__.validate(*args, **kwargs)
---> 96     return super().__create__(**kwargs)

File ~/github.com/ibis-project/ibis/ibis/expr/operations/relations.py:385, in Selection.__init__(self, table, selections, predicates, sort_keys, **kwargs)
    382     elif not shares_some_roots(predicate, table):
    383         raise com.RelationError("Predicate doesn't share any roots with table")
--> 385 super().__init__(
    386     table=table,
    387     selections=selections,
    388     predicates=predicates,
    389     sort_keys=sort_keys,
    390     **kwargs,
    391 )

File ~/github.com/ibis-project/ibis/ibis/common/grounds.py:112, in Annotable.__init__(self, **kwargs)
    110 for name, field in self.__attributes__.items():
    111     if isinstance(field, Attribute):
--> 112         if (value := field.initialize(self)) is not EMPTY:
    113             object.__setattr__(self, name, value)

File ~/github.com/ibis-project/ibis/ibis/common/annotations.py:85, in Attribute.initialize(self, this)
     83     return EMPTY
     84 elif callable(self._default):
---> 85     value = self._default(this)
     86 else:
     87     value = self._default

File ~/github.com/ibis-project/ibis/ibis/expr/operations/relations.py:359, in Projection.schema(self)
    356         names.extend(schema.names)
    357         types.extend(schema.types)
--> 359 return sch.schema(names, types)

File /nix/store/cw2xxyh09alhbxywlvvibkhwsl6439ix-python3-3.10.11-env/lib/python3.10/site-packages/multipledispatch/dispatcher.py:278, in Dispatcher.__call__(self, *args, **kwargs)
    276     self._cache[types] = func
    277 try:
--> 278     return func(*args, **kwargs)
    280 except MDNotImplementedError:
    281     funcs = self.dispatch_iter(*types)

File ~/github.com/ibis-project/ibis/ibis/expr/schema.py:319, in schema_from_names_types(names, types)
    317     for v in name_locs:
    318         duplicate_names.remove(v)
--> 319     raise IntegrityError(f'Duplicate column name(s): {duplicate_names}')
    321 # construct the schema
    322 fields = dict(zip(names, types))

IntegrityError: Duplicate column name(s): ['seasonId_right', 'year_right', 'startTime_right', 'attendance_right', 'dayNight_right', 'duration_right', 'awayTeamId_right', 'awayTeamName_right', 'homeTeamId_right', 'homeTeamName_right', 'status_right']
brentgunderson commented 1 year ago

Thanks for the note @gforsyth, appreciate it! I'll give that a shot