tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
474 stars 173 forks source link

The `unknown` column appears after I join the 2 tables #1448

Closed luna02100 closed 7 months ago

luna02100 commented 7 months ago

I am using flights and weather data frames from the nycflights13 and I store them in my SQL database (sample_con). Where does this row_names came from?

> flights_query <- tbl(sample_con, "flights") %>%
  left_join(tbl(sample_con, "weather"))
> flights_query %>% show_query()
<SQL>
SELECT
  `flights`.*,
  `row_names`,
  `temp`,
  `dewp`,
  `humid`,
  `wind_dir`,
  `wind_speed`,
  `wind_gust`,
  `precip`,
  `pressure`,
  `visib`
FROM `flights`
LEFT JOIN `weather`
  ON (
    `flights`.`year` = `weather`.`year` AND
    `flights`.`month` = `weather`.`month` AND
    `flights`.`day` = `weather`.`day` AND
    `flights`.`origin` = `weather`.`origin` AND
    `flights`.`hour` = `weather`.`hour` AND
    `flights`.`time_hour` = `weather`.`time_hour`
  )
mgirlich commented 7 months ago

Could you please create a reprex for this? See https://dbplyr.tidyverse.org/articles/reprex.html for more information how to do this. And could you check whether this still happens with dbplyr 2.4.0 or even better if you check with the dev version: devtools::install_github("tidyverse/dbplyr"). Thanks 😄

luna02100 commented 7 months ago

Alright.

library(dplyr)
library(dbplyr)

flights_lazy <- lazy_frame(nycflights13::flights,con = simulate_odbc())
weather_lazy <- lazy_frame(nycflights13::weather,con = simulate_odbc())

After I uninstall dbplyr and reinstall it via devtools::install_github("tidyverse/dbplyr"):

flights_query_lazy <- flights_lazy %>%
    left_join(weather_lazy)
flights_query_lazy %>% show_query()

Output:

<SQL>
SELECT
  `df_LHS`.*,
  `temp`,
  `dewp`,
  `humid`,
  `wind_dir`,
  `wind_speed`,
  `wind_gust`,
  `precip`,
  `pressure`,
  `visib`
FROM `df` AS `df_LHS`
LEFT JOIN `df` AS `df_RHS`
  ON (
    `df_LHS`.`year` = `df_RHS`.`year` AND
    `df_LHS`.`month` = `df_RHS`.`month` AND
    `df_LHS`.`day` = `df_RHS`.`day` AND
    `df_LHS`.`origin` = `df_RHS`.`origin` AND
    `df_LHS`.`hour` = `df_RHS`.`hour` AND
    `df_LHS`.`time_hour` = `df_RHS`.`time_hour`
  )

And could you check whether this still happens with dbplyr 2.4.0 or even better if you check with the dev version: devtools::install_github("tidyverse/dbplyr"). Thanks 😄

flights_query <- tbl(sample_con, "flights") %>%
   left_join(tbl(sample_con, "weather"))

flights_query %>% show_query()
SELECT
  `flights`.*,
  `row_names`,
  `temp`,
  `dewp`,
  `humid`,
  `wind_dir`,
  `wind_speed`,
  `wind_gust`,
  `precip`,
  `pressure`,
  `visib`
FROM `flights`
LEFT JOIN `weather`
  ON (
    `flights`.`year` = `weather`.`year` AND
    `flights`.`month` = `weather`.`month` AND
    `flights`.`day` = `weather`.`day` AND
    `flights`.`origin` = `weather`.`origin` AND
    `flights`.`hour` = `weather`.`hour` AND
    `flights`.`time_hour` = `weather`.`time_hour`
  )
mgirlich commented 7 months ago

As you can see there is no row_names column. The second part is not a reprex as an important part is missing: how was the data written to the database. I am pretty sure you unintentionally created a row_names column when writing the data to the database. You could check that e.g. with DBI::dbListFields(sample_con, "weather").

luna02100 commented 7 months ago

Yeah, I see it now. After I copy the weather dataset from nycflights13 to my SQL database connection, the row_names column was spontaneously created. So I thought it might be a bug because when I overwrite the weather data frame, the row_names column was now clear.