mc2-center / csbc-pson-dcc

Data coordination resources for the NCI CSBC and PS-ON consortia
1 stars 4 forks source link

Rows in merged project table don't match any project/project description rows #26

Closed andrewelamb closed 4 years ago

andrewelamb commented 4 years ago

The merged project table doesn't have project ids so to match them the following code is needed:

merged_project_tbl <- "SELECT * FROM syn21868602" %>% 
    synapser::synTableQuery(includeRowIdAndRowVersion = F) %>% 
    as.data.frame() %>% 
    dplyr::as_tibble()

project_tbl <- "SELECT * FROM syn21645147" %>% 
    synapser::synTableQuery() %>% 
    as.data.frame() %>% 
    dplyr::as_tibble()

description_project_tbl <- "SELECT * FROM syn21868407" %>% 
    synapser::synTableQuery() %>% 
    as.data.frame() %>% 
    dplyr::as_tibble()

project_id_tbl <-
    dplyr::inner_join(
        project_tbl, 
        description_project_tbl, 
        by = c("id" = "projectId")
    ) %>% 
    dplyr::select(.data$id, .data$displayName, .data$description)

merged_project_tbl2 <- merged_project_tbl %>% 
    dplyr::left_join(
        project_id_tbl, 
        by = c("projectName" = "displayName", "description")
    )

new_project_tbl <- merged_project_tbl2 %>% 
    dplyr::filter(is.na(id))

This results in 69 rows

jaeddy commented 4 years ago

Thanks, Andrew. I have a break in meetings and will work on this now.

jaeddy commented 4 years ago

Thanks for the demo code. I like the idea of using description to add uniqueness to the join — I ended up switching to grantName, which resulted in the correct (129) number of matches. This also ticks one of the boxes in #25. 🙂

Here's the code, if you're curious (I reused some pieces from portal_etl.Rmd, mostly because it was more familiar and easier to troubleshoot):

db_project_df <- get_table_df("syn21645147", cache = FALSE)
db_description_project <- get_table_df("syn21868407", cache = FALSE)
merged_project_tbl <- get_table_df("syn21868602", cache = FALSE)

merged_project_tmp <- db_project_df %>% 
  rename(projectId = id, projectName = name) %>%
  left_join(
    db_description_project, by = "projectId"
  ) %>% 
  left_join(
    db_grant_df %>% 
      select(grantId = id, grantName = name, grantType, consortiumId), 
    by = "grantId"
  ) %>% 
  left_join(db_theme_grant, by = "grantId") %>% 
  left_join(
    db_theme_df %>% 
      select(themeId = id, theme = displayName),
    by = "themeId"
  ) %>% 
  left_join(
    db_consortium_df %>% 
      select(consortiumId = id, consortium = displayName)
  ) %>% 
  select(projectId, projectName = displayName, grantName, projectType, 
         themeId, grantId, consortiumId) %>%
  distinct()

merged_formatted_project_tmp <- merged_project_tmp %>%
  filter(!is.na(grantId)) %>%
  group_by(projectId, projectName, projectType) %>%
  summarize(grantId = str_c(unique(grantId), collapse = ", "),
            grantName = str_c(unique(grantName), collapse = ", "),
            themeId = str_c(unique(themeId), collapse = ", "),
            consortiumId = str_c(unique(consortiumId), collapse = ", ")) %>%
  ungroup() %>%
  mutate(projectName = str_trim(projectName)) %>% 
  distinct()

merged_project_update <- merged_project_tbl %>% 
  mutate(projectName = str_trim(projectName),
         description = str_trim(description)) %>% 
  left_join(merged_formatted_project_tmp,
            by = c( "projectName", "grantName")) %>% 
  select(projectId, projectName, projectType, description,
         grantId, grantName, grantType,
         themeId, theme, consortiumId, consortium) %>% 
  distinct()

merged_project_syntable <- update_table("syn21868602", merged_project_update)