Update data_merge_module to handle a dataset without keys

kpagacz commented 3 years ago

Code to reproduce:

# Example with non-clinical data
app <- init(
  data = teal_data(dataset("iris", iris)),
  modules = list(
    tm_g_distribution(
      dist_var = data_extract_spec(
        dataname = "iris",
        select = select_spec(variable_choices("iris"), "Petal.Length")
      ),
      strata_var = data_extract_spec(
        dataname = "iris",
        filter = filter_spec(vars = choices_selected(variable_choices("iris"), "Species"), multiple = T)
      )
    )
  )
)
## Not run:
shinyApp(app$ui, app$server)

The error:

Expected behaviour: the Species variable is passed to the module and no error is retured.

nikolas-burkoff commented 3 years ago

So the issue seems to be your dataset doesn't have a key for the merge - this works:

library(teal.modules.general)

iris_with_keys <- iris
iris_with_keys$id <- seq_len(nrow(iris))

app <- init(
  data = teal_data(dataset("iris", iris_with_keys, keys = "id")),
  modules = list(
    tm_g_distribution(
      dist_var = data_extract_spec(
        dataname = "iris",
        select = select_spec(variable_choices("iris"), "Petal.Length")
      ),
      strata_var = data_extract_spec(
        dataname = "iris",
        filter = filter_spec(vars = choices_selected(variable_choices("iris"), "Species"), multiple = T)
      )
    )
  )
)

shinyApp(app$ui, app$server)

A couple of thoughts @pawelru , @kpagacz

Why do we need a merge here if there's one dataset (similar question for outliers module as well)?
For a teal rather than cdisc dataset - if no key is given, should we add an ID column (_teal_id) and set it as a primary key so modules like this work?

kpagacz commented 3 years ago

Regarding your points - I had an issue or a comment with these exact thoughts. I cannot find it right now, but I threw an idea that we might want to add a dummy key column for a single dataset if no default is provided because the description of the key parameter doesn't lead to believe that it is needed for a single dataset.

pawelru commented 3 years ago

Thanks guys for the ticket. I agree that in general the app should handle such case without the need of specifying the keys but I can't accept a solution in which user needs to create artificial column in order to perform merge to self. This should be "invisibly" handled my merge module so that there is no new column in the output. We can temporarily create it but we need to remove it afterwards. Anyway, that seems to be a new feature thus moving to the main backlog

Polkas commented 3 years ago

Possibly needed a transfer to other repo and title update.

kpagacz commented 2 years ago

I agree with @polkas - update title, transfer - are you working on something else, @nikolas-burkoff ?

nikolas-burkoff commented 2 years ago

Nope - I've unassigned

kpagacz commented 2 years ago

@mhallal1 The same question as to Nik - what are your plans? And if you want to unassign later, would you mind renaming this issue and putting it in the backlog?

mhallal1 commented 2 years ago

@kpagacz the title and repo have been updated. I am assessing the range of this issue as it touches the main concept of merging datasets that we follow now.

mhallal1 commented 2 years ago

So the basic way that we handle an example with one non-cdisc dataset with keys (Nik's example above) is as follows:

ANL_1 <- iris_FILTERED %>% dplyr::select(id, Petal.Length)
ANL_2 <- iris_FILTERED %>% dplyr::filter(Species == "setosa") %>% dplyr::select(id, Species)
ANL <- ANL_1
ANL <- dplyr::full_join(ANL, ANL_2, by = c("id"))

With one single datasets iris, we create 2 ANLs and then combine them with dplyr::full_join. This is why we need keys even with a single dataset. CDISC datasets of course do not have this issue since they all have keys so the scenario here is when we have one or more non-cdisc datasets with no keys (for example iris, mtcars) and we don't want to merge them together. Here, two possible suggestions can be stated: 1) Add a dummy id column to the dataset with no key (possible already at the level of data_merge_srv) and hence continue the normal process of merging as done above. Given that this is feasible, the added dummy column will be present in the output in SRC as the user will need it to reproduce the example. I am not sure here if this is a healthy step to manipulate the data with an extra column compared to the input. 2) Refactor data_merge_module in such a way to handle cdisc and non-cdisc datasets with and without keys as the current data_merge_module was originally designed to handle cdisc data. Imagine the example below:

# teal app with 3 data extracts
data_extract_1 : dataset = ADSL, selected = SEX
data_extract_2 : dataset = ADAE, selected = RACE, ADAE_var1
data_extract_3: dataset = ADSL, selected = SEX, BMRKR1

#step 1: unify filters per dataset and not per data_extract(actual case)
unified_filter_ADSL_dataset: selected = SEX, BMRKR1 + keys/join_keys
unified_filter_ADAE_dataset: selected = RACE, ADAE_var1 + keys/join_keys

#step 2: merge the unified_filters datasets
merge: unified_filter_ADSL_dataset, unified_filter_ADAE_dataset

compared to the actual case how we do it:

# teal app with 3 data extracts
data_extract_1 : dataset = ADSL, selected = SEX
data_extract_2 : dataset = ADAE, selected = RACE, ADAE_var1
data_extract_3: dataset = ADSL, selected = SEX, BMRKR1

#step 1: filter per data_extract
filter1_ADSL_dataset: selected = SEX + keys/join_keys
filter2_ADAE_dataset: selected = RACE, ADAE_var1 + keys/join_keys
filter3_ADSL_dataset: selected = SEX, BMRKR1 + keys/join_keys

# step 2 merge in order:
unified_dataset1: filter1_ADSL_dataset + filter2_ADAE_dataset 
unified_dataset2: unified_dataset1_dataset +filter3_ADSL_dataset

This has the advantage of saving the consequent merging steps of unified_dataset. In the case of a single non-cdisc dataset with no keys, the filtered_dataset will be created in step 1 and thus we will not have 2 datasets to merge in step 2 but only 1 where optimally we won't need to go into the merging step at all as 1 dataframe is only needed. This suggestion has been stated in discussion with @nikolas-burkoff (who has more extreme suggestions if needed).

The separation of filtering and merging in such a way is an important step to improve the data_merge_module and make it less coupled as it is right now. As teal is not used much right now for non-cdisc data and thinking how to improve data_merge_module globally, I would consider a more serious refactor, given that I could not find a simple work-around so far. A discussion in the @insightsengineering/nest-core-dev team would be necessary to decide on the next steps, until then I will park this issue and come back to it with a fresher mind.

nikolas-burkoff commented 2 years ago

The way it is done now is bonkers - if you have a large dataset creating two or more versions of it to select some columns to then rejoin is so crazily inefficient...

I would consider a more serious refactor

One thing that changes between the two ways of doing things described above is the order of the joins - at the moment we order them by data extract order (which is obviously wrong - the order and type of join depends on the data model and what you want to get out of it not the order of the UI components - hence the complexity in the cross table letting you pick your join) and the other approach orders them in an even more complicated way depending on which extracts have which dataset (again wrong).

As with a lot of things in early NEST the requirement "now make it work with general relational data" was not properly specc-ed out and so what was implemented was the minimal required to say it was done - but what's there really only works with CDISC-like data:

Selecting columns from independent data sets without keys (as described above)
The hidden behaviour of joins - as a user of the app it's almost impossible to know what ANL is and to make sure it is what you want it to be - cf the outlier module the type and order of the join makes a difference as to what is an outlier but it's hidden from the users
The independence of the filter panel - for CDISC it works, filtering on ADSL cascades changes to other datasets (and other datasets can't filter on ADSL variables I think) but for general data it just doesn't each dataset is always treated independently which leads to really unintuitive behaviour - try it and see
In CDISC esp tmc it's generally straightforward what you want "join ADSL with your analysis dataset to get 'n'" and although we allow merging in tmg - the sample app is basically using the merge behaviour purely to select from a dataset so we barely use the merging properly at all

If it is a genuine requirement @kumamiao that NEST works for general relational data then we should have a proper think about what this actually means and re-design the codebase appropriately - see more general data visualization applications) but if in reality that's not actually needed then great, we don't need to consider it.

For example: you could argue it is not the responsibility of the module to perform the merge (the select/filter yes but not the merge) - and each module begins with a select your dataset and then all the extracts are just selectors/filters to generate the results you want. There would then be a separate "create ANL" part of the app which would allow you to merge datasets as you wish to create an ANL - in independent data this would not exist, in CDISC this would be straightforward and maybe even be done for you in a standard way [which you can as it is living with the data, not the module], and for complete relational data it could be much more complicated.

This again feeds into my belief that the modules should not contain logic which belongs to the data model (whether that's specifying columns for data extracts in tmc where the column is known if the data model is specified or handling generic merges of arbitrary data models)

But again, if general data is not important, then lets narrow our focus down to what it is...

kpagacz commented 2 years ago

Whatever users need - general or just cdisc - some auto-magical joins which are undocumented and virtually impossible to predict by an application user (they actually don't know what's happening and have no way of knowing except experimentation) are just a bad idea overall. Joins "on the fly" should be done manually by users, so they know what they are doing and can explain the result at the moment of performing the analyses instead of parsing the R code to understand what they actually did.

But I can imagine that experienced users could be fine with the current behaviour because they are used to it and burdening them with additional clicks can spur frustration.

gogonzo commented 2 years ago

@mhallal1 @kpagacz @nikolas-burkoff

Added blocked as data merge needs discussion, re-design and plan (issues) for the future. I will create a research issue soon

gogonzo commented 2 years ago

Duplicated by https://github.com/insightsengineering/teal.transform/issues/14

insightsengineering / teal.transform

Update data_merge_module to handle a dataset without keys #11