Closed kpagacz closed 2 years ago
So the issue seems to be that your dataset doesn't have a key for the merge - this works:
library(teal.modules.general)

iris_with_keys <- iris
iris_with_keys$id <- seq_len(nrow(iris))

app <- init(
  data = teal_data(dataset("iris", iris_with_keys, keys = "id")),
  modules = list(
    tm_g_distribution(
      dist_var = data_extract_spec(
        dataname = "iris",
        select = select_spec(variable_choices("iris"), "Petal.Length")
      ),
      strata_var = data_extract_spec(
        dataname = "iris",
        filter = filter_spec(
          vars = choices_selected(variable_choices("iris"), "Species"),
          multiple = TRUE
        )
      )
    )
  )
)

shinyApp(app$ui, app$server)
A couple of thoughts @pawelru, @kpagacz - could we generate a dummy column (e.g. `_teal_id`) and set it as a primary key so modules like this work?
Regarding your points - I had an issue or a comment with these exact thoughts. I cannot find it right now, but I floated the idea that we might want to add a dummy key column for a single dataset if no default is provided, because the description of the keys parameter does not suggest that it is needed for a single dataset.
Thanks guys for the ticket. I agree that in general the app should handle such a case without the need to specify the keys, but I can't accept a solution in which the user needs to create an artificial column in order to perform a merge to self. This should be handled "invisibly" by the merge module so that there is no new column in the output. We can temporarily create it, but we need to remove it afterwards. Anyway, that seems to be a new feature, thus moving to the main backlog.
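A minimal sketch of the "invisible" handling described above - the merge module could create a temporary key, perform the self-merge, and drop the key again so the output carries no extra column. The column name `.teal_tmp_id` is made up for illustration; this is not actual teal internals:

```r
library(dplyr)

# Add a temporary row-id key (name assumed), self-merge, then drop the key
# so the merged output has no column that was not in the input.
iris_keyed <- iris %>% mutate(.teal_tmp_id = row_number())

anl_1 <- iris_keyed %>% select(.teal_tmp_id, Petal.Length)
anl_2 <- iris_keyed %>%
  filter(Species == "setosa") %>%
  select(.teal_tmp_id, Species)

anl <- full_join(anl_1, anl_2, by = ".teal_tmp_id") %>%
  select(-.teal_tmp_id)
```

The full join preserves all 150 iris rows, with `Species` filled only for the filtered rows, and the temporary key never reaches the user-visible output.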
This possibly needs a transfer to another repo and a title update.
I agree with @polkas - update title, transfer - are you working on something else, @nikolas-burkoff ?
Nope - I've unassigned
@mhallal1 The same question as to Nik - what are your plans? And if you want to unassign later, would you mind renaming this issue and putting it in the backlog?
@kpagacz the title and repo have been updated. I am assessing the scope of this issue as it touches the main concept of merging datasets that we follow now.
So the basic way that we handle an example with one non-CDISC dataset with keys (Nik's example above) is as follows:
ANL_1 <- iris_FILTERED %>% dplyr::select(id, Petal.Length)
ANL_2 <- iris_FILTERED %>% dplyr::filter(Species == "setosa") %>% dplyr::select(id, Species)
ANL <- ANL_1
ANL <- dplyr::full_join(ANL, ANL_2, by = c("id"))
With one single dataset iris, we create 2 ANLs and then combine them with dplyr::full_join. This is why we need keys even with a single dataset. CDISC datasets of course do not have this issue since they all have keys, so the scenario here is when we have one or more non-CDISC datasets with no keys (for example iris, mtcars) and we don't want to merge them together.
Here, two possible approaches can be suggested:
1) Add a dummy id column to the dataset with no key (possibly already at the level of data_merge_srv) and then continue the normal merging process as done above. Assuming this is feasible, the added dummy column will be present in the output in SRC, as the user will need it to reproduce the example. I am not sure it is a healthy step to manipulate the data with an extra column compared to the input.
2) Refactor data_merge_module in such a way that it handles CDISC and non-CDISC datasets with and without keys, as the current data_merge_module was originally designed to handle CDISC data.
Imagine the example below:
# teal app with 3 data extracts
data_extract_1: dataset = ADSL, selected = SEX
data_extract_2: dataset = ADAE, selected = RACE, ADAE_var1
data_extract_3: dataset = ADSL, selected = SEX, BMRKR1

# step 1: unify filters per dataset and not per data_extract (the current behaviour)
unified_filter_ADSL_dataset: selected = SEX, BMRKR1 + keys/join_keys
unified_filter_ADAE_dataset: selected = RACE, ADAE_var1 + keys/join_keys

# step 2: merge the unified_filter datasets
merge: unified_filter_ADSL_dataset, unified_filter_ADAE_dataset
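The proposed per-dataset unification could be sketched roughly like this, on toy data. USUBJID is assumed as the join key, and the variable names here are illustrative, not actual data_merge_module code:

```r
library(dplyr)

# Toy CDISC-like data; USUBJID assumed as the ADSL/ADAE join key.
adsl <- data.frame(
  USUBJID = c("S1", "S2"),
  SEX = c("F", "M"),
  BMRKR1 = c(1.2, 3.4)
)
adae <- data.frame(
  USUBJID = c("S1", "S1", "S2"),
  RACE = c("A", "A", "B"),
  ADAE_var1 = 1:3
)

# Step 1: unify the selected columns per dataset, not per data extract.
adsl_selected <- unique(c("SEX", "SEX", "BMRKR1"))  # extracts 1 and 3 combined
adae_selected <- c("RACE", "ADAE_var1")             # extract 2

unified_adsl <- adsl %>% select(all_of(c("USUBJID", adsl_selected)))
unified_adae <- adae %>% select(all_of(c("USUBJID", adae_selected)))

# Step 2: a single merge between the two unified datasets.
anl <- full_join(unified_adsl, unified_adae, by = "USUBJID")
```

With three extracts over two datasets, this performs one join instead of two, and the join order no longer depends on the order of the UI components.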
compared to how we actually do it now:
# teal app with 3 data extracts
data_extract_1: dataset = ADSL, selected = SEX
data_extract_2: dataset = ADAE, selected = RACE, ADAE_var1
data_extract_3: dataset = ADSL, selected = SEX, BMRKR1

# step 1: filter per data_extract
filter1_ADSL_dataset: selected = SEX + keys/join_keys
filter2_ADAE_dataset: selected = RACE, ADAE_var1 + keys/join_keys
filter3_ADSL_dataset: selected = SEX, BMRKR1 + keys/join_keys

# step 2: merge in order
unified_dataset1: filter1_ADSL_dataset + filter2_ADAE_dataset
unified_dataset2: unified_dataset1 + filter3_ADSL_dataset
The first approach has the advantage of saving the consecutive merging steps through unified_dataset.
In the case of a single non-CDISC dataset with no keys, the filtered dataset will be created in step 1, so we will not have 2 datasets to merge in step 2 but only 1; optimally we would not need to enter the merging step at all, as only 1 data frame is needed. This suggestion came up in discussion with @nikolas-burkoff (who has more extreme suggestions if needed).
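For that single-dataset case, step 1 alone could already produce the ANL, with no key and no join ever required. A hedged sketch (the extract representation and names here are assumptions, not teal's actual structures):

```r
library(dplyr)

# Hypothetical extract descriptions: both target the same dataset.
extracts <- list(
  list(dataname = "iris", selected = "Petal.Length"),
  list(dataname = "iris", selected = "Species")
)

datanames <- unique(vapply(extracts, function(x) x$dataname, character(1)))

if (length(datanames) == 1) {
  # Single dataset: unify the selections and skip the merge step entirely.
  cols <- unique(unlist(lapply(extracts, function(x) x$selected)))
  anl <- iris %>% select(all_of(cols))
}
```

Because no full_join is performed, no dummy key column is ever needed, and the reproducible code in SRC stays a plain select.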
The separation of filtering and merging in such a way is an important step to improve data_merge_module and make it less coupled than it is right now. As teal is not used much right now for non-CDISC data, and thinking about how to improve data_merge_module globally, I would consider a more serious refactor, given that I could not find a simple work-around so far.
A discussion with the @insightsengineering/nest-core-dev team would be necessary to decide on the next steps; until then I will park this issue and come back to it with a fresher mind.
Related issues: https://github.com/insightsengineering/teal.devel/issues/108
The way it is done now is bonkers - if you have a large dataset creating two or more versions of it to select some columns to then rejoin is so crazily inefficient...
> I would consider a more serious refactor
One thing that changes between the two approaches described above is the order of the joins. At the moment we order them by data extract order, which is obviously wrong: the order and type of join depend on the data model and what you want to get out of it, not on the order of the UI components (hence the complexity in the cross table letting you pick your join). The other approach orders them in an even more complicated way depending on which extracts have which dataset (again wrong).
As with a lot of things in early NEST, the requirement "now make it work with general relational data" was not properly spec'd out, and so what was implemented was the minimum required to say it was done - but what's there really only works with CDISC-like data.
If it is a genuine requirement @kumamiao that NEST works for general relational data, then we should have a proper think about what this actually means and re-design the codebase appropriately (see more general data visualization applications); but if in reality that's not actually needed, then great, we don't need to consider it.
For example: you could argue it is not the responsibility of the module to perform the merge (the select/filter yes but not the merge) - and each module begins with a select your dataset and then all the extracts are just selectors/filters to generate the results you want. There would then be a separate "create ANL" part of the app which would allow you to merge datasets as you wish to create an ANL - in independent data this would not exist, in CDISC this would be straightforward and maybe even be done for you in a standard way [which you can as it is living with the data, not the module], and for complete relational data it could be much more complicated.
This again feeds into my belief that the modules should not contain logic which belongs to the data model (whether that's specifying columns for data extracts in tmc, where the column is known if the data model is specified, or handling generic merges of arbitrary data models).
But again, if general data is not important, then let's narrow our focus down to what it is...
Whatever users need - general or just cdisc - some auto-magical joins which are undocumented and virtually impossible to predict by an application user (they actually don't know what's happening and have no way of knowing except experimentation) are just a bad idea overall. Joins "on the fly" should be done manually by users, so they know what they are doing and can explain the result at the moment of performing the analyses instead of parsing the R code to understand what they actually did.
But I can imagine that experienced users could be fine with the current behaviour because they are used to it and burdening them with additional clicks can spur frustration.
@mhallal1 @kpagacz @nikolas-burkoff
Added blocked, as data merge needs discussion, re-design and a plan (issues) for the future. I will create a research issue soon.
Code to reproduce:
The error:
Expected behaviour: the Species variable is passed to the module and no error is returned.