[Question]: Does `default_cdisc_join_keys` contain an exhaustive list of CDISC datasets?

vedhav commented 10 months ago

What is your question?

While testing default_cdisc_join_keys against the scda datasets, I could not find join keys for c("ADAB", "ADPC", "ADPP", "ADTR") in default_cdisc_join_keys, although these datasets are present in scda. Conversely, default_cdisc_join_keys has join keys for c("ADSAFTTE", "ADCSSRS", "ADEQ5D5L"), which are missing from the scda datasets.

I thought that the CDISC standards define an exhaustive list of datasets (at least for a given version of SDTM). Do we need to extend default_cdisc_join_keys to include the scda datasets it is missing? Perhaps we should also add all the datasets available in default_cdisc_join_keys to scda.

library(teal.data)
# Dataset names with default join keys, and dataset names available in scda
join_key_datasets <- default_cdisc_join_keys |> names()
latest_scda_names <- scda::synthetic_cdisc_data("latest") |> names() |> toupper()

setdiff(latest_scda_names, join_key_datasets)
# [1] "ADAB" "ADPC" "ADPP" "ADTR"
setdiff(join_key_datasets, latest_scda_names)
# [1] "ADSAFTTE" "ADCSSRS"  "ADEQ5D5L"


donyunardi commented 10 months ago

[Question]: Does the default_cdisc_join_keys contain exhaustive list of CDISC datasets?

No, it doesn't, and I don't think we should maintain an exhaustive list.

Analysis datasets are named using the ADXXXX convention, where the XXXX portion is sponsor-defined and depends on the product. As CDISC continues to evolve, it would be too laborious to keep up with every new name.

At the very least, we should cover the common ones, and at a quick glance I think we have already done this: https://github.com/insightsengineering/teal.data/blob/main/inst/cdisc_datasets/cdisc_datasets.yaml

@lcd2yyz Can I get your opinion on this?

lcd2yyz commented 10 months ago

@donyunardi Great explanation! I can confirm it's correct.

I actually feel we should maybe remove some from the list, because they are sponsor-defined dataset names, as opposed to the common dataset names outlined in the CDISC standards or ADaM implementation guides. For example, ADAETTE, ADQLQC, ADCSSRS, ADEQ5D5L.

@khatril @shajoezhu @crazycatandy @telepath37 Can I get your opinion on the suggestion to drop these sponsor-defined datasets?

telepath37 commented 10 months ago

I agree - we should just keep the very common ADaM datasets in our list ("defaults") and allow users to define keys on their ADXXXX datasets if they want.
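As a rough illustration of what "define keys on their ADXXXX datasets" could look like, here is a minimal sketch. It assumes a recent teal.data release that exports join_keys() and join_key() and supports subsetting and c() on join_keys objects. The key columns chosen for ADAB (STUDYID, USUBJID) are an assumption for illustration only, not something confirmed in this thread.

```r
library(teal.data)

# Start from the package defaults for a few common ADaM datasets.
common_keys <- default_cdisc_join_keys[c("ADSL", "ADAE")]

# Sponsor-defined dataset: ADAB is ASSUMED here to relate to ADSL through
# STUDYID and USUBJID. Adjust the key columns to match your own data.
custom_keys <- join_keys(
  join_key("ADSL", "ADAB", keys = c("STUDYID", "USUBJID"))
)

# Combine the defaults with the user-defined relationship.
my_keys <- c(common_keys, custom_keys)
names(my_keys)
```

With this pattern the defaults can stay small, and anything sponsor-specific lives in the user's own code.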

shajoezhu commented 10 months ago

Thanks @lcd2yyz and @donyunardi

I also agree that we should trim this list and keep it to a minimum. The standards and implementations change all the time; if the list imposes too many restriction checks, it becomes less user-friendly.

khatril commented 10 months ago

Thanks for the discussion and for putting it on our radar. I also agree that we should trim these back to the common datasets only and give users the flexibility to define their own.
