sanger / sequencescape

Web based LIMS
MIT License

Y24-086 As NPG we would like an extra "Contaminated Human Data Access Group" field on the New Study Page so that access to the "likely human" read data that NPG already creates separately can be controlled. #4115

Open TWJW-SANGER opened 1 month ago

TWJW-SANGER commented 1 month ago

User story
As NPG we would like an extra "Contaminated Human Data Access Group" field on the New Study Page so that access to the "likely human" read data that NPG already creates separately can be controlled.

Who are the primary contacts for this story
Jillian D, David J, SSR team

Who is the nominated tester for UAT
The SSR team (Liz C)

Acceptance criteria
To be considered successful the solution must allow:

Additional context
This story has been generated from RT#801935.

When processing non-human samples that have been extracted from humans, e.g. Covid virus extracted from human saliva, the sequence data files can contain fragments of human DNA/RNA in addition to the viral or parasitic DNA/RNA.

Studies producing non-human sequence data DO NOT have the ethical approval or legal consent to publish any human sequence data. NPG offers a service to remove human-derived DNA/RNA fragments from the data output of the sequencing machine.

The existing Data Access Group controls access to the data set with human reads removed.

There is a need for different, more privileged user groups to access the data sets containing human reads in order to perform quality checks.

dkj commented 1 month ago

~Hold until we revise the current description, please.~ [response added in next comment now]

dkj commented 1 month ago

User story
As NPG we would like an extra "Contaminated Data Access Group" field on the New Study Page so that we can control access to non-human sequence data that contains reads from contaminating human DNA.

Alas, I find that description confusing too...

Recommend: 'As NPG we would like an extra "Contaminated human data access group" field on Study pages so that DNAP SSRs (and NPG in some rarer circumstances) can control access to the "likely human" read data NPG already creates separately to the main data product when the existing study field "Does this study contain samples that are contaminated with human DNA which must be removed prior to analysis?" is marked "yes". There should be a new contaminated_human_data_access_group field in the MLWH study table which propagates the content of this new string field. This nomenclature matches that already in use e.g. the contaminated_human_dna which reflects the "contain samples that are contaminated" field and determines whether NPG split the data, and data_access_group which determines access to the main data product.'

  • Change the text on the New Studies page from "Do any of the samples in this study contain human DNA?" to "Does the final data set contain human DNA?" to clearly indicate what is being referred to (samples or data)

No! Wrong field - I don't think NPG use that for anything - I guess this was/is to help with assessing compliance - maybe data release team use it?

  • Change the text under the New Studies page Data Access group field from "This field helps control access to product data in the iRODs seq zone." to "This field helps control access to the final product data in the iRODs seq zone."

No - I don't think this helps as we don't have a concept of "final data product", just "data product" (it might make sense if the diagram was a good representation but that doesn't represent what we do for Illumina at the moment - although it could represent a path for other platforms or future processing).

  • Add an additional field "Contaminated Data access group" with instructions that this allows the specified group access to the data containing human sequence reads.

Prefer "Contaminated human data access group" with explanatory "allows the specified Unix groups and users access to data separated out from the main data product as likely contaminated with human - typically rarely used and typically only used for validating the separation process as we may not have the ethical or legal authorisation to use it beyond that"

  • The "Contaminated Data access group" value should be exported to the MLWH database as an extra column in the study table contaminated_data_access_group as VARCHAR(255) defaulting to NULL.

contaminated_human_data_access_group fits better with existing MLWH study fields.
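As a rough illustration of the proposed column (not the actual MLWH migration, which is not shown in this thread), the sketch below uses Python's built-in `sqlite3` as a stand-in for the MLWH database. The cut-down `study` table, the `id` column and the group names are hypothetical; only the column names `contaminated_human_dna`, `data_access_group` and `contaminated_human_data_access_group`, and the `VARCHAR(255)` defaulting to `NULL`, come from the discussion above.

```python
import sqlite3

# In-memory SQLite stands in for the real MLWH database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical, cut-down "study" table; the real table has many more columns.
conn.execute(
    """
    CREATE TABLE study (
        id INTEGER PRIMARY KEY,
        name VARCHAR(255),
        contaminated_human_dna VARCHAR(20),
        data_access_group VARCHAR(255)
    )
    """
)
conn.execute(
    "INSERT INTO study (name, contaminated_human_dna, data_access_group) "
    "VALUES (?, ?, ?)",
    ("Example study", "Yes", "example_group"),
)

# The proposed change: a nullable string column defaulting to NULL.
conn.execute(
    "ALTER TABLE study "
    "ADD COLUMN contaminated_human_data_access_group VARCHAR(255) DEFAULT NULL"
)

# Existing rows hold NULL in the new column until a group is entered.
before = conn.execute(
    "SELECT data_access_group, contaminated_human_data_access_group FROM study"
).fetchone()

conn.execute(
    "UPDATE study SET contaminated_human_data_access_group = ? WHERE id = 1",
    ("example_qc_group",),
)
after = conn.execute(
    "SELECT data_access_group, contaminated_human_data_access_group FROM study"
).fetchone()
```

Existing studies are unaffected by the new column: `data_access_group` keeps its value and the new field stays `NULL` until someone fills it in.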

The overall process then looks like:

Ah no - more like:

Sequencer data --- NPG processing -----> Sample 1 -- i -> target1.cram  (data_access_group governed access)
                                   \.             \- R -> human1.cram   (new contaminated_human_data_access_group governed access)
                                     \-> Sample 2 -- O -> target2.cram
                                                  \- D -> human2.cram
                                                     S
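The split in the diagram above could be sketched roughly as follows: each data product looks up its governing study field. This is an illustration only, not NPG's code. The function name `groups_for_product`, the product labels, and the assumption that a field holds whitespace-separated group names are all hypothetical. The thread does not say what an empty field should mean for access, so the sketch simply returns an empty list.

```python
from typing import Dict, List, Optional

# Which study field governs which data product, per the diagram above.
GOVERNING_FIELD = {
    "target": "data_access_group",
    "human": "contaminated_human_data_access_group",
}


def groups_for_product(study: Dict[str, Optional[str]], product: str) -> List[str]:
    """Return the group names governing access to one data product.

    Assumes (hypothetically) that the field is a whitespace-separated string.
    """
    field = GOVERNING_FIELD[product]
    raw = study.get(field) or ""  # treat NULL/missing as "no groups listed"
    return raw.split()


# Hypothetical study record with only the existing field filled in.
study = {
    "data_access_group": "team_a team_b",
    "contaminated_human_data_access_group": None,
}
target_groups = groups_for_product(study, "target")
human_groups = groups_for_product(study, "human")
```

Keeping the two lookups separate mirrors the intent of the story: filling in the existing field never implies access to the separated human reads.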

The existing questions on the New Studies page are ambiguous. They can be read as "does the sample provided contain human DNA" rather than "should the final data set contain human DNA". The acceptance criteria attempt to make this clearer.

Like the idea! We have to be super careful... Also MLWH comments on the field should perhaps match any descriptive help.