MattWellie closed this issue 1 month ago
FWIW, the analysis object can already be linked against multiple cohort_ids, so it's also best to avoid putting these in the meta: https://github.com/populationgenomics/metamist/blob/4ae0883d5031f9d49d3edecce24e4f887e1d431e/models/models/analysis.py#L98-L106
@MattWellie thanks for the writeup. Although it's convenient to access the sequencing groups associated with an analysis by looking at analysis.meta.sequencing_groups (which is usually just a list of strings), this field is mutable and risks being inaccurate compared to retrieving the sequencing groups associated with an analysis directly via the analysis_sequencing_group table.
The difference can be seen in two GraphQL queries. The first method reads the sequencing groups from the analysis meta:
query MyQuery {
  project(name: "my-dataset") {
    analyses {
      id
      type
      meta
    }
  }
}
...
{
  "id": 123,
  "type": "qc",
  "meta": {
    "sequencing_groups": [
      "CPG12345",
      "CPG23456",
      "CPG34567"
    ]
  }
},
...
The second method accesses the sequencing groups associated with the analysis directly, instead of reading them from the meta:
query MyQuery {
  project(name: "my-dataset") {
    analyses {
      id
      type
      sequencingGroups {
        id
      }
    }
  }
}
...
{
  "id": 123,
  "type": "qc",
  "sequencingGroups": [
    {"id": "CPG12345"},
    {"id": "CPG23456"},
    {"id": "CPG34567"}
  ]
},
...
The second way is more reliable because it reads the actual links from the analysis_sequencing_group table, although sadly you have to parse the IDs out of the returned objects rather than getting a nice list. If we plan to do away with the first method entirely, i.e. no more SG IDs in meta, then some scripts and modules across the codebases will need to be updated to get IDs via the second method.
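For scripts that currently read the meta list, the parsing step is small. Below is a minimal sketch in Python, assuming the analysis has already been fetched and decoded into a dict shaped like the responses above. The helper names `linked_sg_ids` and `meta_sg_ids` are hypothetical and not part of metamist, and the sample data is made up to show a stale meta list.

```python
from typing import Any


def linked_sg_ids(analysis: dict[str, Any]) -> list[str]:
    """Flatten the sequencingGroups objects of one analysis into a list of IDs."""
    return [sg["id"] for sg in analysis.get("sequencingGroups") or []]


def meta_sg_ids(analysis: dict[str, Any]) -> list[str]:
    """Read the (mutable, possibly stale) ID list stored in the analysis meta."""
    meta = analysis.get("meta") or {}
    return list(meta.get("sequencing_groups") or [])


# Hypothetical analysis entry: the meta list is missing one linked SG.
analysis = {
    "id": 123,
    "type": "qc",
    "meta": {"sequencing_groups": ["CPG12345", "CPG23456"]},
    "sequencingGroups": [
        {"id": "CPG12345"},
        {"id": "CPG23456"},
        {"id": "CPG34567"},
    ],
}

ids = linked_sg_ids(analysis)
# IDs present in only one of the two sources, i.e. where meta has drifted.
drift = sorted(set(ids) ^ set(meta_sg_ids(analysis)))

print(ids)
print(drift)
```

A check like `drift` could also be used to find existing analyses whose meta disagrees with the linked sequencing groups before the meta field is dropped.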
I think that from a 'no longer generate huge clunky data' perspective this is solved. @illusional was planning to discard all the previous weighty meta from existing analyses, but that's not directly relevant to this issue.
Slack: https://centrepopgen.slack.com/archives/C018KFBCR1C/p1714624645731339
TL;DR
We're shoving a lot of stuff into the analysis entry meta, and it's making metamist really unhappy.
Hefty Analyses
@illusional has identified some MASSIVE analysis entries. Part of the reason appears to be sections of code where we transcribe entire QC reports into the analysis meta:
So... we should stop doing that...
SG IDs
We take all the SG IDs relevant to an analysis and write them into the analysis meta. This is done for all Cohort and Dataset stages:
These methods are called as part of the analysis entry creation process: here
So... stop doing that as well. If we need SG IDs to be present, there is already a top-level field, sequencing_group_ids, designed to hold that information. Once we fully adopt custom cohorts we can accomplish this using the Custom Cohort ID/Name, so that these will be super lightweight. The Analysis entry already has an attribute ready for cohort_ids: here
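As a rough illustration of the change, the sketch below moves SG IDs out of the meta and into the top-level fields before an analysis entry is created. The field names sequencing_group_ids and cohort_ids come from this issue. Everything else is an assumption: the helper `build_analysis_fields`, the plain-dict shape, and the example IDs are hypothetical, and this is not the metamist API or the pipeline's actual code.

```python
from typing import Any, Optional


def build_analysis_fields(
    meta: dict[str, Any],
    sequencing_group_ids: Optional[list[str]] = None,
    cohort_ids: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Return analysis fields with SG IDs kept out of the meta (hypothetical helper)."""
    # Drop the bulky ID list from meta; every other meta key is kept untouched.
    slim_meta = {k: v for k, v in meta.items() if k != "sequencing_groups"}

    # Prefer explicitly passed IDs; otherwise fall back to whatever meta carried.
    if sequencing_group_ids is None:
        sequencing_group_ids = list(meta.get("sequencing_groups") or [])

    return {
        "meta": slim_meta,
        "sequencing_group_ids": list(sequencing_group_ids),
        "cohort_ids": list(cohort_ids or []),
    }


# Hypothetical input: a meta dict that still carries the SG ID list.
fields = build_analysis_fields(
    meta={"stage": "qc", "sequencing_groups": ["CPG12345", "CPG23456"]},
    cohort_ids=["COH1"],  # made-up cohort ID, for illustration only
)
print(fields)
```

Whether an entry should carry both sequencing_group_ids and cohort_ids, or only the cohort once custom cohorts are adopted, is a design decision this sketch does not settle.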