populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Remove hefty analysis meta #717

Closed MattWellie closed 1 month ago

MattWellie commented 2 months ago

Slack: https://centrepopgen.slack.com/archives/C018KFBCR1C/p1714624645731339

TL;DR

We're shoving a lot of stuff into analysis entry meta, and it's making metamist really unhappy

Hefty Analyses

@illusional has identified some MASSIVE analysis entries. Part of the reason appears to be sections of code where we transcribe entire QC reports into the analysis meta:

So... we should stop doing that...

SG IDs

We take all the SG IDs relevant to an analysis and write them into the analysis meta. This is done for all Cohort and Dataset stages:

These methods are called as part of the analysis entry creation process: here

So... stop doing that as well. If we need SG IDs to be present, there is already a top-level field sequencing_group_ids designed to hold that information. Once we fully adopt custom cohorts we can accomplish this using the Custom Cohort ID/Name, so that these will be super lightweight. The Analysis entry already has an attribute ready for cohort_ids: here
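As a rough illustration of the idea above, here is a minimal sketch that keeps SG IDs out of the meta and carries them in the top-level field instead. Everything here is an assumption for illustration: `build_analysis_payload` is a hypothetical helper, the payload is a plain dict rather than a real metamist client model, and the only names taken from this issue are `sequencing_group_ids` and the `sequencing_groups` meta key.

```python
from typing import Any, Optional


def build_analysis_payload(
    analysis_type: str,
    output: str,
    sequencing_group_ids: list,
    meta: Optional[dict] = None,
) -> dict:
    """Build a lightweight analysis payload (hypothetical shape, not the metamist API)."""
    # Keys we no longer want inside meta; extend as other heavy entries are found.
    heavy_keys = {"sequencing_groups"}
    slim_meta: dict = {
        key: value for key, value in (meta or {}).items() if key not in heavy_keys
    }
    return {
        "type": analysis_type,
        "output": output,
        # SG IDs live in the dedicated top-level field, de-duplicated and sorted.
        "sequencing_group_ids": sorted(set(sequencing_group_ids)),
        "meta": slim_meta,
    }


payload = build_analysis_payload(
    "qc",
    "gs://hypothetical-bucket/qc/report.html",  # placeholder path
    ["CPG23456", "CPG12345", "CPG12345"],
    meta={"stage": "QC", "sequencing_groups": ["CPG12345", "CPG23456"]},
)
```

With this shape the meta only holds small, stage-specific values, and the list of SG IDs is stored once.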

illusional commented 2 months ago

FWIW, the analysis object can already be linked against multiple cohort_ids, so also best to avoid putting these in the meta: https://github.com/populationgenomics/metamist/blob/4ae0883d5031f9d49d3edecce24e4f887e1d431e/models/models/analysis.py#L98-L106

EddieLF commented 2 months ago

@MattWellie thanks for the writeup. It's convenient to access the sequencing groups associated with an analysis by looking at analysis.meta.sequencing_groups, which is usually just a list of strings. However, this field is mutable, so it can drift out of sync with the sequencing groups actually linked to the analysis via the analysis_sequencing_group table.

The difference can be seen in some GraphQL queries. The first method reads the sequencing groups from the analysis meta:

query MyQuery {
  project(name: "my-dataset") {
    analyses {
      id
      type
      meta
    }
  }
}
...
        {
          "id": 123,
          "type": "qc",
          "meta": {
            "sequencing_groups": [
              "CPG12345",
              "CPG23456",
              "CPG34567"
            ]
          }
        },
...

The second method queries the sequencing groups linked to the analysis directly, instead of reading them from the meta:

query MyQuery {
  project(name: "my-dataset") {
    analyses {
      id
      type
      sequencingGroups {
        id
      }
    }
  }
}
...
        {
          "id": 123,
          "type": "qc",
          "sequencingGroups": [
            {"id": "CPG12345"},
            {"id": "CPG23456"},
            {"id": "CPG34567"}
          ]
        },
...

The second way is more reliable because it reflects the actual links in the database, although sadly you have to parse out the IDs rather than getting a nice list. If we plan to do away with the first method entirely, i.e. no more SG IDs in meta, then some scripts and modules across the codebases will need to be updated to get IDs via the second method.
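Parsing the IDs out of the second query's result is only a couple of lines. A minimal sketch, assuming the response has already been loaded into a Python dict and is wrapped in the usual GraphQL `data` key (the excerpt above elides the outer layers, so that wrapping is an assumption); `sg_ids_by_analysis` is a hypothetical helper name:

```python
def sg_ids_by_analysis(response: dict) -> dict:
    """Map each analysis ID to a flat list of its linked sequencing group IDs."""
    analyses = response["data"]["project"]["analyses"]
    return {
        analysis["id"]: [sg["id"] for sg in analysis.get("sequencingGroups", [])]
        for analysis in analyses
    }


# Example input mirroring the shape of the second query's result.
example_response = {
    "data": {
        "project": {
            "analyses": [
                {
                    "id": 123,
                    "type": "qc",
                    "sequencingGroups": [
                        {"id": "CPG12345"},
                        {"id": "CPG23456"},
                        {"id": "CPG34567"},
                    ],
                }
            ]
        }
    }
}

ids = sg_ids_by_analysis(example_response)
# ids[123] is ["CPG12345", "CPG23456", "CPG34567"]
```

Scripts that currently read meta["sequencing_groups"] could swap in something like this once they switch to the second query.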

MattWellie commented 1 month ago

I think that from a 'no longer generate huge clunky data' perspective this is solved. @illusional was planning to discard all the previous weighty meta from existing analyses, but that's not directly relevant to this issue.