populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch

STRipy stage fixes and improvements #757

Open EddieLF opened 5 months ago

EddieLF commented 5 months ago

In a recent batch of STRipy stage jobs, the expected outputs were created:

  /io/batch/2d6716/STRipy-EqSXp/out_path: 1 files, 43.1 MB
  /io/batch/2d6716/STRipy-EqSXp/log_path: 1 files, 2.2 kB
  /io/batch/2d6716/STRipy-EqSXp/json_path: 1 files, 42.7 MB

Thanks for these suggestions @SamBryen.

EddieLF commented 5 months ago

A simple function to remove the SVG blobs from the JSON (thanks Claude 3!)

import json


def remove_svg_keys(json_data):
    """Return a copy of json_data with every "SVG" key removed, at any nesting depth."""
    if isinstance(json_data, dict):
        return {key: remove_svg_keys(value) for key, value in json_data.items() if key != "SVG"}
    elif isinstance(json_data, list):
        return [remove_svg_keys(item) for item in json_data]
    else:
        # Scalars (str, int, float, bool, None) are returned unchanged
        return json_data


# Read the JSON file
with open("testfile.stripy.json", "r", encoding="utf-8") as file:
    data = json.load(file)

# Remove the "SVG" key-value pairs
modified_data = remove_svg_keys(data)

# Write the modified JSON to a new file
with open("output.json", "w", encoding="utf-8") as file:
    json.dump(modified_data, file, indent=4)

SamBryen commented 5 months ago

Thanks Ed!

Currently, the only way to look at this STRipy data is to open each HTML file individually and read the report manually. This is fine when analysing families on a case-by-case basis, but if an analyst wants to know whether there are any positive hits for a particular locus in a cohort or across all cohorts, they would need to open and look through every single HTML file to find out, and each HTML file is quite slow to load.

Highlighting which samples have positive hits in which genes in a summary page would allow us to streamline which reports we open and would save a lot of time. Having frequency histograms would give us a sense for what is normal in our cohort and what is likely to be artefactual.
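As a rough sketch of what the summary page could be built from: the snippet below assumes the per-sample results have already been reduced to a plain mapping of sample ID to {locus: repeat count}. That shape, the function names, the locus names and the cutoffs are all hypothetical placeholders for illustration, not the actual STRipy JSON schema or real pathogenic thresholds.

```python
from collections import Counter


def positive_hits(per_sample, thresholds):
    """Map each locus to the samples whose repeat count meets its cutoff.

    per_sample: {sample_id: {locus: repeat_count}}  (assumed shape)
    thresholds: {locus: minimum_count_flagged}      (illustrative cutoffs)
    """
    hits = {}
    for sample, calls in sorted(per_sample.items()):
        for locus, count in calls.items():
            cutoff = thresholds.get(locus)
            if cutoff is not None and count >= cutoff:
                hits.setdefault(locus, []).append(sample)
    return hits


def repeat_histogram(per_sample, locus):
    """Count how many samples carry each repeat count at one locus."""
    return Counter(calls[locus] for calls in per_sample.values() if locus in calls)


# Made-up example data; locus names and cutoffs are placeholders.
per_sample = {
    "S1": {"LOCUS_A": 12, "LOCUS_B": 50},
    "S2": {"LOCUS_A": 12, "LOCUS_B": 20},
    "S3": {"LOCUS_A": 40},
}
thresholds = {"LOCUS_A": 36, "LOCUS_B": 45}

hits = positive_hits(per_sample, thresholds)
histogram = repeat_histogram(per_sample, "LOCUS_A")
```

Here `hits` tells us which reports are worth opening per locus, and `histogram` is the raw input for a frequency plot of that locus across the cohort.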

EddieLF commented 4 months ago

Another idea from some of our collaborators: a multisample STRipy summary report for a cohort.

Each column of the report would be a particular locus analysed by STRipy. Then, we have one row per sample. The data cells will contain the STRipy "values" for that sample at that locus. This way, we can see the STRipy analysis over the entire cohort and filter/colour cells based on importance.
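A minimal sketch of that layout, assuming the STRipy "values" have already been pulled out into a mapping of sample ID to {locus: value}. The input shape and the helper names `build_cohort_table` and `to_tsv` are hypothetical; how the values are extracted from the real STRipy output is left open here.

```python
import csv
import io


def build_cohort_table(per_sample):
    """Build a header and rows with one row per sample and one column per locus.

    per_sample: {sample_id: {locus: value}}  (assumed shape)
    Loci missing for a sample are left as empty cells.
    """
    loci = sorted({locus for calls in per_sample.values() for locus in calls})
    header = ["sample"] + loci
    rows = [
        [sample] + [calls.get(locus, "") for locus in loci]
        for sample, calls in sorted(per_sample.items())
    ]
    return header, rows


def to_tsv(header, rows):
    """Serialise the table as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example data for two samples.
per_sample = {
    "S2": {"HTT": 19},
    "S1": {"HTT": 42, "FMR1": 30},
}
header, rows = build_cohort_table(per_sample)
tsv_text = to_tsv(header, rows)
```

A flat table like this can be opened in a spreadsheet or rendered to HTML, where filtering and cell colouring can be layered on top.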

Note from Cas:

We will need to build full callset distributions so you can see what are real outliers. Within a data set it will be tempting to say “hey look at these three samples with expansions in STR X, perhaps they are a cluster” when in real life we find 5% of all individuals have similar scores.
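To make Cas's point concrete, here is a small sketch of checking one observed value against the full callset distribution for that locus. The function name and the numbers are made up for illustration; the real distribution would come from the full callset, not a single data set.

```python
def fraction_at_or_above(callset_values, observed):
    """Fraction of the callset with a value at least as large as `observed`."""
    if not callset_values:
        raise ValueError("callset_values must not be empty")
    return sum(value >= observed for value in callset_values) / len(callset_values)


# Made-up callset: 95 individuals at 20 repeats, 5 individuals at 45 repeats.
callset_values = [20] * 95 + [45] * 5

# A sample scoring 40 looks striking on its own, but 5% of the callset
# scores at least as high, so it is not a rare outlier.
fraction = fraction_at_or_above(callset_values, 40)
```

A small fraction suggests a genuine outlier, while a value shared by several percent of individuals is more likely ordinary variation or an artefact.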