populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

STRipy stage fixes and improvements #757

Open EddieLF opened 1 month ago

EddieLF commented 1 month ago

In a recent batch of STRipy stage jobs, the expected outputs were created:

  /io/batch/2d6716/STRipy-EqSXp/out_path: 1 files, 43.1 MB
  /io/batch/2d6716/STRipy-EqSXp/log_path: 1 files, 2.2 kB
  /io/batch/2d6716/STRipy-EqSXp/json_path: 1 files, 42.7 MB

Thanks for these suggestions @SamBryen.

EddieLF commented 1 month ago

A simple function to remove the SVG blobs from the JSON (thanks Claude 3!)

import json

def remove_svg_keys(json_data):
    if isinstance(json_data, dict):
        return {key: remove_svg_keys(value) for key, value in json_data.items() if key != "SVG"}
    elif isinstance(json_data, list):
        return [remove_svg_keys(item) for item in json_data]
    else:
        return json_data

# Read the JSON file
with open("testfile.stripy.json", "r") as file:
    data = json.load(file)

# Remove the "SVG" key-value pairs
modified_data = remove_svg_keys(data)

# Write the modified JSON to a new file
with open("output.json", "w") as file:
    json.dump(modified_data, file, indent=4)
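
As a quick sanity check, the same recursive approach can be generalised to drop any set of keys and run on a small in-memory structure. This is only a minimal sketch: `drop_keys` and the sample dictionary below are hypothetical, and the real STRipy JSON layout may differ.

```python
def drop_keys(data, keys=frozenset({"SVG"})):
    """Recursively copy `data`, omitting any dict entry whose key is in `keys`."""
    if isinstance(data, dict):
        return {k: drop_keys(v, keys) for k, v in data.items() if k not in keys}
    if isinstance(data, list):
        return [drop_keys(item, keys) for item in data]
    return data


# Hypothetical structure, for illustration only; not the real STRipy schema.
sample = {
    "Sample": "testfile",
    "SVG": "<svg>...</svg>",
    "Loci": [
        {"Locus": "HTT", "SVG": "<svg>...</svg>", "Alleles": [{"Repeats": 17}]},
    ],
}

cleaned = drop_keys(sample)
print(cleaned)
# {'Sample': 'testfile', 'Loci': [{'Locus': 'HTT', 'Alleles': [{'Repeats': 17}]}]}
```

The function builds a new structure rather than editing in place, so the original data is left untouched and can still be written out in full if the SVGs are needed elsewhere.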

SamBryen commented 1 month ago

Thanks Ed!

Currently, the only way to look at this STRipy data is to open each HTML file individually and read the report manually. This is fine when analysing families on a case-by-case basis, but if an analyst wants to know whether there are any positive hits for a particular locus in a cohort, or across all cohorts, they would need to open and look through every single HTML file to find out, and each HTML file is quite slow to load.

A summary page highlighting which samples have positive hits in which genes would let us prioritise which reports to open and would save a lot of time. Frequency histograms would give us a sense of what is normal in our cohort and what is likely to be artefactual.
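
One way such a summary could be built is sketched below. It assumes the per-sample results have already been reduced to flat records with the hypothetical fields `sample`, `locus`, `repeats`, and `pathogenic_min`. None of these names come from STRipy's actual output, and the values are illustrative rather than clinical thresholds, so the extraction step would need adapting to the real JSON.

```python
from collections import Counter, defaultdict

# Hypothetical flattened records; field names and values are illustrative only.
records = [
    {"sample": "S1", "locus": "HTT", "repeats": 17, "pathogenic_min": 40},
    {"sample": "S1", "locus": "FXN", "repeats": 9, "pathogenic_min": 66},
    {"sample": "S2", "locus": "HTT", "repeats": 45, "pathogenic_min": 40},
    {"sample": "S3", "locus": "HTT", "repeats": 17, "pathogenic_min": 40},
]


def summarise(records):
    """Return (positive hits per locus, repeat-count histogram per locus)."""
    hits = defaultdict(list)           # locus -> samples at or above the cutoff
    histograms = defaultdict(Counter)  # locus -> {repeat count: occurrences}
    for rec in records:
        histograms[rec["locus"]][rec["repeats"]] += 1
        if rec["repeats"] >= rec["pathogenic_min"]:
            hits[rec["locus"]].append(rec["sample"])
    return dict(hits), dict(histograms)


hits, histograms = summarise(records)
print(hits)                      # {'HTT': ['S2']}
print(dict(histograms["HTT"]))   # {17: 2, 45: 1}
```

Here `hits` would tell an analyst which reports are worth opening for a given locus, while the per-locus counts are the raw data for a frequency histogram across the cohort.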