STRipy stage fixes and improvements

EddieLF commented 5 months ago

In a recent batch of STRipy stage jobs, the expected outputs were created:

  /io/batch/2d6716/STRipy-EqSXp/out_path: 1 files, 43.1 MB
  /io/batch/2d6716/STRipy-EqSXp/log_path: 1 files, 2.2 kB
  /io/batch/2d6716/STRipy-EqSXp/json_path: 1 files, 42.7 MB

[ ] Pipeline stage bug when writing Analysis: the analysis meta was missing the outlier_loci, outliers_detected, and log_path fields, despite these being in the update_meta call passed to the stage decorator. https://github.com/populationgenomics/production-pipelines/blob/main/cpg_workflows/stages/stripy.py#L24
[ ] Analysis meta field improvement: the JSON path should be recorded in the analysis output or meta field alongside the log_path and stripy_html file.
[ ] Storage optimisation: the JSON file should be pruned. It holds a few hundred kB of valuable summary data and a few dozen MB of SVG blob data. If the SVG data is not important, which I suspect is the case because these files are not accessed after the HTML is created, then we can remove those fields before writing the JSON.
[ ] Data analysis and report improvements: the JSON file holds key summary data that could be used to better filter the reports.
E.g.
- Flagging entries where IsPopulationOutlier = true or where Flag = "1", "2" or "3".
- A frequency histogram of the Repeats value for each gene where Filter = "PASS"
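A rough sketch of both ideas, in case it helps scope the work. It assumes each locus result has already been flattened into a plain dict carrying the fields named above (IsPopulationOutlier, Flag, Repeats, Filter) plus a hypothetical "Gene" key; the real layout of the STRipy JSON has not been checked here, so the extraction step is deliberately left out. The example records and the "LowDepth" filter value are made up.

```python
from collections import Counter, defaultdict

# Flag values treated as worth a closer look, as suggested above.
FLAGGED_VALUES = {"1", "2", "3"}


def is_flagged(record):
    """Return True for a population outlier or a record with Flag "1", "2" or "3"."""
    return (
        record.get("IsPopulationOutlier") is True
        or str(record.get("Flag")) in FLAGGED_VALUES
    )


def repeat_histograms(records):
    """Count Repeats values per gene, using only records where Filter == "PASS"."""
    histograms = defaultdict(Counter)
    for record in records:
        if record.get("Filter") == "PASS" and record.get("Repeats") is not None:
            # "Gene" is a hypothetical key name, not confirmed against STRipy output.
            histograms[record["Gene"]][record["Repeats"]] += 1
    return histograms


# Made-up records for illustration only; not real STRipy output.
records = [
    {"Gene": "GENE_A", "Repeats": 20, "Filter": "PASS", "Flag": "0", "IsPopulationOutlier": False},
    {"Gene": "GENE_A", "Repeats": 20, "Filter": "PASS", "Flag": "2", "IsPopulationOutlier": False},
    {"Gene": "GENE_A", "Repeats": 85, "Filter": "LowDepth", "Flag": "0", "IsPopulationOutlier": True},
    {"Gene": "GENE_B", "Repeats": 12, "Filter": "PASS", "Flag": "0", "IsPopulationOutlier": False},
]

flagged = [record for record in records if is_flagged(record)]
histograms = repeat_histograms(records)
```

Keeping the flag check separate from the histogram means a non-PASS record can still be flagged for review without skewing the frequency counts.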

Thanks for these suggestions @SamBryen.

EddieLF commented 5 months ago

A simple function to remove the SVG blobs from the JSON (thanks Claude 3!)

import json

def remove_svg_keys(json_data):
    """Recursively drop every "SVG" key from nested dicts and lists."""
    if isinstance(json_data, dict):
        return {key: remove_svg_keys(value) for key, value in json_data.items() if key != "SVG"}
    elif isinstance(json_data, list):
        return [remove_svg_keys(item) for item in json_data]
    else:
        return json_data

# Read the JSON file
with open("testfile.stripy.json", "r") as file:
    data = json.load(file)

# Remove the "SVG" key-value pairs
modified_data = remove_svg_keys(data)

# Write the modified JSON to a new file
with open("output.json", "w") as file:
    json.dump(modified_data, file, indent=4)

SamBryen commented 5 months ago

Thanks Ed!

Currently, the only way to look at this STRipy data is to open each HTML file individually and read the report manually. This is fine when analysing families on a case-by-case basis, but if an analyst wants to know whether there are any positive hits for a particular locus in a cohort or across all cohorts, they need to open and look through every single HTML file, and each one is quite slow to load.

Highlighting which samples have positive hits in which genes in a summary page would allow us to streamline which reports we open and would save a lot of time. Having frequency histograms would give us a sense for what is normal in our cohort and what is likely to be artefactual.

EddieLF commented 4 months ago

Another idea from some of our collaborators: a multisample STRipy summary report for a cohort.

Each column of the report would be a locus analysed by STRipy, with one row per sample. Each data cell would contain the STRipy "values" for that sample at that locus. This way, we can see the STRipy analysis over the entire cohort and filter/colour cells based on importance.
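A minimal sketch of that layout, assuming the per-sample values have already been pulled out of each STRipy JSON into a nested dict of sample ID to locus to value. The function names, sample IDs, and locus names are all hypothetical, and what goes in each cell is left open.

```python
import csv
import io


def cohort_table(per_sample, missing=""):
    """Build a header and rows with one row per sample and one column per locus."""
    # Union of loci across samples, so a locus missing in one sample still gets a column.
    loci = sorted({locus for values in per_sample.values() for locus in values})
    header = ["sample"] + loci
    rows = [
        [sample] + [values.get(locus, missing) for locus in loci]
        for sample, values in sorted(per_sample.items())
    ]
    return header, rows


def to_csv(header, rows):
    """Serialise the table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up values for illustration only.
per_sample = {
    "SAMPLE_2": {"LOCUS_1": 10},
    "SAMPLE_1": {"LOCUS_1": 5, "LOCUS_2": 7},
}
header, rows = cohort_table(per_sample)
report = to_csv(header, rows)
```

CSV is used here only because it needs nothing beyond the standard library; filtering and cell colouring would need an HTML or spreadsheet output on top of the same table.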

Note from Cas:

We will need to build full callset distributions so you can see what are real outliers. Within a data set it will be tempting to say “hey look at these three samples with expansions in STR X, perhaps they are a cluster” when in real life we find 5% of all individuals have similar scores.
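One simple way to put a number on this, sketched under the assumption that we have a list of repeat counts for one locus across the full callset: report the fraction of individuals whose count is at or above the observed one. The function name and the example numbers are made up for illustration.

```python
def fraction_at_or_above(callset_repeats, observed):
    """Return the fraction of callset values that are >= the observed repeat count."""
    if not callset_repeats:
        raise ValueError("callset_repeats must not be empty")
    at_or_above = sum(1 for value in callset_repeats if value >= observed)
    return at_or_above / len(callset_repeats)


# Made-up callset: 95 individuals at 20 repeats and 5 at 60 repeats.
callset = [20] * 95 + [60] * 5
share = fraction_at_or_above(callset, 60)
```

A sample with 60 repeats looks striking next to two others in a small cohort, but in this toy callset 5% of individuals share that score, so it would not count as a real outlier.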

populationgenomics / production-pipelines

STRipy stage fixes and improvements #757