populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Filter to new results #164

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

Proposed Changes

@cassimons in particular, can you check that this implementation feels fit for purpose?

  1. The config can now contain the key dataset_specific.historic_results - this will be the path to a cloud folder containing cumulative results files - this would be a candidate use of metamist's analysis objects if we wanted to tie AIP to CPG.
  2. When AIP runs, the results will be cleaned against these prior results:
    a. The latest file in the historic_results folder will be identified using the stat().st_mtime attribute of the files (most recent modification time)
    b. This check doesn't fail if the directory is empty, but in that situation None will be returned instead of a file path
    c. If no file was found, the current data is saved in a minimal representation (See Below) into that folder, and all current results will reach the report
    d. If a file is returned, it is loaded as a dictionary, and the current data is filtered against it (See Below... again)
    e. The historic data is updated to include all new results not previously seen, and is saved back into the same folder (under a new file name, which will include the date/time to prevent write clashes)
    f. Filtered results are returned to be made into a report
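As a rough illustration of the file lookup described above, the following is a minimal sketch using a local folder and the standard library only. The function names, the `.json` extension and the file-name pattern are assumptions made for this example, not the PR's actual code; a cloud folder would need a cloud-aware path object in place of `pathlib.Path`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def find_latest_file(results_folder: Path, ext: str = "json") -> Optional[Path]:
    """Return the most recently modified file in the folder, or None if there is none."""
    candidates = [p for p in results_folder.glob(f"*.{ext}") if p.is_file()]
    if not candidates:
        # An empty folder is not an error: the caller treats None as "no history yet"
        return None
    # st_mtime is the last modification time, so max() picks the newest file
    return max(candidates, key=lambda p: p.stat().st_mtime)


def new_results_path(results_folder: Path) -> Path:
    """Build a date/time-stamped file name so successive runs never overwrite each other."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    # "historic_" is a hypothetical prefix chosen for this sketch
    return results_folder / f"historic_{stamp}.json"
```

Returning `None` rather than raising keeps the first run on a new dataset on the same code path as later runs.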

I could write the results of any given run out before and after this filter is applied, if that might be of interest.

See Below

The historic data doesn't require all the variant details, so we only save 3 items per variant: coordinates (in chr-pos-ref-alt form), assigned categories, and supporting variants (comp-het partners). This is the only data written as the 'historic' results, which means prior external results could be added manually as a simple bit of JSON (probably not a likely use case). Each time a new reportable variant is seen, this data is extended. If we see the same variant but in a new category, the on-file cumulative representation only grows by ~1 Byte.

{
    'sampleID': {
        'chr-pos-ref-alt': {
            'categories': ['1', '4', ...],
            'support_vars': ['chr-pos-ref-alt', ...]
        },
        ...
    }
}
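To make the reduction to this minimal form concrete, here is a small sketch. The input layout (a list of variant dictionaries per sample, with `coords`, `categories` and `support_vars` keys alongside other details) and the function name `to_minimal` are assumptions for illustration; the real AIP result objects may be shaped differently.

```python
def to_minimal(results: dict) -> dict:
    """Reduce full per-sample results to the three fields kept in the historic file."""
    minimal = {}
    for sample, variants in results.items():
        minimal[sample] = {
            # key each variant by its chr-pos-ref-alt string
            var["coords"]: {
                "categories": sorted(var["categories"]),
                # variants without a comp-het partner get an empty list
                "support_vars": sorted(var.get("support_vars", [])),
            }
            for var in variants
        }
    return minimal
```

The output contains only strings, lists and dictionaries, so it can be written out with `json.dump` as-is.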

See Below... Again

When we filter current results against historic results, it goes something like this:

  1. Sample: Have we seen variants for this sample previously? If not, keep everything
  2. Variant: Have we seen this variant in this sample previously? If not, keep this variant in full
  3. Partners: When we previously saw this variant - was it supported in a comp-het?
    • if yes, but this time it's supported by a different comp-het pair, keep the whole variant and all categories
    • if yes, and the same partner is seen again, go to checking categories
  4. Categories: Comparing current results with the previous sighting of this variant:
    • are there any new categories assigned? If so, keep only the new categories
    • if not, we don't want to see this variant again - remove from the report
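The four checks above could be sketched roughly as follows, with both inputs in the minimal format. This is an illustration rather than the PR's implementation: the function name is hypothetical, and two behaviours are assumptions, namely that a variant previously seen without any partner falls straight through to the category check, and that samples left with no variants are dropped from the output.

```python
def filter_new_results(current: dict, historic: dict) -> dict:
    """Keep only results not already recorded in the historic data."""
    filtered = {}
    for sample, variants in current.items():
        seen_sample = historic.get(sample)
        if seen_sample is None:
            # 1. Sample: never seen before, keep everything
            filtered[sample] = variants
            continue

        kept = {}
        for coords, details in variants.items():
            previous = seen_sample.get(coords)
            if previous is None:
                # 2. Variant: new for this sample, keep it in full
                kept[coords] = details
                continue

            # 3. Partners: previously in a comp-het, now with a different partner
            new_partners = set(details["support_vars"]) - set(previous["support_vars"])
            if previous["support_vars"] and new_partners:
                kept[coords] = details
                continue

            # 4. Categories: keep only categories not assigned before
            new_cats = [c for c in details["categories"] if c not in previous["categories"]]
            if new_cats:
                kept[coords] = {**details, "categories": new_cats}
            # otherwise nothing is new, so the variant is left out of the report

        if kept:
            filtered[sample] = kept
    return filtered
```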

MattWellie commented 1 year ago

I've just run this twice on the validation dataset (a run which had previously completed, so only the MOI/reporting stage ran)

cassimons commented 1 year ago

I think this logic is sound for alerting. However, an alternative final output I believe we should also support is exporting all variants but adding a field to each that records the date of the report in which they were first identified (for this category?). This means an analyst can a) quickly check for new variants but b) seamlessly progress to look at all previous variants.

I think your existing historic_results format would work fine for this, just by adding the date_reported field/s.

We could also think about adding this date-based recording to our gold stars - not exactly sure how that would work but it is worth considering.

MattWellie commented 1 year ago

@cassimons I've altered the logic a bit, so that each category in the cumulative results now has an associated 'date of first discovery'. Currently this will just be stored in the historic results file. In future (i.e. when the report supports sorting/filtering within the results display) it will be easy enough to annotate each of the variants from this AIP run with the date it was previously found, instead of just doing a blanket removal of anything previously seen, or to do both and write two outputs. In either case, this PR will start writing date-tagged variant results into the results folder, and we can incorporate that however we like in future.
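One way the date-tagged cumulative data might look is sketched below. The layout, in which `categories` becomes a mapping from category to the date it was first seen, and the function name `update_historic` are assumptions for illustration; the thread does not show the actual on-file format.

```python
from datetime import date
from typing import Optional


def update_historic(historic: dict, current: dict, today: Optional[str] = None) -> dict:
    """Add new results to the cumulative data, tagging each new category with a date."""
    today = today or date.today().isoformat()
    for sample, variants in current.items():
        sample_history = historic.setdefault(sample, {})
        for coords, details in variants.items():
            record = sample_history.setdefault(
                coords, {"categories": {}, "support_vars": []}
            )
            for category in details["categories"]:
                # setdefault keeps the original date if the category was seen before
                record["categories"].setdefault(category, today)
            for partner in details["support_vars"]:
                if partner not in record["support_vars"]:
                    record["support_vars"].append(partner)
    return historic
```

Because the first-seen date is never overwritten, the same file could later support either filtering out old results or annotating every variant with its discovery date.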

cassimons commented 1 year ago

Awesome, thank you!