MattWellie commented 1 year ago

Fixes

closes #155
enables #154

Proposed Changes

@cassimons in particular, can you check that this implementation feels fit for purpose?

This is a pretty THICC change, with 2 Cs
Introduces the concept of filtering current results by removing previously seen results
Admittedly this all needs documenting, but the docs need a user-focused overhaul anyway so this will be done at that time

The config can now contain the key dataset_specific.historic_results: the path to a cloud folder containing cumulative results files. This would be a candidate use of metamist's analysis objects if we wanted to tie AIP to CPG.
When AIP runs, the results will be cleaned against these prior results:
a. The latest file in the historic_results folder will be identified using the stat().st_mtime attribute of the files (most recent modification time)
b. This check doesn't fail if the directory is empty, but in that situation None will be returned instead of a file path
c. If no file was found, the current data is saved in a minimal representation (See Below) into that folder, and all current results will reach the report
d. If a file is returned, it is loaded as a dictionary, and the current data is filtered against it (See Below... again)
e. The historic data is updated to include all new results not previously seen, and is saved back into the same folder (under a new file name, which will include the date/time to prevent write clashes)
f. Filtered results are returned to be made into a report
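Steps a, b and e could look roughly like the sketch below. It is only an illustration: it uses a local pathlib.Path in place of the cloud folder, and the function names and the exact timestamp format are hypothetical, not taken from this PR.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def find_latest_file(results_folder: Path) -> Optional[Path]:
    """Return the most recently modified file in the folder, or None if it holds no files."""
    files = [p for p in results_folder.glob('*') if p.is_file()]
    if not files:
        # An empty folder is not an error; the caller treats None as "no history yet"
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


def save_new_historic(results: dict, results_folder: Path) -> Path:
    """Write results to a new date/time-stamped file, leaving earlier files untouched."""
    # Hypothetical naming scheme; the timestamp keeps each run's file name distinct
    stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    new_file = results_folder / f'{stamp}.json'
    new_file.write_text(json.dumps(results, indent=4))
    return new_file
```

Because every save creates a new file, the folder keeps a full history and the latest state is always recoverable by modification time.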

I could write the results of any given run out before and after this filter is applied, if that might be of interest.

See Below

The historic data doesn't require all the variant details, so we only save 3 items per variant: coordinates (in chr-pos-ref-alt form), assigned categories, and supporting variants (comp-het partners). This is the only data written as the 'historic' results, which means prior external results could be added manually by preparing a simple bit of JSON (probably not a likely use case). Each time a new reportable variant is seen, this data is extended. If we see the same variant but in a new category, the on-file cumulative representation only grows by ~1 Byte.

{
    'sampleID': {
        'chr-pos-ref-alt': {
            'categories': ['1', '4', ...],
            'support_vars': ['chr-pos-ref-alt', ...]
        },
        ...
    }
}
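As a rough sketch of how full results might be reduced to the structure above: the input layout (a list of variant records per sample) and the 'coords' field name are assumptions made for illustration, and minimise is a hypothetical helper rather than a function from this PR.

```python
def minimise(results: dict) -> dict:
    """Reduce full per-sample variant records to the three items kept on file."""
    minimal = {}
    for sample, variants in results.items():
        minimal[sample] = {
            # Key each variant by its chr-pos-ref-alt string; drop every other detail
            var['coords']: {
                'categories': sorted(var['categories']),
                'support_vars': sorted(var.get('support_vars', [])),
            }
            for var in variants
        }
    return minimal
```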

See Below... Again

When we filter current results against historic results, it goes something like this:

Sample: Have we seen variants for this sample previously? If not, keep everything
Variant: Have we seen this variant in this sample previously? If not, keep this variant in full
Partners: When we previously saw this variant - was it supported in a comp-het?
- if yes, but this time it's supported by a different comp-het pair, keep the whole variant and all categories
- if yes, and the same partner is seen again, go to checking categories
Categories: Comparing current results with the previous sighting of this variant:
- are there any new categories assigned? If so, keep only the new categories
- if not, we don't want to see this variant again - remove from the report
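The decision order above can be sketched as follows. This is a minimal illustration rather than the code in this PR: it assumes both inputs already use the minimal structure shown earlier, filter_against_historic is a hypothetical name, and it makes two choices the description leaves open (any partner not previously recorded counts as a "different" comp-het, and samples with nothing left are dropped).

```python
def filter_against_historic(current: dict, historic: dict) -> dict:
    """Keep only the parts of the current results not already in the historic results."""
    filtered = {}
    for sample, variants in current.items():
        # Sample: nothing recorded for this sample before, so keep everything
        if sample not in historic:
            filtered[sample] = variants
            continue
        kept = {}
        for var_id, details in variants.items():
            previous = historic[sample].get(var_id)
            # Variant: not seen in this sample before, so keep it in full
            if previous is None:
                kept[var_id] = details
                continue
            # Partners: a comp-het partner not recorded before, keep all categories
            if set(details['support_vars']) - set(previous['support_vars']):
                kept[var_id] = details
                continue
            # Categories: keep only those not assigned on a previous sighting
            new_cats = [c for c in details['categories'] if c not in previous['categories']]
            if new_cats:
                kept[var_id] = {**details, 'categories': new_cats}
        if kept:
            filtered[sample] = kept
    return filtered
```

Running the same data through twice, with the first run's results stored as the history, leaves nothing to report the second time.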

Checklist

[x] Related Issue created
[x] Tests covering new change
[x] Linting checks pass

MattWellie commented 1 year ago

I've just run this twice on the validation dataset (a run which had previously completed, so only the MOI/reporting stage ran)

On the first run, the file gs://cpg-validation-test-analysis/reanalysis/historic_results/2022-12-01_04:57 was written, containing the only sample and variant present on the final report.
On running exactly the same analysis again, the cumulative results were read, all variants were filtered out, and the report is now empty.

cassimons commented 1 year ago

I think this logic is sound for alerting. However, an alternative final output I believe we should also support is exporting all variants but adding a field to each that records the date of the report in which they were first identified (for this category?). This means an analyst can a) quickly check for new variants but b) seamlessly progress to look at all previous variants.

I think your existing historic_results format would work fine for this, just by adding the date_reported field/s.

We could also think about adding this date-based recording to our gold stars - not exactly sure how that would work but it is worth considering.

MattWellie commented 1 year ago

@cassimons I've altered the logic a bit, so that there is an associated 'date of first discovery' with each category in the cumulative results. Currently this will just be stored in the historic results file. In future (i.e. when the report supports sorting/filtering within the results display) it will be easy enough to annotate each of the variants from this AIP run with the date it was previously found instead of just doing a blanket removal of anything previously seen, or to do both and write two outputs. In either case, this PR will start writing date-tagged variant results into the results folder, and we can incorporate that however we like in future.
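For illustration only, one way the date tagging could work is to store each category as a key mapped to the date it was first seen. The layout and the update_historic name below are assumptions, not necessarily what this PR writes to file.

```python
def update_historic(historic: dict, current: dict, today: str) -> dict:
    """Merge current results into the historic record, dating each newly seen category."""
    for sample, variants in current.items():
        sample_record = historic.setdefault(sample, {})
        for var_id, details in variants.items():
            record = sample_record.setdefault(
                var_id, {'categories': {}, 'support_vars': []}
            )
            for category in details['categories']:
                # setdefault leaves an existing first-seen date untouched
                record['categories'].setdefault(category, today)
            for partner in details['support_vars']:
                if partner not in record['support_vars']:
                    record['support_vars'].append(partner)
    return historic
```

With dates kept per category, a later report could either hide previously seen variants or show them all annotated with their first-seen date.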

cassimons commented 1 year ago

Awesome, thank you!

populationgenomics / automated-interpretation-pipeline

Filter to new results #164
