Closed MattWellie closed 1 year ago
I've just run this twice on the validation dataset (a run which had previously completed, so only the MOI/reporting stage ran)
gs://cpg-validation-test-analysis/reanalysis/historic_results/2022-12-01_04:57
was written, containing the only sample & variant present on the final report.

I think this logic is sound for alerting. However, an alternative final output I believe we should also support is exporting all variants, with a field added to each that records the date of the report in which it was first identified (for this category?). This means an analyst can a) quickly check for new variants and b) seamlessly progress to looking at all previous variants.
I think your existing historic_results format would work fine for this, just by adding the date_reported field/s.
We could also think about adding this date-based recording to our gold stars - not exactly sure how that would work but it is worth considering.
@cassimons I've altered the logic a bit, so that there is an associated 'date of first discovery' with each category in the cumulative results. Currently this will just be stored in the historic results file. In future (i.e. when the report supports sorting/filtering within the results display) it will be easy enough to annotate each of the variants from this AIP run with the date it was previously found, instead of just doing a blanket removal of anything previously seen, or to do both and write two outputs. In either case, this PR will start writing date-tagged variant results into the results folder, and we can incorporate that however we like in future.
Awesome, thank you!
Fixes
Proposed Changes
@cassimons in particular, can you check that this implementation feels fit for purpose?
- `dataset_specific.historic_results`: this will be the path to a cloud folder containing cumulative results files. This would be a candidate use of metamist's `analysis` objects if we wanted to tie AIP to CPG.

a. The latest file in the `historic_results` folder will be identified using the `stat().st_mtime` attribute of the files (most recent modification time)
b. This check doesn't fail if the directory is empty, but in that situation `None` will be returned instead of a file path
c. If no file was found, the current data is saved in a minimal representation (See Below) into that folder, and all current results will reach the report
d. If a file is returned, it is loaded as a dictionary, and the current data is filtered against it (See Below... again)
e. The historic data is updated to include all new results not previously seen, and is saved back into the same folder (under a new file name, which will include the date/time to prevent write clashes)
f. Filtered results are returned to be made into a report

I could write the results of any given run out before and after this filter is applied, if that might be of interest.
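For reviewers, steps a, b and e could be sketched roughly as below. This is a minimal illustration and not the code in this PR: it uses the standard library `pathlib.Path` on a local folder rather than a cloud path, and the function names `find_latest_file` and `new_results_path` are hypothetical. The timestamp format is assumed from the example file name `2022-12-01_04:57`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def find_latest_file(results_folder: Path) -> Optional[Path]:
    """Return the most recently modified file in the folder.

    Returns None (rather than raising) when the folder holds no files.
    """
    candidates = [p for p in results_folder.iterdir() if p.is_file()]
    if not candidates:
        return None
    # st_mtime is the last modification time, in seconds since the epoch
    return max(candidates, key=lambda p: p.stat().st_mtime)


def new_results_path(results_folder: Path, now: Optional[datetime] = None) -> Path:
    """Build a date/time-stamped file name so successive runs don't overwrite each other."""
    now = now or datetime.now()
    # Assumed format, matching names like 2022-12-01_04:57
    return results_folder / now.strftime("%Y-%m-%d_%H:%M")
```

Returning `None` for an empty folder lets the caller treat "no history yet" as an ordinary branch (step c) instead of an error.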
See Below
The historic data doesn't require all the variant details, so we only save 3 items per variant: coordinates (in chr-pos-ref-alt form), assigned categories, and supporting variants (comp-het). This is the only data written as the 'historic' results, which means prior external results could be added manually as a simple bit of JSON (probably not a likely use case). Each time a new reportable variant is seen, this data is extended. If we see the same variant but in a new category, the on-file cumulative representation only grows by ~1 Byte.
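To make the shape of that record concrete, here is a hedged sketch of a date-tagged minimal representation and how it might be extended and filtered against. The dictionary layout, the key names `categories` and `support_vars`, and the function `filter_and_update` are all assumptions for illustration, not the format this PR writes. In particular, how a previously seen variant with a new category is treated is assumed here: the sketch simply keeps it in the filtered output.

```python
from copy import deepcopy


def filter_and_update(current: dict, historic: dict, run_date: str) -> tuple:
    """Filter current results against historic ones, and extend the history.

    current:  {sample: {"chr-pos-ref-alt": {"categories": [...], "support_vars": [...]}}}
    historic: {sample: {"chr-pos-ref-alt": {"categories": {category: first_seen_date},
                                            "support_vars": [...]}}}
    Returns (new_results, updated_historic); the inputs are not modified.
    """
    updated = deepcopy(historic)
    new_results: dict = {}
    for sample, variants in current.items():
        seen_for_sample = updated.setdefault(sample, {})
        for var_key, details in variants.items():
            record = seen_for_sample.setdefault(
                var_key, {"categories": {}, "support_vars": []}
            )
            new_cats = [c for c in details["categories"] if c not in record["categories"]]
            new_support = [
                v for v in details.get("support_vars", []) if v not in record["support_vars"]
            ]
            # Tag each newly seen category with the date of this run
            for cat in new_cats:
                record["categories"][cat] = run_date
            record["support_vars"].extend(new_support)
            # Assumption: anything with a new category or partner reaches the report
            if new_cats or new_support:
                new_results.setdefault(sample, {})[var_key] = details
    return new_results, updated
```

Because each category carries its own first-seen date, the same structure could later be used to annotate every variant with that date instead of removing previously seen ones.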
See Below... Again
When we filter current results against historic results, it goes something like this:
- Have we seen this variant in this sample previously? If not, keep this variant in full

Checklist