stefinfection / RUFUS

RUFUS k-mer based genomic variant detection

Investigate variants that RUFUS didn't call from gold-standard dataset - specifically those that had contigs made but no variant calls #13

Open stefinfection opened 9 months ago

stefinfection commented 9 months ago

Steps to reproduce variants not called by RUFUS that are in the gold-standard data set: (Notes from S. Gardiner)

Files to reproduce:

I've been looking at the RUFUS run from the merged BAMs of EA/NC/LL, located at: /scratch/ucgd/lustre-work/marth/u0880188/smaht/hcc1395_seqc2/merged_runs/EA_NC_LL

How to identify variants that do have contigs made, but don't have calls in vcf:

Previously, I ran bedtools coverage on the BAM file that contains the contigs: /scratch/ucgd/lustre-work/marth/u0880188/smaht/hcc1395_seqc2/merged_runs/EA_NC_LL/EA_NC_LL_3_merged_tumor.bam.generator.V2.overlap.hashcount.fastq.bam, restricted to the sites from the validated VCF that RUFUS failed to call.

This produced: /scratch/ucgd/lustre-work/marth/u0880188/smaht/hcc1395_seqc2/merged_runs/EA_NC_LL/contig_depth.txt

I then used a Python script to pull out the locations with a coverage of at least 1. Here is a TSV file of those locations: /scratch/ucgd/lustre-work/marth/u0880188/smaht/hcc1395_seqc2/merged_runs/EA_NC_LL/contig_variants_validated.tsv
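The filtering script itself is not attached here, so the following is only a minimal sketch of what that step could look like. It assumes contig_depth.txt is default `bedtools coverage` output, where the last four columns are the overlap count, bases covered, interval length, and fraction covered; if the file was produced with other options (for example `-d` or `-counts`), the `count_col` index would need to change. The function name and the example rows are hypothetical.

```python
import csv
import io


def sites_with_contig_coverage(lines, count_col=-4, min_count=1):
    """Yield tab-separated rows whose overlap count is at least min_count.

    Assumption: default `bedtools coverage` layout, so the overlap count
    is the fourth column from the end of each row.
    """
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        # Skip blank lines and header/comment lines.
        if not row or row[0].startswith("#"):
            continue
        if int(row[count_col]) >= min_count:
            yield row


# Made-up rows in the assumed layout: chrom, start, end, count, bases, length, fraction.
example = io.StringIO(
    "chr1\t100\t101\t2\t1\t1\t1.0000000\n"
    "chr1\t200\t201\t0\t0\t1\t0.0000000\n"
)
kept = list(sites_with_contig_coverage(example))
```

On the real data, the same function could be fed an open file handle for contig_depth.txt and the kept rows written out as a TSV.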

Looks like there were 1209 variants with contigs but no call.

stefinfection commented 7 months ago

This has been updated to ~200 variants after normalizing/decomposing the RUFUS VCF for comparison against the SeqC2 one. Deprioritizing this for now.
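To illustrate why normalizing/decomposing changes the count, here is a simplified sketch, not the tooling actually used for this comparison (the thread does not say which tool that was). It splits a multiallelic record into one record per ALT allele and trims bases shared by REF and ALT so that equivalent calls get the same key. It does not left-align indels against the reference genome, which dedicated normalizers also do. The function name and example values are hypothetical.

```python
def decompose_and_trim(chrom, pos, ref, alts):
    """Split a multiallelic record and trim shared flanking bases.

    Returns one (chrom, pos, ref, alt) tuple per ALT allele.
    Simplification: no left-alignment against a reference sequence.
    """
    out = []
    for alt in alts:
        r, a, p = ref, alt, pos
        # Trim the shared suffix, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim the shared prefix, shifting the position accordingly.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a = r[1:], a[1:]
            p += 1
        out.append((chrom, p, r, a))
    return out


# Made-up example: one multiallelic call versus a truth-set SNV.
called = set(decompose_and_trim("chr1", 100, "CAG", ["CTG", "C"]))
truth = {("chr1", 101, "A", "T")}
missed = truth - called  # truth variants with no matching call
```

Without the decomposition step, the truth SNV at position 101 would not match the multiallelic record at position 100 and would be counted as missed.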