Open glennhickey opened 5 years ago
Hi @glennhickey, this issue seems to apply at lower coverage as well, where evident read errors in haploids are incorporated and memory/compute is wasted. Strangely, I see that this flag (-m) is now available in vg augment (vg version v1.53.0 "Valmontone") but, when used, it does not seem to have any effect:
Read calls on the un-augmented base graph:
genome 26 >12>16 T C,G 691.417 PASS AT=>12>13>16,>12<14>16,>12<15>16;DP=30 GT:DP:AD:GL:GQ:GP:XD:MAD 0:30:30,0,0:-1.214166,-70.179910,-70.179910:256:-1.098612:28.840000:30
genome 28 >16>19 A G 715.655 PASS AT=>16>17>19,>16<18>19;DP=31 GT:DP:AD:GL:GQ:GP:XD:MAD 0:31:31,0:-1.247709,-72.512312:256:-0.693147:28.840000:31
Read calls on the augmented graph (-m 10 used):
genome 3 >1>2 A AG 531.747 PASS AT=>1>2,>1>1587096>2;DP=25 GT:DP:AD:GL:GQ:GP:XD:MAD 0:25:24,1:-2.173481,-55.047219:256:-0.693147:28.879999:24
genome 7 >3>4 T TG 577.724 PASS AT=>3>4,>3>1587095>4;DP=27 GT:DP:AD:GL:GQ:GP:XD:MAD 0:27:26,1:-2.069554,-59.541008:256:-0.693147:28.879999:26
genome 26 >12>16 T C,G,GT 375.758 PASS AT=>12>13>16,>12<14>16,>12<15>16,>12>1345950>13>16;DP=30 GT:DP:AD:GL:GQ:GP:XD:MAD 0:30:29,0,0,1:-2.035213,-39.310056,-39.310056,-66.403241:256:-1.386294:28.879999:29
genome 28 >16>19 A G 715.655 PASS AT=>16>17>19,>16<18>19;DP=31 GT:DP:AD:GL:GQ:GP:XD:MAD 0:31:31,0:-1.246421,-72.511024:256:-0.693147:28.879999:31
genome 61 >39>41 C T 738.643 PASS AT=>39>40>41,>39>1587106>41;DP=34 GT:DP:AD:GL:GQ:GP:XD:MAD 0:34:33,1:-2.043936,-75.607397:256:-0.693147:36.452053:33
Note that positions 3, 7, and 61 produce variant calls even though each has only 1 read supporting the alt allele (AD = 24,1; 26,1; 33,1).
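For anyone who wants to list such sites quickly, here is a minimal Python sketch. It assumes whitespace-separated records laid out like the ones above (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample) with an AD entry in the FORMAT column. The function name `low_support_alts` and the `min_alt_reads` parameter are hypothetical, not part of vg.

```python
def low_support_alts(record_line, min_alt_reads=2):
    """Return (pos, alt, reads) for alt alleles seen in fewer than min_alt_reads reads.

    Assumes one whitespace-separated VCF-style record with a single sample column.
    Alt alleles with zero supporting reads are skipped.
    """
    fields = record_line.split()
    pos = int(fields[1])
    alts = fields[4].split(",")
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    # AD lists per-allele read support: reference first, then each alt in order.
    ad = [int(x) for x in sample_values[format_keys.index("AD")].split(",")]
    return [(pos, alt, n) for alt, n in zip(alts, ad[1:]) if 0 < n < min_alt_reads]


if __name__ == "__main__":
    record = (
        "genome 3 >1>2 A AG 531.747 PASS AT=>1>2,>1>1587096>2;DP=25 "
        "GT:DP:AD:GL:GQ:GP:XD:MAD "
        "0:25:24,1:-2.173481,-55.047219:256:-0.693147:28.879999:24"
    )
    print(low_support_alts(record))
```

Running this over the augmented-graph output would flag each alt allele that rests on a single read, which is the pattern I would have expected -m 10 to suppress.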
Has this flag been properly implemented or does coverage here mean something different than what I'm thinking?
Thanks!
As discussed in #2474, vg augment doesn't scale well when moving from 30X to 300X coverage. It seems that at that depth, we can expect errors at every base, and the graph gets turned into noise.
I think we need augment to take into account coverage and/or quality. For coverage, we could keep track of a count for each found breakpoint and ignore those that don't meet a cutoff (then adapt later code to not expect every breakpoint in the graph). The output GAM would also need edits. Similar story for quality, where we can check it from the GAM before adding a breakpoint.
It'd be nice not to have to choose a coverage cutoff. Perhaps once all the breakpoints are scanned, it could choose a value from some summary statistics and apply the cutoff then.
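To make the coverage idea concrete, here is a rough Python sketch of the two passes: count how many reads support each breakpoint, then derive a cutoff from the counts and keep only the breakpoints that meet it. This is not how vg implements anything. Representing a breakpoint as a `(node_id, offset)` tuple, the function names, and the "fraction of the best-supported breakpoint, with a floor of 2" rule are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def count_breakpoints(reads):
    """Count supporting reads per breakpoint.

    `reads` is an iterable of per-read breakpoint lists; each breakpoint is a
    hypothetical (node_id, offset) tuple. A read counts once per breakpoint.
    """
    counts = Counter()
    for breakpoints in reads:
        counts.update(set(breakpoints))
    return counts


def choose_cutoff(counts, fraction=0.1, floor=2):
    """Pick a cutoff from summary statistics (an arbitrary illustrative rule).

    Uses a fraction of the best-supported breakpoint's count, never below `floor`.
    """
    if not counts:
        return floor
    return max(floor, math.ceil(fraction * max(counts.values())))


def supported_breakpoints(counts, cutoff):
    """Keep only breakpoints seen in at least `cutoff` reads."""
    return {bp for bp, n in counts.items() if n >= cutoff}


if __name__ == "__main__":
    reads = [
        [(12, 3), (16, 0)],
        [(12, 3)],
        [(12, 3), (40, 7)],
        [(12, 3), (12, 3)],
    ]
    counts = count_breakpoints(reads)
    cutoff = choose_cutoff(counts)
    print(cutoff, supported_breakpoints(counts, cutoff))
```

Any downstream step that assumes every breakpoint made it into the graph (including the edits written to the output GAM) would have to consult the filtered set instead, which is the part that needs the most care.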