iobio / vcf.iobio.io

MIT License

Quality score distribution #40

Open AlistairNWard opened 7 years ago

AlistairNWard commented 7 years ago

Is the quality score distribution set with a defined maximum? I'm looking at a UGP exome and the scale runs from 0-200, but looking at the VCF file I see that the qualities are well into the thousands.

AlistairNWard commented 7 years ago

I reran a couple of times and it always seemed to be set at 0-200:

[Screenshot: 2017-04-13, 11:24:49 AM]

so then I took a random slice of the vcf to see the range of scores, and >200 seems pretty common (at least in this sample):

[Screenshot: 2017-04-13, 11:24:37 AM]
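For readers who want to repeat this check, here is a minimal sketch of pulling the quality scores out of a slice of VCF text and counting how many exceed the 0-200 scale. The only fact it relies on is that QUAL is the sixth tab-separated column of a VCF record, with `.` marking a missing value. The function names `qual_values` and `summarize` and the sample records are made up for illustration; they are not taken from vcf.iobio or from the file discussed here.

```python
def qual_values(vcf_lines):
    """Yield QUAL as a float for each record, skipping headers and missing values."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header line
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the sixth column of a VCF record
        if qual != ".":   # "." means the quality is missing
            yield float(qual)


def summarize(quals, limit=200):
    """Report the range of scores and how many lie above a given scale limit."""
    quals = list(quals)
    if not quals:
        return {"n": 0, "min": None, "max": None, "above_limit": 0}
    return {
        "n": len(quals),
        "min": min(quals),
        "max": max(quals),
        "above_limit": sum(1 for q in quals if q > limit),
    }


# Made-up records, for illustration only.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t100100\t.\tA\tG\t150.5\tPASS\t.",
    "20\t100200\t.\tC\tT\t2310.77\tPASS\t.",
    "20\t100300\t.\tG\tA\t.\tPASS\t.",
    "20\t100400\t.\tT\tC\t845\tPASS\t.",
]
summary = summarize(qual_values(sample))
print(summary)
```

For a real file, the same functions could be fed the decompressed lines of the region of interest instead of the `sample` list.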

tonydisera commented 7 years ago

I looked at the client-side code and I don't see any cutoffs for the quality score. I wonder if the mean for the random slices of the vcf exceeds 200? Would you mind sending me a link to the VCF you are looking at? I'll run the backend service vcfstatsalive to see what kind of quality scores are returned. What chrom is this screen print from?

AlistairNWard commented 7 years ago

Ok, you can use the platinum file:

https://s3.amazonaws.com/iobio/samples/vcf/platinum-exome.vcf.gz

and the region 20:100000-1000000 (so this is a different file, and another random region). These are the qualities here:

AlistairNWard commented 7 years ago

[Screenshot: 2017-04-13, 12:01:17 PM]

tonydisera commented 7 years ago

Good news, Al. Yi's vcfstatsalive program has an argument for setting the upper limit of the quality scores. Right now, we don't pass in this argument from vcf.iobio, so it is using the default of 200. I tried out the argument -Q 1000 and the histogram shows the new upper limit. What would be a good default?

[Screenshot: 2017-04-13, 1:01:14 PM]

tonydisera commented 7 years ago

I've pushed a new commit to production to set the upper limit on quality scores to 1000. This is ready to validate.
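The effect of a fixed upper limit on the histogram can be sketched roughly as follows. This is an illustration only, not vcfstatsalive's code: the thread does not say how vcfstatsalive treats scores above the limit, so counting them separately as overflow is an assumption, and `capped_histogram`, the bin count, and the scores are all hypothetical.

```python
def capped_histogram(quals, upper_limit=200, n_bins=10):
    """Bin scores in [0, upper_limit]; count scores above the limit separately.

    Assumption: out-of-range scores are tallied as overflow rather than drawn.
    """
    width = upper_limit / n_bins
    counts = [0] * n_bins
    overflow = 0
    for q in quals:
        if q > upper_limit:
            overflow += 1
        else:
            # A score exactly equal to the limit falls in the last bin.
            idx = min(int(q // width), n_bins - 1)
            counts[idx] += 1
    return counts, overflow


# Made-up scores, for illustration only.
quals = [35, 120, 199, 450, 845, 2310]

counts_200, overflow_200 = capped_histogram(quals, upper_limit=200)
counts_1000, overflow_1000 = capped_histogram(quals, upper_limit=1000)

print("limit 200: ", counts_200, "above limit:", overflow_200)
print("limit 1000:", counts_1000, "above limit:", overflow_1000)
```

With these made-up scores, half of the values fall outside a 0-200 scale, while raising the limit to 1000 leaves only one value out of range.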

AlistairNWard commented 7 years ago

Ok, I think we need to make this dynamic, and then include the ability to zoom like the read coverage distribution in bam.iobio. Obviously, this is more work, so just upping the default is the best bet for the moment. I have to admit that it seems weird that all the spikes are at 2% and pretty discrete. Is this the real distribution?
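One way a dynamic limit might be chosen is to derive it from the observed scores instead of hard-coding it. The sketch below picks a nearest-rank percentile and rounds it up to a tidy axis value. The function name `dynamic_upper_limit`, the percentile, and the rounding step are assumptions made for illustration; nothing in this thread specifies how vcf.iobio or vcfstatsalive would implement it.

```python
import math


def dynamic_upper_limit(quals, percentile=99.0, round_to=100):
    """Pick an axis limit from the data.

    Uses the nearest-rank percentile of the scores, rounded up to a
    multiple of round_to so the axis ends on a tidy value.
    """
    if not quals:
        raise ValueError("no quality scores to derive a limit from")
    ordered = sorted(quals)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    value = ordered[rank - 1]
    return max(round_to, math.ceil(value / round_to) * round_to)


# Made-up scores, for illustration only.
quals = [35, 120, 199, 450, 845, 2310]

limit_p80 = dynamic_upper_limit(quals, percentile=80)
limit_default = dynamic_upper_limit(quals)
print(limit_p80, limit_default)
```

A percentile below 100 keeps a few extreme scores from stretching the axis, at the cost of leaving those scores off the scale; zooming, as in bam.iobio, would let the user adjust the range afterwards.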