Open evanbiederstedt opened 6 years ago
Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...
I don't think there is a recommended way yet, but in theory we'd put something in vg stats
, or a similar namespace. It sounds like each alignment gets a score/identity, but each read gets a MAPQ (which is computed from the multiple alignments per read), so if you want to count mapped reads you'd want to check the MAPQs.
I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa. Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph.
You might get fewer MAPQ > 0 reads in the VCF graph, as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.
Hi @edawson
Thanks for the quick response.
You might get fewer MAPQ > 0 reads in the VCF graph
So, this is to say that the MAPQ=0 reads are filtered out? With my tests, I the minimum MAPQ I ever saw was 1.
I guess....is there any way I could access these? At the moment, I'm not sure how I'd use the GAM/JSON to access a (non-read) alignment with no MAPQ values and a read with MAPQ=0, right?
as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.
I guess this gets in the theory and purpose of a graph genome, but this is a bit of a conceptual problem, no? If there are reads which align on a reference with some MAPQ score against a somewhat unique sequence, a graph genome could end up dismissing these reads because of the inherent variation within the graph?
Maybe the problem is the MAPQ metric applied for graph genomes?
@edawson
I think the addendum to the question above is, would accessing the MAPQ==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue?
In protobuf, 0 values are not written. So reads with mapq=0 won't have any mapping_quality field when you look at them.
To extract these, you can use jq
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG > mapq0.gam
To find non-zero values, for example 55, you can do something similar
`vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG
mapq55.gam`
Going through JSON is slow, but it allows for fairly general queries of GAM files.
On Mon, May 28, 2018 at 5:16 AM, evanbiederstedt notifications@github.com wrote:
@edawson https://github.com/edawson
I think the addendum to the question above is, would accessing the mapq==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-392470551, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7od9rf8k9YFPi0ydu5ltjaPT_uxvks5t28CKgaJpZM4UMiWe .
Hi @glennhickey
Thanks for the help
For the commands given, i.e.
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG > mapq0.gam
I see an output for
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'
but there is an error with
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG
Error:
[vg view] error: no filename given
Are the above commands correct?
Sorry, I forgot a - at the very end of both examples.
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG - > mapq0.gam vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG
mapq55.gam
On Mon, May 28, 2018 at 12:34 PM, evanbiederstedt notifications@github.com wrote:
Hi @glennhickey https://github.com/glennhickey
Thanks for the help
For the commands given, i.e.
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG > mapq0.gam
I see an output for
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'
but there is an error with
vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG
Error:
[vg view] error: no filename given
Are the above commands correct?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-392566913, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7vb4eF-1C2313HnnB54B3VMoX_z9ks5t3Cb_gaJpZM4UMiWe .
@glennhickey Thanks for this. Ah yes, this makes sense, as you would need to pipe JSON format into jq
@glennhickey @edawson
So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph.
Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):
REFERENCE-ONLY length(json)= 15742494 length(scores) = 15530966 length(identity) = 15530966 length(mapping_quality)=14358612
Using the commands above to check for (mapping_quality | not) values:
length(json)= 1383882 length(scores) =1172354 length(identity) =1172354 length(mapping_quality)=0 ## this makes sense! good!
GRAPH length(json)= 15742494 ### (It's interesting these are the same length...) length(scores) =15523701 length(identity) =15523701 length(mapping_quality)=14280344
Using the commands above to check for (mapping_quality | not) values:
length(json)= 1462150 length(scores) =1243357 length(identity) =1243357 length(mapping_quality)= 0
Attempts at addition:
(1)
reference: length(scores) + length(scores)_mapq0 => 15530966 + 1172354 = 16703320
graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058
That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!
However, I'm not sure if I should celebrate just yet...as @edawson mentioned,
so if you want to count mapped reads you'd want to check the MAPQs.
(2)
reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494
graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249
15742494 is the length of the original GAMs for both. Why would that be?
(3)
reference: length(scores) + length(json)_mapq0 => 15530966 + 1383882 = 16914848
graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851
I'm not sure what this means, but here the value is greater for the GRAPH
(4)
reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966
graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701
Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores
, the value for the reference-only graph is still larger....I'm not sure what this means.
Questions:
(A) How should I interpret each of these values?
(B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be?
Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference?
Some reads map more ambiguously to the linear reference than to the graph. Other reads map more ambiguously to the graph than the linear reference. It is possible that the latter group is larger. We are currently studying this effect and ways to mitigate it, and hope to have some results to publish shortly. Filtering rare variants out of the VCF is one simple approach.
On Mon, May 28, 2018 at 7:10 PM, evanbiederstedt notifications@github.com wrote:
@glennhickey https://github.com/glennhickey @edawson https://github.com/edawson
So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph.
Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):
REFERENCE-ONLY length(json)= 15742494 length(scores) = 15530966 length(identity) = 15530966 length(mapping_quality)=14358612
Using the commands above to check for (mapping_quality | not) values:
length(json)= 1383882 length(scores) =1172354 length(identity) =1172354 length(mapping_quality)=0 ## this makes sense! good!
GRAPH length(json)= 15742494 ### (It's interesting these are the same length...) length(scores) =15523701 length(identity) =15523701 length(mapping_quality)=14280344
Using the commands above to check for (mapping_quality | not) values:
length(json)= 1462150 length(scores) =1243357 length(identity) =1243357 length(mapping_quality)= 0
Attempts at addition:
(1)
reference: length(scores) + length(scores)_mapq0 => 15530966 + 1172354 = 16703320 graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058
That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!
However, I'm not sure if I should celebrate just yet...as @edawson https://github.com/edawson mentioned,
so if you want to count mapped reads you'd want to check the MAPQs.
(2)
reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494 graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249
15742494 is the length of the original GAMs for both. Why would that be?
(3)
reference: length(scores) + length(json)_mapq0 => 15530966 + 1383882 = 16914848 graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851
I'm not sure what this means, but here the value is greater for the GRAPH
(4)
reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966 graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701
Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores, the value for the reference-only graph is still larger....I'm not sure what this means.
Questions:
(A) How should I interpret each of these values?
(B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be?
Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-392615131, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7mo0bgxzjfy9w3zBVrVQ6gq7BkAFks5t3IPcgaJpZM4UMiWe .
Hi @glennhickey
Thanks for the help.
It is possible that the latter group is larger.
I see. I guess this could make sense.
One other thing I don't entirely understand yet: Using the JSON from (mapping_quality | not)
above, there are values for identity
and scores
. Should I conclude that these are MAPQ==0 alignments?
For each case, the size of the json and the size of the identity
fields are different, e.g.
length(json)= 1383882, length(scores) =1172354
That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!
The graph with variation will have more ambiguity in it. This will result in more MAPQ=0 alignments. MAPQ=0 means the top two alignments had the same score.
Look for reads that get MAPQ=0 and score=0. For a short read alignment in vg this means they are unmapped.
In this case you're getting something like 0.4% more, which isn't a big deal IMO if you're mapping more reads correctly overall. It isn't clear that it's wrong to get more or less MAPQ=0 alignments if it's happening because you've learned that your read has more optimal mapping locations.
I think we should go ahead and expose this at the CLI, either in the existing stats
subcommand, the sift
subcommand, or via a new gamstats
subcommand`. Open to suggested on the CLI.
possible examples:
vg stats -G reads.gam -m graph.vg
vg gamstats -m reads.gam
vg sift -S -m reads.gam
Great work on vg, I've just come back to it and mapping is a lot more performant.
Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency.
Also, although I've tried to set a high pc identity threshold 0.95 (script below) I'm still getting lower identity mappings in the result jsons
eg:
{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55
#!/bin/bash
# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020
threads=56
for fq in `ls *.fastq`
do
echo "Aligning FASTQ SE short"
echo $i
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 --min-ident 0.95 -f $fq > $fq.nomulti.gam
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1 --min-ident 0.95 -f $fq > $fq.1multi.gam
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5 --min-ident 0.95 -f $fq > $fq.5multi.gam
nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam
#--surject-to bam
done
for gam in `ls *.gam`
do
echo "Converting gam"
echo $i
nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json
nohup srun -p short -c 1 -t 119 vg index -d $gam.index -N $gam
done
Script 2 ... stats attempt:
#!/bin/bash
# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020
for filename in `ls *.gam`
do
echo $i
nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json
grep identity $filename.json > $filename.json.al.json
echo "Alignment stats"
echo "Total lines (reads): "
wc -l $filename.json
echo "Total lines with identity (mapped reads): "
grep -c identity $filename.json
done
vg stats -a will give you some mapping statistics. You can also filter by identity using vg filter -p -u -f
On Wed, Aug 5, 2020 at 8:16 AM Colin Davenport notifications@github.com wrote:
Great work on vg, I've just come back to it and mapping is a lot more performant.
Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency.
Also, although I've tried to set a high pc identity threshold 0.95 (script below) I'm still getting lower identity mappings in the result jsons
eg:
{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55
!/bin/bash
Run vg view for all vg gam files -> json on SLURM
Colin D, August 2020
threads=56
for fq in
ls *.fastq
do echo "Aligning FASTQ SE short" echo $i nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 --min-ident 0.95 -f $fq > $fq.nomulti.gam nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1 --min-ident 0.95 -f $fq > $fq.1multi.gam nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5 --min-ident 0.95 -f $fq > $fq.5multi.gam nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam #--surject-to bam
done
for gam in
ls *.gam
do echo "Converting gam" echo $i nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json nohup srun -p short -c 1 -t 119 vg index -d $gam.index -N $gam
done
Script 2 ... stats attempt:
!/bin/bash
Run vg view for all vg gam files -> json on SLURM
Colin D, August 2020
for filename in
ls *.gam
do echo $i nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json grep identity $filename.json > $filename.json.al.json echo "Alignment stats" echo "Total lines (reads): " wc -l $filename.json echo "Total lines with identity (mapped reads): " grep -c identity $filename.json
done
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-669159058, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG373RA32EB2RJGPRIVQX3R7FESJANCNFSM4FBSEWPA .
Hi there
This is another clarification question:
(1)
Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...
(2)
I created a non-branching graph against the reference genome with one component per chromosome using the following command:
I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa.
Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph. That's strange, right? We'd expect that, each alignment against the reference-only graph should also be a valid alignment against the reference+VCF graph....yet, the reference-only graph has a higher number of mapped reads (using the measurement of mapped reads in discussed in (1))
Any ideas what could be going on here? Perhaps I'm measuring the total number of mapped reads incorrectly?