vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.08k stars 193 forks source link

How to count the number of mapped reads against a vg? #1701

Open evanbiederstedt opened 6 years ago

evanbiederstedt commented 6 years ago

Hi there

This is another clarification question:

(1)

Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...

(2)

I created a non-branching graph against the reference genome with one component per chromosome using the following command:

$ vg construct -r hg38.fa > output.vg

I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa.

Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph. That's strange, right? We'd expect that, each alignment against the reference-only graph should also be a valid alignment against the reference+VCF graph....yet, the reference-only graph has a higher number of mapped reads (using the measurement of mapped reads in discussed in (1))

Any ideas what could be going on here? Perhaps I'm measuring the total number of mapped reads incorrectly?

edawson commented 6 years ago

Is there a recommended way to count the number of mapped reads against a graph? Currently, I've been parsing the JSONs (which were created from the GAMs). The number of scores and identity values are the same, which is a larger value than the mapping quality values. I'm not sure why that is...

I don't think there is a recommended way yet, but in theory we'd put something in vg stats, or a similar namespace. It sounds like each alignment gets a score/identity, but each read gets a MAPQ (which is computed from the multiple alignments per read), so if you want to count mapped reads you'd want to check the MAPQs.

I map the same set of reads (A) against the "reference-only genome" graph (above) and (B) against a VCF-based graph that uses the same reference genome, hg38.fa. Counterintuitively, it looks like the number of mapped reads decreases when mapping against the VCF-based graph.

You might get fewer MAPQ > 0 reads in the VCF graph, as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.

evanbiederstedt commented 6 years ago

Hi @edawson

Thanks for the quick response.

You might get fewer MAPQ > 0 reads in the VCF graph

So, this is to say that the MAPQ=0 reads are filtered out? With my tests, I the minimum MAPQ I ever saw was 1.

I guess....is there any way I could access these? At the moment, I'm not sure how I'd use the GAM/JSON to access a (non-read) alignment with no MAPQ values and a read with MAPQ=0, right?

as variation may cause a sequence unique in the reference to appear elsewhere in the graph. If a read suddenly aligns to two places, its MAPQ takes a major penalty.

I guess this gets in the theory and purpose of a graph genome, but this is a bit of a conceptual problem, no? If there are reads which align on a reference with some MAPQ score against a somewhat unique sequence, a graph genome could end up dismissing these reads because of the inherent variation within the graph?

Maybe the problem is the MAPQ metric applied for graph genomes?

evanbiederstedt commented 6 years ago

@edawson

I think the addendum to the question above is, would accessing the MAPQ==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue?

glennhickey commented 6 years ago

In protobuf, 0 values are not written. So reads with mapq=0 won't have any mapping_quality field when you look at them.

To extract these, you can use jq

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG > mapq0.gam

To find non-zero values, for example 55, you can do something similar

`vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG

mapq55.gam`

Going through JSON is slow, but it allows for fairly general queries of GAM files.

On Mon, May 28, 2018 at 5:16 AM, evanbiederstedt notifications@github.com wrote:

@edawson https://github.com/edawson

I think the addendum to the question above is, would accessing the mapq==0 reads require hacking on my part in the source code? It might be useful to get the lines in question which would need to be changed. Perhaps this is a feature request to be made in another issue?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-392470551, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7od9rf8k9YFPi0ydu5ltjaPT_uxvks5t28CKgaJpZM4UMiWe .

evanbiederstedt commented 6 years ago

Hi @glennhickey

Thanks for the help

For the commands given, i.e.

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG > mapq0.gam

I see an output for

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'

but there is an error with

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view
-JaG

Error:

[vg view] error: no filename given

Are the above commands correct?

glennhickey commented 6 years ago

Sorry, I forgot a - at the very end of both examples.

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG - > mapq0.gam vg view -a mygam.gam | jq -c 'select(.mapping_quality==55)' | vg view -JaG

On Mon, May 28, 2018 at 12:34 PM, evanbiederstedt notifications@github.com wrote:

Hi @glennhickey https://github.com/glennhickey

Thanks for the help

For the commands given, i.e.

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG > mapq0.gam

I see an output for

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)'

but there is an error with

vg view -a mygam.gam | jq -c 'select(.mapping_quality | not)' | vg view -JaG

Error:

[vg view] error: no filename given

Are the above commands correct?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-392566913, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7vb4eF-1C2313HnnB54B3VMoX_z9ks5t3Cb_gaJpZM4UMiWe .

evanbiederstedt commented 6 years ago

@glennhickey Thanks for this. Ah yes, this makes sense, as you would need to pipe JSON format into jq

evanbiederstedt commented 6 years ago

@glennhickey @edawson

So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph.

Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):

REFERENCE-ONLY length(json)= 15742494 length(scores) = 15530966 length(identity) = 15530966 length(mapping_quality)=14358612

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1383882 length(scores) =1172354 length(identity) =1172354 length(mapping_quality)=0 ## this makes sense! good!

GRAPH length(json)= 15742494 ### (It's interesting these are the same length...) length(scores) =15523701 length(identity) =15523701 length(mapping_quality)=14280344

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1462150 length(scores) =1243357 length(identity) =1243357 length(mapping_quality)= 0

Attempts at addition:

(1)

reference: length(scores) + length(scores)_mapq0 =>  15530966 + 1172354 = 16703320
graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058 

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

However, I'm not sure if I should celebrate just yet...as @edawson mentioned,

so if you want to count mapped reads you'd want to check the MAPQs.

(2)

reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494 
graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249

15742494 is the length of the original GAMs for both. Why would that be?

(3)

reference: length(scores) + length(json)_mapq0 =>  15530966 + 1383882  = 16914848
graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851

I'm not sure what this means, but here the value is greater for the GRAPH

(4)

reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966
graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701

Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores, the value for the reference-only graph is still larger....I'm not sure what this means.

Questions:

(A) How should I interpret each of these values?

(B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be?

Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference?

glennhickey commented 6 years ago

Some reads map more ambiguously to the linear reference than to the graph. Other reads map more ambiguously to the graph than the linear reference. It is possible that the latter group is larger. We are currently studying this effect and ways to mitigate it, and hope to have some results to publish shortly. Filtering rare variants out of the VCF is one simple approach.

On Mon, May 28, 2018 at 7:10 PM, evanbiederstedt notifications@github.com wrote:

@glennhickey https://github.com/glennhickey @edawson https://github.com/edawson

So, I'm still a bit confused what to make of this. As I say, my priors would be that the graph has more reads mapped than the "reference-only" graph.

Here are the numbers I now see, after parsing the JSONs (with the caveat that maybe I made a mistake somewhere):

REFERENCE-ONLY length(json)= 15742494 length(scores) = 15530966 length(identity) = 15530966 length(mapping_quality)=14358612

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1383882 length(scores) =1172354 length(identity) =1172354 length(mapping_quality)=0 ## this makes sense! good!

GRAPH length(json)= 15742494 ### (It's interesting these are the same length...) length(scores) =15523701 length(identity) =15523701 length(mapping_quality)=14280344

Using the commands above to check for (mapping_quality | not) values:

length(json)= 1462150 length(scores) =1243357 length(identity) =1243357 length(mapping_quality)= 0

Attempts at addition:

(1)

reference: length(scores) + length(scores)_mapq0 => 15530966 + 1172354 = 16703320 graph : length(scores) + length(scores)_mapq0 => 15523701 + 1243357 = 16767058

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

However, I'm not sure if I should celebrate just yet...as @edawson https://github.com/edawson mentioned,

so if you want to count mapped reads you'd want to check the MAPQs.

(2)

reference: length(mapping) + length(json)_mapq0 => 14358612+ 1383882 = 15742494 graph : length(mapping) + length(json)_mapq0 => 14280344 + 1462150 = 1574249

15742494 is the length of the original GAMs for both. Why would that be?

(3)

reference: length(scores) + length(json)_mapq0 => 15530966 + 1383882 = 16914848 graph : length(scores) + length(json)_mapq0 => 15523701 + 1462150 = 16985851

I'm not sure what this means, but here the value is greater for the GRAPH

(4)

reference: length(mapping) + length(scores)_mapq0 => 14358612+ 1172354 = 15530966 graph : length(mapping) + length(scores)_mapq0 => 14280344 + 1243357 = 15523701

Here, the number of mapped reads with MAPQ>=1 is greater in the reference-only graph versus the VCF graph. Using the length of the JSONs with values for scores, the value for the reference-only graph is still larger....I'm not sure what this means.

Questions:

(A) How should I interpret each of these values?

(B) What would be the recommended way to measure the total number of mapped reads in this case? If the reference-only graph still has more reads mapped than the VCF graph...why would this be?

Could the VCF affect things? For example, I think the VCF did not have information on the sex chromosomes or the mitochondrial chromosome. Would that possibly make a difference?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-392615131, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7mo0bgxzjfy9w3zBVrVQ6gq7BkAFks5t3IPcgaJpZM4UMiWe .

evanbiederstedt commented 6 years ago

Hi @glennhickey

Thanks for the help.

It is possible that the latter group is larger.

I see. I guess this could make sense.

One other thing I don't entirely understand yet: Using the JSON from (mapping_quality | not) above, there are values for identity and scores. Should I conclude that these are MAPQ==0 alignments?

For each case, the size of the json and the size of the identity fields are different, e.g.

 length(json)= 1383882, length(scores) =1172354
ekg commented 6 years ago

That means yes, there are perhaps more MAPQ==0 reads on the graph. 16767058 > 16703320!

The graph with variation will have more ambiguity in it. This will result in more MAPQ=0 alignments. MAPQ=0 means the top two alignments had the same score.

Look for reads that get MAPQ=0 and score=0. For a short read alignment in vg this means they are unmapped.

In this case you're getting something like 0.4% more, which isn't a big deal IMO if you're mapping more reads correctly overall. It isn't clear that it's wrong to get more or less MAPQ=0 alignments if it's happening because you've learned that your read has more optimal mapping locations.

edawson commented 6 years ago

I think we should go ahead and expose this at the CLI, either in the existing stats subcommand, the sift subcommand, or via a new gamstats subcommand`. Open to suggested on the CLI.

possible examples: vg stats -G reads.gam -m graph.vg
vg gamstats -m reads.gam vg sift -S -m reads.gam

colindaven commented 3 years ago

Great work on vg, I've just come back to it and mapping is a lot more performant.

Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency.

Also, although I've tried to set a high pc identity threshold 0.95 (script below) I'm still getting lower identity mappings in the result jsons

eg:

{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55

#!/bin/bash

# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020

threads=56

for fq in `ls *.fastq`

        do
                echo "Aligning FASTQ SE short"
                echo $i
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60       --min-ident 0.95 -f $fq > $fq.nomulti.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1  --min-ident 0.95 -f $fq > $fq.1multi.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5  --min-ident 0.95 -f $fq > $fq.5multi.gam
                nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam

                #--surject-to bam 
done

for gam in `ls *.gam`

        do
                echo "Converting gam"
                echo $i
                nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json
                nohup srun -p short -c 1 -t 119 vg index -d  $gam.index -N $gam

done

Script 2 ... stats attempt:

#!/bin/bash

# Run vg view for all vg gam files -> json on SLURM
# Colin D, August 2020

for filename in `ls *.gam`

        do
                echo $i
                nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json
                grep identity $filename.json > $filename.json.al.json

                echo "Alignment stats"
                echo "Total lines (reads): " 
                wc -l $filename.json
                echo "Total lines with identity (mapped reads): " 
                grep -c identity $filename.json
done
glennhickey commented 3 years ago

vg stats -a will give you some mapping statistics. You can also filter by identity using vg filter -p -u -f

On Wed, Aug 5, 2020 at 8:16 AM Colin Davenport notifications@github.com wrote:

Great work on vg, I've just come back to it and mapping is a lot more performant.

Has this stats/report functionality been provided yet for GAM files ? I'm investigating multimapping against references from about 50 highly related bacteria (metagenome), and am struggling to get my head around stats, multimapping and mapping efficiency.

Also, although I've tried to set a high pc identity threshold 0.95 (script below) I'm still getting lower identity mappings in the result jsons

eg:

{"identity": 0.2638888888888889, "mapping_quality": 41, "name": "NB55

!/bin/bash

Run vg view for all vg gam files -> json on SLURM

Colin D, August 2020

threads=56

for fq in ls *.fastq

    do
            echo "Aligning FASTQ SE short"
            echo $i
            nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60       --min-ident 0.95 -f $fq > $fq.nomulti.gam
            nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 1  --min-ident 0.95 -f $fq > $fq.1multi.gam
            nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 5  --min-ident 0.95 -f $fq > $fq.5multi.gam
            nohup srun -p short -c $threads -t 119 vg map -x graph.xg -g graph.gcsa -m short -t $threads --min-chain 60 -M 10 --min-ident 0.95 -f $fq > $fq.10multi.gam

            #--surject-to bam

done

for gam in ls *.gam

    do
            echo "Converting gam"
            echo $i
            nohup srun -p short -c 1 -t 119 vg view -a $gam > $gam.json
            nohup srun -p short -c 1 -t 119 vg index -d  $gam.index -N $gam

done

Script 2 ... stats attempt:

!/bin/bash

Run vg view for all vg gam files -> json on SLURM

Colin D, August 2020

for filename in ls *.gam

    do
            echo $i
            nohup srun -p short -c 4 -t 119 vg view -a $filename > $filename.json
            grep identity $filename.json > $filename.json.al.json

            echo "Alignment stats"
            echo "Total lines (reads): "
            wc -l $filename.json
            echo "Total lines with identity (mapped reads): "
            grep -c identity $filename.json

done

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1701#issuecomment-669159058, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG373RA32EB2RJGPRIVQX3R7FESJANCNFSM4FBSEWPA .