Open zwh82 opened 1 year ago
The set-the-reference-and-then-surject workflow you did at the end should be able to get you your reads in the space of any sample you want.
get the reads information about which paths mapped to and which position (nodes id) in the paths.
You want the positions of the reads on all the paths in the GFA, for all the samples? And not the reads in the space of any one sample?
You could surject to each sample one at a time and use those.
You could also make a GBZ where all your samples are reference (--set-tag "reference_samples=GCF_000271965.2 GCF_12345 GCF_789"), and then use vg annotate -x allref.gbz -a reads.gam -p >annotated.gam to tag reads in the GAM with their positions along reference paths, in the refpos array in each GAM record.
You can then read the GAM binary format, or dump it to JSON with vg view -aj annotated.gam. To filter down to just read names and lists of reference position structures (which will have a name, an offset, and an is_reverse), you would use something like vg view -aj annotated.gam | jq -c '[.name, (.refpos // [])]'.
But that won't get you positions on the path in terms of node IDs, just in terms of offsets from the beginning of the path.
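For anyone who would rather post-process the JSON in Python than in jq, here is a minimal sketch of the same name/refpos extraction. It assumes the usual Protobuf JSON conventions (fields left at their default value are omitted, and 64-bit integers such as offset are written as strings), and the function name read_positions and the sample path names are made up for illustration.

```python
import json


def read_positions(json_lines):
    """Return [(read_name, [(path_name, offset, is_reverse), ...])] from GAM JSON lines."""
    results = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        positions = []
        for pos in record.get("refpos", []):
            # Assumption: omitted fields mean the default (offset 0, forward strand),
            # and offsets may arrive as strings, so convert them explicitly.
            positions.append((
                pos.get("name", ""),
                int(pos.get("offset", 0)),
                bool(pos.get("is_reverse", False)),
            ))
        results.append((record.get("name", ""), positions))
    return results


# Hypothetical records in the shape that vg view -aj annotated.gam is expected to emit.
sample = [
    '{"name": "read1", "refpos": [{"name": "pathA", "offset": "120"},'
    ' {"name": "pathB", "offset": "98", "is_reverse": true}]}',
    '{"name": "read2"}',
]

for read_name, positions in read_positions(sample):
    print(read_name, positions)
```

Because read_positions only iterates over lines, it can also be handed sys.stdin when the script sits at the end of a vg view -aj pipe.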
If you want your reads mapped in terms of the original GFA, without any of vg's node splitting, you can use vg giraffe --named-coordinates -o GAF to get your output in the text-based GAF format, in terms of the original GFA node names. This doesn't tell you where the reads are on the paths at all, but if you have the GAF and you have the GFA you can compare the node names that the GAF reads visit against the node names that each GFA path visits, and find the nodes at which each read intersects with each path it touches. Though this will be in terms of original GFA node names, and not vg node IDs.
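The GAF-versus-GFA comparison described above could be sketched roughly like this in Python. It is only an illustration under stated assumptions: paths are read from GFA P lines (and W lines, named here as sample#haplotype#sequence), the read's walk is taken from the sixth GAF column in the >name<name form, and the helper names gfa_path_nodes, gaf_read_nodes and intersect are hypothetical. It reports which nodes each read shares with each path, not offsets along the path.

```python
import re
from collections import defaultdict

# One oriented step in a GAF path or GFA W-line walk, e.g. ">s1" or "<s2".
STEP = re.compile(r"([<>])([^<>\s]+)")


def gfa_path_nodes(gfa_lines):
    """Map each path name to the list of segment names it visits."""
    paths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "P" and len(fields) >= 3:
            # P <name> <seg+,seg-,...> <overlaps>: drop the trailing orientation sign.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",") if step]
        elif fields[0] == "W" and len(fields) >= 7:
            # W <sample> <haplotype> <sequence> <start> <end> <walk>
            paths["#".join(fields[1:4])] = [m.group(2) for m in STEP.finditer(fields[6])]
    return paths


def gaf_read_nodes(gaf_lines):
    """Return [(read_name, [segment names])] for every mapped GAF record."""
    reads = []
    for line in gaf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == "*":
            continue  # malformed or unmapped record
        reads.append((fields[0], [m.group(2) for m in STEP.finditer(fields[5])]))
    return reads


def intersect(paths, reads):
    """For each read, list the nodes it shares with every path it touches."""
    node_to_paths = defaultdict(set)
    for path_name, nodes in paths.items():
        for node in nodes:
            node_to_paths[node].add(path_name)
    hits = []
    for read_name, nodes in reads:
        per_path = defaultdict(list)
        for node in nodes:
            for path_name in node_to_paths.get(node, ()):
                per_path[path_name].append(node)
        hits.append((read_name, dict(per_path)))
    return hits


# Tiny made-up graph and alignments, only to show the shape of the result.
gfa = [
    "S\ts1\tACGT", "S\ts2\tGG", "S\ts3\tTTA", "S\ts4\tC",
    "P\tgenomeA\ts1+,s2+,s4+\t*",
    "P\tgenomeB\ts1+,s3+,s4+\t*",
]
gaf = [
    "read1\t6\t0\t6\t+\t>s1>s2\t6\t0\t6\t6\t6\t60",
    "read2\t4\t0\t4\t+\t<s4<s3\t4\t0\t4\t4\t4\t60",
]
for read_name, per_path in intersect(gfa_path_nodes(gfa), gaf_read_nodes(gaf)):
    print(read_name, per_path)
```

Building the node-to-paths index once keeps the lookup cheap even when many paths visit the same node, which is likely in a graph merged from many genomes.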
Many thanks. @adamnovak
I have got the read positions on the paths in terms of the original GFA node IDs with vg giraffe --named-coordinates -o GAF. But when I set all samples as reference and used vg annotate -x allref.gbz -a reads.gam -p >annotated.gam, it reported an error.
all_ref_annotated.log
But as you said, I can also find the paths that the reads come from by searching the GAF against the GFA.
By the way, is many GFA files (each small, but around a thousand) -> one big .vg file -> vg giraffe mapping a right way to do this?
I think the problem might be that if you use --named-coordinates, the vg tools don't know how to understand the result and read it back: https://github.com/vgteam/vg/issues/4164. Your log looks pretty consistent with it trying to get the length of a node that doesn't exist, which is what it would do if it tried to interpret named coordinate aligned reads as node ID aligned reads.
If you are happy with the node names in 2_genomes_merge.gfa, then I think you are fine using the process you showed to combine GFAs. But I'm not sure that vg combine handles string node names in a good way when reading GFA, and the .vg format doesn't actually hold the original GFA node names.
How big are your input genomes? If you made one FASTA file from them how big is it?
An alternative is to map with bwa mem and then use gfainject to get GAF mappings. Then you'll have the reference position for each alignment and its graph position in the nodes traversed by the GAF record. https://github.com/chfi/gfainject
Thanks. @adamnovak
I think the problem might be that if you use --named-coordinates, the vg tools don't know how to understand the result and read it back: https://github.com/vgteam/vg/issues/4164. Your log looks pretty consistent with it trying to get the length of a node that doesn't exist, which is what it would do if it tried to interpret named coordinate aligned reads as node ID aligned reads.
I know where the problem is. But I note that I can change the name to node id following #4148, and now the command vg annotate -x allref.gbz -a reads.gam -p >annotated.gam works. Is this a right way to do it? I find that every read matches several paths; do the reads fully match them?
But I'm not sure that vg combine handles string node names in a good way when reading GFA, and the .vg format doesn't actually hold the original GFA node names.
I don't quite understand what you mean. Do you mean that merging GFAs with vg combine will disrupt the node ID order and renumber the nodes? When I look at the big GFA generated with vg combine and vg convert, I find that it merged the GFAs incrementally by node ID, and the order and node IDs are right. By the way, converting GFA to .vg couldn't preserve the original order in the old vg version 1.40, but now it can. It seems that this is because the new version of vg uses the PackedGraph format as the .vg format instead of the original Protobuf format.
@ekg My work focuses on metagenomes. Most strain genomes are only a few Mbp in size. In my small test, merging them all into one FASTA does not give a big file. I actually have genome data for about 30,000 strains, but I haven't yet decided on the selection criteria for the data. Your method may be a good alternative; it depends on quality, speed and memory.
But I note that I can change the name to node id from https://github.com/vgteam/vg/issues/4148 , now the command vg annotate -x allref.gbz -a reads.gam -p >annotated.gam can work. Is it a right way? I find every read matches several paths, and are they fully match?
I think that this is indeed a right way to do it. It looks like several paths visit the part of the graph where these reads are, and the reads are getting coordinates along all of them. The read probably won't be a "full match" for any of the paths; it is going to differ from the sequences along those paths in some ways. But it will be at about those positions along those paths.
When I look at the big GFA generated with vg combine and vg convert, I find it merged GFAs only incrementally by node IDs and the order and node IDs is right.
It could be that we always end up re-numbering the GFAs' nodes so that they are numbered consecutively in the same order as they were in the original GFAs, and in the same order as you passed the GFAs on the command line. I don't think that we promise that we will do it like that, and I don't really encourage depending on it. Unless we decide to enforce and test for that behavior, it might change again in a new version of vg.
If you want it to keep working forever, we need to add a test that checks for that behavior to the vg combine tests.
Indeed, I want it to keep working. I think I need to trace the original nodes in subsequent mapping and analysis.
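Until such a test exists, one way to trace nodes without silently depending on the renumbering is to check it on your own data. The sketch below is only an illustration: it assumes the combined graph has been converted back to GFA with numeric segment names, that no nodes were split along the way, and that the renumbering follows the file order of the S lines in each input. The names segments and trace_nodes are hypothetical. It compares sequences node by node and refuses to build a mapping if anything disagrees.

```python
def segments(gfa_lines):
    """Return [(segment_name, sequence)] in file order from GFA S lines."""
    found = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            found.append((fields[1], fields[2]))
    return found


def trace_nodes(original_gfas, combined_gfa):
    """Map combined node ID -> (input index, original segment name).

    Raises ValueError when the consecutive-renumbering assumption does not hold.
    """
    expected = [
        (index, name, sequence)
        for index, gfa in enumerate(original_gfas)
        for name, sequence in segments(gfa)
    ]
    combined = sorted(segments(combined_gfa), key=lambda seg: int(seg[0]))
    if len(expected) != len(combined):
        raise ValueError("segment counts differ: %d vs %d" % (len(expected), len(combined)))
    mapping = {}
    for (index, name, sequence), (new_id, new_sequence) in zip(expected, combined):
        if sequence != new_sequence:
            raise ValueError("sequence mismatch at combined node %s" % new_id)
        mapping[int(new_id)] = (index, name)
    return mapping


# Two made-up input graphs and the combined graph one would hope to get back.
first = ["S\t1\tACGT", "S\t2\tGG", "L\t1\t+\t2\t+\t0M"]
second = ["S\t1\tTTA", "S\t2\tC", "S\t3\tGATTACA"]
combined = ["S\t1\tACGT", "S\t2\tGG", "S\t3\tTTA", "S\t4\tC", "S\t5\tGATTACA"]
print(trace_nodes([first, second], combined))
```

A sequence comparison cannot tell apart nodes that carry identical sequences, so this is a sanity check rather than proof; it mainly catches the case where a future vg version changes the ordering.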
Hi,
I recently ran some similar tests. Could you please share the complete commands used to obtain the results shown in the figure?
Best Regards, Chens
@Sh1ne111 I didn't go with this approach in the end, so I didn't explore it in depth. Here is the example I tried; I hope it helps.
vg giraffe -Z gfa_data/2_merge_index.giraffe.gbz -m gfa_data/2_merge_index.min -d gfa_data/2_merge_index.dist -i -f anonymous_reads.fq.gz -t 30 -p --named-coordinates > gfa_mapped.gam
vg view -aj gfa_mapped.gam | jq '.path.mapping = [(.path.mapping // [])[] | (.position.node_id = .position.name) | (.position.name = "")]' | vg view -JGa - > gfa_coordinates.gam
vg gbwt -Z --set-tag "reference_samples=GCF_000271965.2 GCF_026428315.1 GCF_006402735.1 GCF_016698685.1 GCF_000226035.2 GCF_025985505.1 GCF_006401635.1 GCF_000495455.2 GCF_006401215.1 GCF_020827275.1 GCF_006400955.1 GCF_000281215.1 GCF_022354785.1 GCF_027595085.1 GCF_018141045.1 GCF_006402015.1 GCF_003671955.1" --gbz-format -g all_ref.gbz 2_merge_index.giraffe.gbz
vg annotate -x gfa_data/all_ref.gbz -a gfa_coordinates.gam -p > all_ref_annotated.gam
vg validate gfa_data/2_genomes_merge.gfa -a gfa_coordinates.gam
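For readers who find the jq filter in the second command hard to follow, here is a rough Python equivalent of that one step. It is a sketch, not part of the original workflow: the function name names_to_node_ids is made up, and it assumes, as the workaround itself seems to, that the GFA segment names are numeric so they can stand in for vg node IDs.

```python
import json


def names_to_node_ids(record):
    """Copy each mapping's position.name into position.node_id and blank the name."""
    path = record.setdefault("path", {})
    mappings = path.get("mapping") or []
    for mapping in mappings:
        position = mapping.setdefault("position", {})
        # Assumption: the name is a numeric string that can be parsed back as a node ID.
        position["node_id"] = position.get("name")
        position["name"] = ""
    path["mapping"] = mappings
    return record


# A hypothetical record in the shape produced with --named-coordinates.
line = '{"name": "read1", "path": {"mapping": [{"position": {"name": "12", "offset": "3"}}]}}'
print(json.dumps(names_to_node_ids(json.loads(line))))
```

Each output line could then be piped into vg view -JGa - exactly as in the command above.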
Thanks 👍
Hi,
I want to map short reads to a large GFA graph and get, for each read, which paths it mapped to and at which positions (node IDs) in those paths. The big GFA graph merges GFA graphs from different species (bacteria) built by PGGB.
I have tried this on a simple dataset. I have mapped reads to the GFA graph with vg giraffe, but I have no way to get all of the mapping information I mentioned. When I use vg surject, it seems to work only with a reference path, and I only get the read information for that reference path. Without a reference path, it reports an error. In addition, I can't find a way to get the read positions in the GFA graph. Here is some of the code I tried.
Thank you in advance!