How to get all information of reads mapped to all paths instead of a reference path?

zwh82 commented 1 year ago

Hi,

I want to map short reads to a large gfa graph and get the reads information about which paths mapped to and which postion (nodes id) in the paths. The big gfa graph merges the gfa graphs from different species (bacteria) built by PGGB.

I have used a simple data to try. I have mapped reads to gfa graph with vg giraffe. But I have no way to get the all mapping reads information as I memtioned. When I use vg surject, it seems that it only can work with a reference path and I only get the reads information from the reference path. Without no the reference path, it reported a error. In additon, I don't find the way to get the read positons in the gfa graph.

Here are some of the codes I tried.

vg combine -p 34_merged.vg 303_merged.vg > 2_genomes_merge.vg
vg convert -f 2_genomes_merge.vg > 2_genomes_merge.gfa
vg convert 2_genomes_merge.vg -x > 2_genomes_merge.xg

vg autoindex -p 2_merge_index -w giraffe -g 2_genomes_merge.gfa -t 30
vg giraffe -Z gfa_data/2_merge_index.giraffe.gbz -m gfa_data/2_merge_index.min -d gfa_data/2_merge_index.dist -i -f camisim/2_genome_simulate_result/2023.11.13_20.55.14_sample_0/reads/anonymous_reads.fq.gz -t 30 -p > mapped.gam
# vg paths -L -x gfa_data/2_genomes_merge.xg > ref_paths.txt

# error
# warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.
# error:[vg::get_sequence_dictionary] No reference or non-alt-allele generic paths available in the graph!
vg surject -A -s -i -x gfa_data/2_merge_index.giraffe.gbz mapped.gam > surject_mapped.sam

vg gbwt -Z --set-tag "reference_samples=GCF_000271965.2" --gbz-format -g ref.gbz 2_merge_index.giraffe.gbz
# warning[vg::Watchdog]: Thread 63 has been checked in for 10 seconds processing: S0R211772/1, S0R211772/2
# warning[vg::Watchdog]: Thread 63 finally checked out after 18 seconds and 759080 kb memory growth processing: S0R211772/1, S0R211772/2
# warning[vg::Watchdog]: Thread 2 has been checked in for 10 seconds processing: S0R1300020/1, S0R1300020/2
# warning[vg::Watchdog]: Thread 113 has been checked in for 10 seconds processing: S0R1842030/1, S0R1842030/2
# warning[vg::Watchdog]: Thread 113 finally checked out after 15 seconds and 97200 kb memory growth processing: S0R1842030/1, S0R1842030/2
# warning[vg::Watchdog]: Thread 22 has been checked in for 10 seconds processing: S0R2752568/1, S0R2752568/2
# warning[vg::Watchdog]: Thread 22 finally checked out after 17 seconds and 22672 kb memory growth processing: S0R2752568/1, S0R2752568/2
# warning[vg::Watchdog]: Thread 2 finally checked out after 68 seconds and 290656 kb memory growth processing: S0R1300020/1, S0R1300020/2

vg surject -A -s -i -x gfa_data/ref.gbz mapped.gam > ref_mapped.sam
vg stats -a mapped.gam
# Total alignments: 6666608
# Total primary: 6666608
# Total secondary: 0
# Total aligned: 6666608
# Total perfect: 4449643
# Total gapless (softclips allowed): 6664056
# Total paired: 6666608
# Total properly paired: 6662682
# Alignment score: mean 157.932, median 160, stdev 3.28823, max 160 (4449643 reads)
# Mapping quality: mean 59.362, median 60, stdev 5.72531, max 60 (6571874 reads)
# Insertions: 740 bp in 739 read events
# Deletions: 1842 bp in 1841 read events
# Substitutions: 2754187 bp in 2754187 read events
# Softclips: 610 bp in 101 read events
# Total time: 2013.28 seconds
# Speed: 3311.32 reads/second

Thank you in advance!

adamnovak commented 1 year ago

The set-the-reference-and-then-surject workflow you did at the end should be able to get you your reads in the space of any sample you want.

get the reads information about which paths mapped to and which postion (nodes id) in the paths.

You want the positions of the reads on all the paths in the GFA, for all the samples? And not the reads in the space of any one sample?

You could surject to each sample one at a time and use those.

You could also make a GBZ where all your samples are reference (--set-tag "reference_samples=GCF_000271965.2 GCF_12345 GCF_789"), and then use vg annotate -x allref.gbz -a reads.gam -p >annotated.gam to tag reads in the GAM with their positions along reference paths, in the refpos array in each GAM record.

You can then read the GAM binary format, or dump it to JSON with vg view -aj annotated.gam. To filter down to just read names and lists of reference position structures (which will have a name, an offset, and an is_reverse), you would use something like vg view -aj annotated.gam | jq -c '[.name, (.refpos // [])]'.

But that won't get you positions on the path in terms of node IDs, just in terms of offsets from the beginning of the path.

If you want your reads mapped in terms of the original GFA, without any of vg's node splitting, you can use vg giraffe --named-coordinates -o GAF to get your output in the text-based GAF format, in terms of the original GFA node names. This doesn't tell you where the reads are on the paths at all, but if you have the GAF and you have the GFA you can compare the node names that the GAF reads visit against the node names that each GFA path visits, and find the nodes at which each read intersects with each path it touches. Though this will be in terms of original GFA node names, and not vg node IDs.

zwh82 commented 1 year ago

Many thanks. @adamnovak I have got reads position on the path in terms of original GFA node IDs with vg giraffe --named-coordinates -o GAF. But when I set all samples as reference and used vg annotate -x allref.gbz -a reads.gam -p >annotated.gam , it reported error. all_ref_annotated.log But as you said, I can also get the paths where the reads from though searching with GAF and GFA.

By the way, many gfa files(small each, but around a thousand) -> a big .vg file -> vg giraffe mapping is a right way?

adamnovak commented 1 year ago

I think the problem might be that if you use --named-coordinates, the vg tools don't know how to understand the result and read it back: https://github.com/vgteam/vg/issues/4164. Your log looks pretty consistent with it trying to get the length of a node that doesn't exist, which it what it would do if it tried to interpret named coordinate aligned reads as node ID aligned reads.

If you are happy with the node names in 2_genomes_merge.gfa, then I think you are fine using the process you showed to combine GFAs. But I'm not sure that vg combine handles string node names in a good way when reading GFA, and the .vg format doesn't actually hold the original GFA node names.

ekg commented 1 year ago

How big are your input genomes? If you made one FASTA file from them how big is it?

On Thu, Nov 16, 2023, 10:45 Adam Novak @.***> wrote:

I think the problem might be that if you use --named-coordinates, the vg tools don't know how to understand the result and read it back: #4164 https://github.com/vgteam/vg/issues/4164. Your log looks pretty consistent with it trying to get the length of a node that doesn't exist, which it what it would do if it tried to interpret named coordinate aligned reads as node ID aligned reads.

If you are happy with the node names in 2_genomes_merge.gfa, then I think you are fine using the process you showed to combine GFAs. But I'm not sure that vg combine handles string node names in a good way when reading GFA, and the .vg format doesn't actually hold the original GFA node names.

— Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/4158#issuecomment-1814829707, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOIVWPWX5L5LQYUAGLYEY7K3AVCNFSM6AAAAAA7JLBWICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJUHAZDSNZQG4 . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ekg commented 1 year ago

An alternative is to map with bwa mem and then use gfainject to get GAF mappings. Then you'll have the reference position for each alignment and it's graph position in nodes traversed by the GAF record. https://github.com/chfi/gfainject

zwh82 commented 1 year ago

Thanks. @adamnovak

I think the problem might be that if you use --named-coordinates, the vg tools don't know how to understand the result and read it back: https://github.com/vgteam/vg/issues/4164. Your log looks pretty consistent with it trying to get the length of a node that doesn't exist, which it what it would do if it tried to interpret named coordinate aligned reads as node ID aligned reads.

I know where the problem is. But I note that I can change the name to node id from #4148 , now the command vg annotate -x allref.gbz -a reads.gam -p >annotated.gam can work. Is it a right way? I find every read matches several paths, and are they fully match?

But I'm not sure that vg combine handles string node names in a good way when reading GFA, and the .vg format doesn't actually hold the original GFA node names.

I don't quite understand what you mean. You mean that merging GFAs with vg combine will disrupt node IDs order and reorder? When I look at the big GFA generated with vg combine and vg convert, I find it merged GFAs only incrementally by node IDs and the order and node IDs is right. By the way, the gfa convert to .vg cann't presever the original order in old vg version 1.40, but now it can. It seems that because the new version vg uses PackedGraph format as .vg format instead of original Protobuf format.

@ekg My work is focus on metagenomes. Most stain genome is only a few Mbp in size. In my small test, merging all to a FASTA is not a big file. Actually I have the genome data of about 30,000 strains, but now I don't really determine selection criteria for data. Maybe your method is a good alternative, it depends on quality, speed and memory.

adamnovak commented 11 months ago

But I note that I can change the name to node id from https://github.com/vgteam/vg/issues/4148 , now the command vg annotate -x allref.gbz -a reads.gam -p >annotated.gam can work. Is it a right way? I find every read matches several paths, and are they fully match?

I think that this is indeed a right way to do it. It looks like several paths visit the part of the graph where these reads are, and the reads are getting coordinates along all of them. The read probably won't be a "full match" for any of the paths; it is going to differ from the sequences along those paths in some ways. But it will be at about those positions along those paths.

When I look at the big GFA generated with vg combine and vg convert, I find it merged GFAs only incrementally by node IDs and the order and node IDs is right.

It could be that we always end up re-numbering the GFAs' nodes so that they are numbered consecutively in the same order as they were in the original GFAs, and in the same order as you passed the GFAs on the command line. I don't think that we promise that we will do it like that, and I don't really encourage depending on it. Unless we decide to enforce and test for that behavior, it might change again in a new version of vg.

If you want it to keep working forever, we need to add a test that checks for that behavior to the vg combine tests.

zwh82 commented 11 months ago

Indeed, I want it to keep work. I think I need to trace the original nodes in subsequent mapping and analysis.

Sh1ne111 commented 10 months ago

Thanks. @adamnovak

I think the problem might be that if you use --named-coordinates, the vg tools don't know how to understand the result and read it back: #4164. Your log looks pretty consistent with it trying to get the length of a node that doesn't exist, which it what it would do if it tried to interpret named coordinate aligned reads as node ID aligned reads.

I know where the problem is. But I note that I can change the name to node id from #4148 , now the command vg annotate -x allref.gbz -a reads.gam -p >annotated.gam can work. Is it a right way? I find every read matches several paths, and are they fully match?

But I'm not sure that vg combine handles string node names in a good way when reading GFA, and the .vg format doesn't actually hold the original GFA node names.

I don't quite understand what you mean. You mean that merging GFAs with vg combine will disrupt node IDs order and reorder? When I look at the big GFA generated with vg combine and vg convert, I find it merged GFAs only incrementally by node IDs and the order and node IDs is right. By the way, the gfa convert to .vg cann't presever the original order in old vg version 1.40, but now it can. It seems that because the new version vg uses PackedGraph format as .vg format instead of original Protobuf format.

@ekg My work is focus on metagenomes. Most stain genome is only a few Mbp in size. In my small test, merging all to a FASTA is not a big file. Actually I have the genome data of about 30,000 strains, but now I don't really determine selection criteria for data. Maybe your method is a good alternative, it depends on quality, speed and memory.

Hi,

Recently, I conducted some similar testing. Could you please share the complete commands to obtain the results shown in the figure ?

Best Regards, Chens

zwh82 commented 10 months ago

@Sh1ne111 I didn't choose this approach later, so I didn't explore it in depth. Here is my example I tried. I hope it will help you.

vg giraffe -Z gfa_data/2_merge_index.giraffe.gbz -m gfa_data/2_merge_index.min -d gfa_data/2_merge_index.dist -i -f anonymous_reads.fq.gz -t 30 -p --named-coordinates > gfa_mapped.gam
vg view -aj gfa_mapped.gam | jq '.path.mapping = [(.path.mapping // [])[] | (.position.node_id = .position.name) | (.position.name = "")]' | vg view -JGa - > gfa_coordinates.gam
vg gbwt -Z --set-tag "reference_samples=GCF_000271965.2 GCF_026428315.1 GCF_006402735.1 GCF_016698685.1 GCF_000226035.2 GCF_025985505.1 GCF_006401635.1 GCF_000495455.2 GCF_006401215.1 GCF_020827275.1 GCF_006400955.1 GCF_000281215.1 GCF_022354785.1 GCF_027595085.1 GCF_018141045.1 GCF_006402015.1 GCF_003671955.1" --gbz-format -g all_ref.gbz 2_merge_index.giraffe.gbz
vg annotate -x gfa_data/all_ref.gbz -a gfa_coordinates.gam -p > all_ref_annotated.gam
vg validate gfa_data/2_genomes_merge.gfa -a gfa_coordinates.gam

Sh1ne111 commented 10 months ago

Thanks 👍

vgteam / vg

How to get all information of reads mapped to all paths instead of a reference path? #4158