Open ivargr opened 5 years ago

Hi!

Can CHOP be used to do the actual read mapping, or does it just create a fasta file that I will need to index with e.g. BWA and then map reads to? If so, I assume my final alignments will be relative to the paths/sequences in the fasta file produced by CHOP. How do I convert these positions/coordinates to e.g. linear reference genome coordinates?

Also, do you know how the run time scales to bigger graphs? I'm interested in using CHOP on a graph created for the whole human genome with all variants from 1000 genomes -- do you think this would be feasible? I see you have only been testing CHOP on a smaller graph for chr 6 in your preprint; would you be able to share how much time CHOP spent on creating the index for this graph?

Thanks in advance, and if my questions are unclear, I'll be happy to re-formulate them :)
I tried running CHOP on chromosome 1 of a whole-genome human graph created with 1000 Genomes variants. CHOP had been running for 10 hours using 130 GB of memory when I gave up. So I assume the memory requirement for all chromosomes running at the same time would be > 1 TB, and the running time would be at least 10 hours.
> Can CHOP be used to do the actual read mapping, or does it just create a fasta file that I will need to index with e.g. BWA and then map reads to?
Indeed, CHOP generates a fasta file containing the paths through the graph, and an aligner such as BWA is still needed to index it and align against it.
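As a minimal sketch of what such a workflow could look like (the filenames below are placeholders, not CHOP's documented defaults; only the `-g`/`-k` flags come from this thread):

```bash
# Sketch only: "paths.fa" is an assumed output name, not CHOP's documented default.
chop -g graph.gfa -k 100              # emit paths through the graph as fasta
bwa index paths.fa                    # build a BWA index over the emitted paths
bwa mem paths.fa reads.fq > aln.sam   # align reads against the emitted paths
```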
> If so, I assume my final alignments will be relative to the paths/sequences in the fasta file produced by CHOP. How do I convert these positions/coordinates to e.g. linear reference genome coordinates?
Yes, the alignments will be with respect to the paths generated by CHOP. Interpretation requires translating them back to graph coordinate space; this is possible since each path is annotated with an identifier that relates its origin back to the graph.
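As a purely hypothetical illustration of that translation (the header format below is assumed, not CHOP's actual identifier scheme): if each fasta header encoded the node IDs the path traverses, then the SAM reference name plus the alignment position would be enough to place a read back on the graph, with node lengths from the GFA resolving the within-node offset.

```bash
# Hypothetical: assumes fasta headers like ">12;13;15" listing traversed node IDs.
# For each alignment, print the read name, the path's node IDs, and the offset on the path.
samtools view aln.sam | awk '{ print $1, "path nodes:", $3, "offset:", $4 }'
```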
> Also, do you know how the run time scales to bigger graphs? I'm interested in using CHOP on a graph created for the whole human genome with all variants from 1000 genomes -- do you think this would be feasible? I see you have only been testing CHOP on a smaller graph for chr 6 in your preprint; would you be able to share how much time CHOP spent on creating the index for this graph?
Although not in the bioRxiv preprint (which is somewhat outdated), we have evaluated CHOP on the complete human genome with all 1000 Genomes variants (CHOP was run on each chromosome individually, and the outputs were merged and subsequently indexed using BWA); there we noted a 2-3x increase in read alignment time using BWA compared to linear genome alignment. For example, CHOP can process chromosome 1 (18,989,090 nodes and 25,457,607 edges) in approximately 9 hours (serially), with a peak memory usage of 73 GB.
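A sketch of that per-chromosome setup, under the same placeholder filenames as above:

```bash
# Sketch: run CHOP per chromosome, merge the emitted fasta files, index once.
for chr in $(seq 1 22) X Y; do
    chop -g chr${chr}.gfa -k 100   # one graph per chromosome
done
cat chr*.fa > all_paths.fa         # merge outputs (filenames assumed for illustration)
bwa index all_paths.fa
```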
> I tried running CHOP on chromosome 1 of a whole-genome human graph created with 1000 Genomes variants. CHOP had been running for 10 hours using 130 GB of memory when I gave up. So I assume the memory requirement for all chromosomes running at the same time would be > 1 TB, and the running time would be at least 10 hours.
There's quite a difference in our memory usage; can you share the commands you used to run CHOP?
Okay, thanks a lot for the answers!
I guess the increased memory might be because my chromosome 1 graph had a few more edges/nodes (23,739,139 nodes in total).

I ran CHOP with the command `chop -g 1.gfa -k 100`, using this gfa (http://folk.uio.no/ivargry/1.gfa -- currently uploading, should be finished uploading in a few minutes). The memory was at around 130 GB after an hour or so. As mentioned, this might make sense since my graph had some more nodes/edges. I'm not sure how -k affects the memory (and also not really sure what a suitable k is if I want to map 150 base pair reads).
I see you're running CHOP without the `-H` flag, meaning no haplotyping (if available) is used and all combinatorial paths of length k are emitted. Combined with the size/complexity of your graph, it is no surprise there's such a big difference (the statistics I gave you were for CHOP with haplotyping). Given that the 1000 Genomes data comprises 2,504 samples (corresponding to 5,008 sample paths + 1 reference path through the graph), the number of paths that have to be evaluated is far smaller when haplotyping is used.

Aside from haplotype usage, another major factor in memory usage is the value of k: as k grows, longer paths are indexed, which can traverse more and more edges and increases the number of paths to be emitted.
I also noted that you are not using any haplotype information in your graph (at least I see only one encoded path), meaning that even if `-H` were used it would not do anything. I think that, given the complexity of the graph, it is (at least for CHOP) necessary to include haplotyping to make indexing feasible.
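For reference, once the graph does encode sample paths, the haplotype-aware run would presumably just add the flag to the command you posted:

```bash
# With sample paths encoded in the GFA, enable haplotyping via -H.
chop -g 1.gfa -k 100 -H
```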
I see, my intention was to use the haplotype info, but I overlooked `-H`, and I also wasn't aware that my gfa did not contain haplotype information -- my bad.
I'll give it a try again some time later when I have a graph with haplotype info, and I'll let you know if I run into any issues.
Thanks!