vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Can vg simulate third-generation long reads? #4284

Open tanger-code opened 1 month ago

tanger-code commented 1 month ago

Hi.

I have a .gbz graph file, and I want to simulate third-generation long-read data from a pangenome graph. Can vg simulate third-generation long reads? If not, are there other methods to do this?

Any advice would be very helpful to me. Thanks.

jeizenga commented 1 month ago

Although vg sim can run with long read input, it's really designed for short reads. If you use it to generate long reads, you won't get very realistic errors or a realistic read length distribution. In our own testing and development, we've used pbsim to simulate long reads. You would probably want to generate the reads from FASTAs of sample haplotypes, rather than directly from the GBZ file.
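For illustration, the haplotype-FASTA route might look roughly like the sketch below. It assumes pbsim3 (the thread only says "pbsim"), a haplotype FASTA that has already been extracted, and placeholder values throughout: the file names, the model path, and the depth are hypothetical, so check them against your pbsim installation. By default it only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# Sketch only: simulate long reads from one haplotype FASTA with pbsim3.
# All names and values below are placeholders, not taken from this thread.
set -eu

HAP_FASTA="${HAP_FASTA:-SAMPLE.hap1.fa}"   # hypothetical haplotype FASTA
MODEL="${MODEL:-data/QSHMM-RSII.model}"    # pbsim3 error model; path is an assumption
DEPTH="${DEPTH:-20}"                       # arbitrary example coverage

# Print the command unless DRY_RUN=0, so the sketch is safe to run as-is.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 = haplotype FASTA, $2 = output file prefix
simulate_long_reads() {
    run pbsim --strategy wgs --method qshmm --qshmm "$MODEL" \
        --depth "$DEPTH" --genome "$1" --prefix "$2"
}

simulate_long_reads "$HAP_FASTA" "SAMPLE.hap1"
```

Running it once per haplotype FASTA of a sample would give reads from both haplotypes.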

tanger-code commented 1 month ago

Thank you! Can I also use the .gbz file to generate short reads with vg sim, using vg sim -x graph.xg -g graph.gbz -m SAMPLE -n 1000 -l 150 -a > SAMPLE.gam ? My .gbz file is a pangenome graph of all chromosomes, and I want to generate short reads only for chr21. Do I need to extract a chr21-only .gbz file first? I can't find a related command.

tanger-code commented 1 month ago

> Although vg sim can run with long read input, it's really designed for short reads. If you use it to generate long reads, you won't get very realistic errors or a realistic read length distribution. In our own testing and development, we've used pbsim to simulate long reads. You would probably want to generate the reads from FASTAs of sample haplotypes, rather than directly from the GBZ file.

I'm simulating long reads using pbsim3, and the output is a .maf file. If I want to run a simulation experiment, such as calling SVs from the simulated reads, can I use the MAF file as the truth set? Or should I use a public truth set?

Do you have any suggestions?

jeizenga commented 1 month ago

Looking through our script, it seems that we used the maf2sam subcommand of bioconvert.
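As a rough sketch of that conversion step: the loop below turns each MAF file into a SAM file with the same base name. The input file name is hypothetical, and by default the script only prints the commands (set DRY_RUN=0 to run them). The resulting SAM should describe where each simulated read came from on the sequence it was simulated from, so it is read-level truth and not a variant call set.

```shell
#!/bin/sh
# Sketch only: convert pbsim MAF output to SAM with bioconvert's maf2sam.
set -eu

# Print the command unless DRY_RUN=0, so the sketch is safe to run as-is.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Convert every MAF file given as an argument; foo.maf becomes foo.sam.
maf_to_sam() {
    for maf in "$@"; do
        run bioconvert maf2sam "$maf" "${maf%.maf}.sam"
    done
}

maf_to_sam sd_0001.maf   # hypothetical file name; substitute your own output
```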

adamnovak commented 1 month ago

@tanger-code If you want to simulate from just one named path in the graph, you can use the -P option to vg sim.

But that simulates from just that path; it won't include variants in the graph that leave the embedded path.

I don't think we have a way to simulate from the connected component of the graph that contains a path, other than using vg chunk --components -p name-of-path to pull out that subgraph and then simulating from it.
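The two options above could be sketched as follows. This is illustrative only: the graph file name and the path name chr21 are placeholders, the read count and length are copied from the command earlier in the thread, and it assumes that vg chunk writes the single extracted component to standard output and that vg sim can read that file directly (depending on the vg version, a prefix option or a format conversion may be needed). By default the script only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Sketch only: simulate short reads for one chromosome of a larger graph.
set -eu

GRAPH="${GRAPH:-graph.xg}"    # hypothetical whole-genome graph
TARGET="${TARGET:-chr21}"     # hypothetical path name; use a real path from your graph

# Print "command > file" unless DRY_RUN=0, in which case run and redirect.
run_to() {
    out="$1"
    shift
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s > %s\n' "$*" "$out"
    else
        "$@" > "$out"
    fi
}

# Option 1: reads from the named path only (no off-path variants).
# $1 = path name, $2 = output GAM
sim_from_path() {
    run_to "$2" vg sim -x "$GRAPH" -P "$1" -n 1000 -l 150 -a
}

# Option 2: pull out the component containing the path, then simulate from it.
# $1 = path name, $2 = output GAM
sim_from_component() {
    run_to "$1.vg" vg chunk -x "$GRAPH" --components -p "$1"
    run_to "$2" vg sim -x "$1.vg" -n 1000 -l 150 -a
}

sim_from_path "$TARGET" "$TARGET.path.gam"
sim_from_component "$TARGET" "$TARGET.component.gam"
```

Option 2 samples from the whole subgraph, so reads can cover variants that leave the embedded path, at the cost of an extra extraction step.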