vgteam / sv-genotyping-paper

MIT License
31 stars 6 forks source link

Doubt on hal to vg processing #6

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Dear vg team, I've got a series of questions for you regarding calling variants starting from a vg graph generated from a hal archive.

I've successfully ran cactus chromosome-wise on five different mammalian genomes, and converted it to a vg archive and trying to process it with vg (v1.19.0).

Looking at your script in this repository, I guessed that the way to generate the archive is as follow:

`

Prepare the vg file

hal2vg --noAncestors --hdf5InMemory --refGenome myref chr1.hal > tmp1.vg vg mod -X 32 tmp1.vg > tmp1.chop32.vg vg explode tmp1.chop32.vg exploded/ vg mod -D tmp1.chop32.vg > chr1.final.vg

Generate final xg index

vg index -T -x chr1.final.xg -F thread_names chr1.final.vg

Generate final gbwt index

vg index -T -G chr1.final.gbwt -F thread_names ./exploded/component0.vg

Generate pruned graph

vg index -T -x exploded/component0.xg -F thread_names exploded/component0.vg vg prune -u -g chr1.final.gbwt -m node_mapping chr1.final.vg > chr1.final.prune.vg vg index -g chr1.final.gcsa -f node_mapping -b ./tmp chr1.final.prune.vg `

This process should produce a set of chr1.final.* with all the different indexes, am I right? Also, I have been looking at the documentation for vg index, but I can't find the option -F. Has it been replaced in recent releases?

Finally, I also have a set of short variants in vcf format. I know it can be added with the command vg add, but I'm unsure on when add them. Should it be before vg mod -D, or directly on the final vg graph?

Thank you in advance for the help,

Andrea

eldariont commented 4 years ago

Hi Andrea,

thanks for your questions. Yes, the commands found here (https://github.com/vgteam/sv-genotyping-paper/blob/master/yeast/graphs/cactus_all/Snakefile) should produce a set of chr1.final.* with three indices (XG, GCSA, and GBWT).

Your commands are almost right with a few modifications:

# Prepare the vg file
hal2vg --noAncestors --refGenome myref chr1.hal > tmp1.vg
vg mod -X 32 tmp1.vg > tmp1.chop32.vg
vg explode tmp1.chop32.vg exploded/
vg mod -D exploded/component0.vg > chr1.final.vg

# Generate final gbwt index
vg index -T -G chr1.final.gbwt -F thread_names ./exploded/component0.vg
# Generate final xg index
vg index -T -x chr1.final.xg -F thread_names chr1.final.vg
# Generate pruned graph
vg prune -u -g chr1.final.gbwt -m node_mapping chr1.final.vg > chr1.final.prune.vg
vg index -g chr1.final.gcsa -f node_mapping -b ./tmp chr1.final.prune.vg

The -F option has been removed from vg index in v1.16. Maybe @jltsiren or @glennhickey can elaborate on how to modify the pipeline for the current version of vg.

Regarding the short variants, I'm not sure why you would like to add more variants to the graph produced by cactus. It should already contain all types of variation between the genomes used as input.

Best David

RenzoTale88 commented 4 years ago

Thank you for you quick reply! I need to include variants from short read sequencing of a large sample of individuals, independent from the assemblies used for the study.

Thank you again for the help,

Andrea

glennhickey commented 4 years ago

Yeah, the -F can be safely dropped now. For this workflow, I think you can actually get away with skipping the gbwt construction altogether (so no index -T -G or -F needed, then vg prune can be run with only -r).

Running vg add at the end is probably the only way to include your VCF variants, but be advised it is fairly experimental.

On Mon, Oct 7, 2019 at 8:46 AM RenzoTale88 notifications@github.com wrote:

Thank you for you quick reply! I need to include variants from short read sequencing of a large sample of individuals, independent from the assemblies used for the study.

Thank you again for the help,

Andrea

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/sv-genotyping-paper/issues/6?email_source=notifications&email_token=AAG373S2DCOH3UVDUZ5R7D3QNMVR5A5CNFSM4I5XYHR2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAQGCIQ#issuecomment-538992930, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG373THVQZKRTYY2FBKZXDQNMVR5ANCNFSM4I5XYHRQ .

RenzoTale88 commented 4 years ago

Ok then I'll proceed as you recommend. Just one question: if I add the short variants at the end, should I rebuild the indexes?

Briefly, process and index the cactus archive, add the short variants, re-build the indexes?

Thank you again for your help

Andrea

glennhickey commented 4 years ago

If you want those variants in your indexes, then yes, you'd need to rebuild them. In general, there's no reason to build the indexes before vg add, so you can just build them once at the end.

I'm not sure what you're planning to do with the graph, but my intuition is that it will probably work better if you don't run vg add on it. Do feel free to try and report any problems back here though!

On Mon, Oct 7, 2019 at 9:37 AM RenzoTale88 notifications@github.com wrote:

Ok then I'll proceed as you recommend. Just one question: if I add the short variants at the end, should I rebuild the indexes?

Briefly, process and index the cactus archive, add the short variants, re-build the indexes?

Thank you again for your help

Andrea

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/sv-genotyping-paper/issues/6?email_source=notifications&email_token=AAG373QLA2J7H5Y4URUMODLQNM3PZA5CNFSM4I5XYHR2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAQLNTY#issuecomment-539014863, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG373WZGJ6SCWWYXICCA6LQNM3PZANCNFSM4I5XYHRQ .

RenzoTale88 commented 4 years ago

Thank you very much for your suggestions! Do you think there is some "safe procedure" to include short variants from a separate experiments? Just because I think they could add value to the graph since adding further source of known variability to the dataset.

RenzoTale88 commented 4 years ago

Sorry to come back again with this, but I got one more questions (which may be a simple): if I drop the paths in the graph using the vg mod -D, when I will call the variants on the given dataset, will that map the variant on the respective chromosome? Or is it better for me to retain the paths skipping the -D step?

In the sense that I'm proceeding as follow:

`for i in seq(1 N); do hal2vg --noAncestors --refGenome myref chr${i}.hal > tmp${i}.vg vg mod -X 32 tmp${i}.vg > tmp${i}.chop32.vg vg explode tmp${i}.chop32.vg exploded/ vg mod -D exploded/component0.vg > chr${i}.final.vg vg prune -r chr${i}.final.vg > chr${i}.final.pruned.vg done

vg index -x all.xg $(for i in seq(1 N); do echo chr${i}.final.vg ; done)

vg index -g all.gcsa $(for i in $(seq 1 N); do echo chr${i}.final.pruned.vg; done) `

But won't that create a single space with all the sequences as a single large graph? Or it will be able to distinguish the chromosomes somehow?

Thank you, sorry for the possible simple question.

Andrea

glennhickey commented 4 years ago

Variants are represented with respect to an embedded path in the graph. So you need to keep paths in the graph in order to do any kind of variant calling, either from GAM with vg call, or directly from the paths with vg deconstruct -e.

On Fri, Oct 11, 2019 at 7:35 AM RenzoTale88 notifications@github.com wrote:

Sorry to come back again with this, but I got one more questions (which may be a simple): if I drop the paths in the graph using the vg mod -D, when I will call the variants on the given dataset, will that map the variant on the respective chromosome? Or is it better for me to retain the paths skipping the -D step?

Thank you, sorry for the possible simple question.

Andrea

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/sv-genotyping-paper/issues/6?email_source=notifications&email_token=AAG373XXI636X6QWZ4QHCR3QOBQGXA5CNFSM4I5XYHR2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEA7W2TI#issuecomment-541027661, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG373WSXEFSO4JLV2UDSHDQOBQGXANCNFSM4I5XYHRQ .

RenzoTale88 commented 4 years ago

Ok perfect, thank you very much. So I will just drop the vg mod -D command and proceed with the subsequence indexing.

Thank you again!

Andrea