vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

from GAM to VCF notes #794

Open. ekg opened this issue 7 years ago.

ekg commented 7 years ago

I'm stuck trying to run toil-vg call. I need to set up the per-chromosome GAMs, and I don't see a clear way to do that scripted out in toil-vg. I'm trying with vg filter and vg chunk.

vg filter seems like it should do the right thing in one pass over the data, but it's taking ages to produce output:

vg filter -x ~/graphs/human/pan/index.xg -R path.bed -B HG002-NA24385-30x.gams/ -t 32 HG002-NA24385-30x.gam

Is this the right way to run it? The lack of output is concerning.

vg chunk has run out of memory on my system.

vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -P path.list -b HG002-NA24385-30x.gams/ -R targets.bed -t 32

I have 256G of RAM on the system. Should I set the thread count lower? I'm a bit confused about how it could be running out of memory, so I'm looking for advice about what has worked in the past.

ekg commented 7 years ago

Setting the parallelism to 8 seems to work:

vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -P path.list -b HG002-NA24385-30x.gams/ -R targets.bed -t 8

But now I see something strange. Either the compression is working fantastically better due to the binning by chromosome, or something else is going on:

# input
-> % du -hs HG002-NA24385-30x.gam   
135G    HG002-NA24385-30x.gam

# input sorted, with unaligned reads removed
-> % du -hs HG002-NA24385-30x.gam.sorted
85G     HG002-NA24385-30x.gam.sorted    

# per-chromosome binned, unaligned reads removed and no decoy sequences
-> % du -hs HG002-NA24385-30x.gams  
78G     HG002-NA24385-30x.gams      

Could this be caused by the removal of the decoy sequences?

ekg commented 7 years ago

I'm now attempting to use toil-vg call to generate calls for the GAMs.

I notice that the first thing it does is re-index the GAMs, then chunk based on those indexes. This seems like a pretty expensive step given that we've already had to make the genome-wide index in order to subdivide the alignments.

@glennhickey is there any way to pass the original index directly in rather than make the chromosome-level GAMs and then re-index these to get smaller GAMs?

glennhickey commented 7 years ago

Yeah, by default vg chunk uses the subgraph extraction from xg, which takes a lot of memory, both for the extraction itself and for storing the vg output. It tries to do subgraphs in parallel when you give it more threads, which can bloat memory usage quickly.

There is an option (probably not documented well enough) to use id ranges for extraction in vg chunk; it's what gets used in toil-vg:

https://github.com/glennhickey/toil-vg/blob/master/src/toil_vg/vg_map.py#L343-L352

and it is much faster and takes far less memory. It requires a TSV file of name / lowest-node-id / highest-node-id for each chromosome, something that toil-vg creates while indexing by running vg stats on the input vg files (which must have a contiguous id space per chromosome):

https://github.com/glennhickey/toil-vg/blob/master/src/toil_vg/vg_index.py#L306-L334
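For reference, a minimal sketch of building that TSV, assuming one vg file per chromosome and that vg stats -r prints the node id range as a tab-separated pair like "node-id-range" / "first:last" (the flag and output format are assumptions and may vary between vg versions):

# hypothetical: build the name / lowest-node-id / highest-node-id TSV
# expected by id-range chunking, one row per chromosome
while read chrom; do
    range=$(vg stats -r "${chrom}.vg" | cut -f2)    # e.g. "1:3205124"
    printf "%s\t%s\t%s\n" "$chrom" "${range%%:*}" "${range##*:}"
done < path.list > id_ranges.tsv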


glennhickey commented 7 years ago

Just a note to clarify:

toil-vg uses the id-range heuristic only for splitting into chromosomes. It does a second level of splitting along small path intervals for the calling, and that level uses the subgraph extraction.
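As an illustration of that second level, the per-interval subgraph extraction might look roughly like this, assuming vg chunk accepts a path-range target via -p (the flag and range syntax are assumptions and may differ between vg versions):

# hypothetical: subgraph-based extraction of one small path interval,
# the kind of chunk the caller then runs on
vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -p 20:1000000-2000000 -b call_chunk -t 1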


glennhickey commented 7 years ago

I see what you mean. This was written to save the disk space and IO of copying the whole-genome GAM index to each worker, but at the cost of re-indexing. I can imagine this not being worth it in many (most?) computing environments. Is this step taking a lot of time for you? I don't think it'll be too much trouble to add an option to pass through the whole index.


glennhickey commented 7 years ago

Finally: I have noticed the radical differences in GAM sizes you show, and as far as I can tell they're entirely due to ordering and compression.
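One rough way to see the ordering effect in isolation would be to round-trip the same reads through JSON and compare compressed sizes in both orders. A hypothetical sketch, assuming vg view -a emits one JSON alignment per line (and noting that a lexicographic sort is only a crude proxy for a real sort by node id):

# hypothetical: compare how well the same records compress before and after reordering
vg view -a sample.gam > reads.json
sort reads.json > reads.sorted.json
gzip -k reads.json reads.sorted.json
ls -l reads.json.gz reads.sorted.json.gz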


ekg commented 7 years ago

> Finally: I have noticed the radical differences in GAM sizes you show, and as far as I can tell they're entirely due to ordering and compression.

The removal of the unaligned reads can't hurt, but it's good to know you're seeing the boost in compression too.

> This was written to save the disk space and IO of copying the whole-genome GAM index to each worker, but at the cost of re-indexing.

My "dream" variant calling process for a single node would work directly from the whole graph node2alignment GAM index, generating the chunks and calling them in parallel. It would even be fine to make all the regions directly from the GAM index first, then run jobs across these (maybe remotely). In theory, the GAM index can support very high parallelism in reads, so this could be done very quickly for the whole genome. It also didn't take a long time to build the indexes, even in a non-ideal system (networked storage < SSD).

I'm not sure what the best way to implement this is. Is toil-vg the closest thing or should we consider dropping something into vg itself to support the single-server variant calling pipeline application?

Also, I've not been able to get toil-vg call to run without failure on my system. It seems to be running out of memory in various ways: it reports job failures for exceeding memory, retries, and then seems to fail for good. And when I try to set large memory limits, it fails completely as well.

glennhickey commented 7 years ago

That second one (make all regions directly from the GAM index, then call) should be doable with vg chunk and vg call as they stand, plus a small option to toil-vg call (which may already exist) to skip the chromosome-level stuff.

That's too bad about toil-vg. Do you mind sending me any error logs? Specifying resource requirements seems to be the major bottleneck for people using this. It really needs to be simplified and more automated.
