vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Question on resulting VG variants #2594

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Hello, I'm calling variants using a cactus-generated graph genome, generated and indexed as previously reported in another issue (#2362). Alignments and post-processing are described in issue #2546, with the recommended changes. The only additional modification I've made is to increase the chunk size to 10 Mb, with a 100 kb overlap, to reduce the number of tasks to perform. I've also managed to call the variants on each chunk using the command:

vg call ./FILTER/$sample/${bname}.aug.vg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t 1 -p genome1.1 -l 120000000 -o $offset | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz

This generated a VCF file relative to the reference genome of interest (called genome1). I've noticed that some of the variants are represented with an alternate allele of N, and that no deletions appear. Is the N indicative of deletions?

Thank you for your support, Andrea

ekg commented 4 years ago

The N is probably found in the input genomes.

Do you see Ns in the input FASTA sequences that you gave to cactus? Do you find it in the GFA output of the graph (from vg view)?

RenzoTale88 commented 4 years ago

Yes, there are gaps in the input genomes, which are coded as Ns, and yes, I see them when I run vg view. Are there any changes I should make when creating the input graph from the cactus outputs?

ekg commented 4 years ago

There is nothing you need to do, and this is all expected. You're getting these calls because in some places reads align across the gaps (Ns).

RenzoTale88 commented 4 years ago

I'm not sure I'm getting this right, sorry. I'm getting variants like the one below:

genome1.1      10333   10528667_10528803       ATC     NNN     237.434 PASS    DP=239  GT:DP:AD:GL     1/1:239:1,237:-663.514901,-219.974202,-161.973251

So, the reference allele is ATC, but the alternate is NNN. Is the gap supposed to be in the reads, then? Sorry, maybe I'm just misinterpreting what you're telling me, but I'd like to fully understand the output I'm getting.
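As an aside on reading that record: VCF orders the GL values for a biallelic diploid site as 0/0, 0/1, 1/1 (log10 likelihoods), so the largest (least negative) value marks the best-supported genotype. A small sketch (not vg-specific, just the VCF convention):

```shell
# Sketch: pick the best-supported genotype from a biallelic GL triple.
# VCF orders GLs as 0/0, 0/1, 1/1 (log10 likelihoods), so the largest
# (least negative) value marks the best-supported genotype.
best_gt() {
  echo "$1" | awk -F',' '{
    best = 1
    for (i = 2; i <= NF; i++) if ($i + 0 > $best + 0) best = i
    gt[1] = "0/0"; gt[2] = "0/1"; gt[3] = "1/1"
    print gt[best]
  }'
}
best_gt "-663.514901,-219.974202,-161.973251"   # prints 1/1
```

For the record above, the third GL is the largest, which is consistent with the 1/1 genotype the caller emitted.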

ekg commented 4 years ago

Sorry, I misunderstood. This is indeed strange. Maybe @glennhickey can give some insight.

It does look like the gap is in the reads, or in the alternate allele that the caller has determined. It looks like a bug.

RenzoTale88 commented 4 years ago

So, the problem appears when I use the augmented vg graph, but is not present when using the xg index. I don't know if this might help track down the problem. I'm using vg version 1.20.0-78-gc71c000.

glennhickey commented 4 years ago

There must be N's in the reads that are getting augmented into the graph. @adamnovak and I were just chatting about this on Slack the other day. We used to have a filter for this, but it got lost in refactoring. I think I will have to add it back. In the meantime, you can just ignore these calls.

It may be worth double-checking that this is not a bug (and that this NNN really is in the vast majority of your reads as aligned to this position of the graph). I can do this for you if you share the augmented graph / GAM in question.
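Until the filter is restored, one way to ignore these calls downstream could be a simple ALT filter. This is a sketch only; it assumes the calls in question are exactly the records whose ALT column is all Ns, and that the pattern is adjusted if your ALT fields can be multi-allelic (e.g. "N,ATC"):

```shell
# Sketch: drop VCF records whose ALT allele (column 5) consists entirely
# of Ns. Header lines (starting with '#') are kept.
filter_n_alts() {
  awk -F'\t' '/^#/ || $5 !~ /^N+$/'
}
# Usage on a bgzipped VCF (assumes zcat and bgzip are available):
#   zcat calls.vcf.gz | filter_n_alts | bgzip -c > calls.noN.vcf.gz
```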

RenzoTale88 commented 4 years ago

Sorry, at the moment I don't have the augmented chunks anymore (I deleted them after finishing the variant calling). I can recreate them, but it might take a while. In the meantime, I'm encountering another kind of problem. I've created chunks of ~10 Mb with an overlap of 100 kb to process in parallel. However, some chunks (the initial parts of some chromosomes) give no errors but simply generate an empty VCF file. I'm not sure of the reason. Is there a way to chunk the GAM and the vg into chromosomes? That way I can try processing them and see whether other issues arise. Right now I'm doing it as follows:

vg chunk -x all.xg -a reads.gam -P PATHS.txt -c 50 -g -s 250000000 -o 100000 -b ${sample}_call_chunk -t 4 -E ${sample}.chunklist -f

Using a chunk size larger than the longest chromosome should split the graph into single chromosomes, but I'd like to know whether there is a better, more specific way to do it.
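As a quick sanity check for that chunk size, the longest reference sequence can be read off a samtools faidx index. A sketch, assuming a `ref.fa.fai` index is available (.fai columns: name, length, offset, linebases, linewidth):

```shell
# Sketch: find the longest reference sequence from a samtools faidx
# index, to confirm the vg chunk -s value exceeds it.
longest_seq() {
  awk -F'\t' '$2 + 0 > m {m = $2 + 0} END {print m}'
}
# Usage (ref.fa.fai and the 250000000 threshold are placeholders):
#   test "$(longest_seq < ref.fa.fai)" -lt 250000000 && echo "chunk size OK"
```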

Thanks again for the help

Andrea

glennhickey commented 4 years ago

Yes, this is how toil-vg now works -- by splitting into chromosomes. This can be done with the -C option. The -O pg option is also pretty important for performance.

vg chunk -x all.xg -a reads.gam -P PATHS.txt -C -g -b ${sample}_call_chunk -t 4 -E ${sample}.chunklist -O pg

The other steps (augment / pack / call) should work fine on chromosome-sized graphs. You can save some time by not splitting the GAM (skip the -a and -g options to vg chunk) and just passing the entire GAM to augment for each chromosome chunk.

Finally: if you do split the GAM, you'll need the very latest master branch for vg chunk to work properly with -C -a.
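Putting those steps together, a per-chromosome loop might look like the following dry-run sketch. The commands are echoed rather than executed here (vg and the data live elsewhere), the sample and chromosome names are placeholders, and the option sets are abbreviated relative to the full commands shown elsewhere in this thread:

```shell
# Dry-run sketch of the per-chromosome calling loop: prints the commands
# that would be run for one chromosome chunk instead of executing them.
plan_chromosome() {
  sample=$1
  chrom=$2
  bname=${sample}_call_chunk_${chrom}
  echo "vg augment ${bname}.pg ${sample}.gam -s -A ${bname}.aug.gam > ${bname}.aug.vg"
  echo "vg pack -x ${bname}.aug.vg -g ${bname}.aug.gam -Q 5 -o ${bname}.aug.pack"
  echo "vg call ${bname}.aug.vg -k ${bname}.aug.pack -s ${sample} > ${bname}.vcf"
}
plan_chromosome Sample1 genome1.1
```

In practice the chromosome names would be read from the `-E` chunklist produced by vg chunk rather than hard-coded.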

RenzoTale88 commented 4 years ago

Ok, perfect, I'm proceeding as you suggested. Does vg augment also accept pg files as input? Just running the command in vg 1.20.0 tells me it accepts a vg file (just double-checking, of course).

glennhickey commented 4 years ago

Yes, augment definitely accepts pg files in 1.21.0. If it isn't working in 1.20.0, then switching to the newer release should fix it.

RenzoTale88 commented 4 years ago

Ok, it is running now. I have one more question, related to computational time/space. vg augment produces an augmented graph and reads (GAM), which are then the inputs for vg pack. My question is: will the augmented GAM contain only the reads that map to the specific chromosome I'm processing, or all of the reads mapped to the graph? I'm asking because the latter might lead to much higher space requirements, and I would have to plan the process more carefully.

Thanks again Andrea

glennhickey commented 4 years ago

The augmented GAM will only contain reads that map to the graph. So every read in your original GAM should appear in at most one augmented GAM.

Note that this requires the -s option for vg augment (otherwise you'll get an error telling you to use it when a read outside the graph is encountered).

RenzoTale88 commented 4 years ago

Ok, thank you very much, I'll give it a go with this additional option. Thanks a lot again.

RenzoTale88 commented 4 years ago

Hi again, I've tried doing as you suggested. Unfortunately, I'm hitting problems when augmenting the graph, with an error for both the full GAM and the chunked GAM. I first chunked the graph into single-chromosome graphs:

# Chunking
vg chunk -x all.xg -P ./LISTS/PATHS.txt -C -O pg -b ${sample}_call_chunk -t ${NSLOTS} -E ${sample}.chunklist

Then, for every chunk I've augmented the graph with the full dataset (bname is the basename of each chunk):

# Augmenting
vg augment ${bname}.pg ${sample}.filtered.gam -m 4 -s -t $NSLOTS -A ${bname}.aug.gam > ${bname}.aug.vg

This step, however, fails every time I run it. The error I get is the following:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_ogOCBO/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Below is the stack trace of the issue encountered. Am I doing something wrong? Thank you for your help

Andrea

stacktrace_13012020.txt

glennhickey commented 4 years ago

This type of error usually means it's completely failing to read an input file. The strange thing is that the error message implies it's the graph (which is not loaded as the type shown), but the stack trace implies it's the GAM.

Does vg validate ${bname}.pg run through? How about vg view -a ${sample}.filtered.gam | head -10 ?

RenzoTale88 commented 4 years ago

Ok, sorry, I might have used the wrong stack trace (I'm running on a cluster and have to retrieve the stack traces from each node).

I tried again between last night and this morning, this time chunking the graph into chromosomes as sub-vg files instead of pg. I've always used version 1.20.0 for all the operations, although using version 1.21.0 didn't change the results. The commands I used are the same as in the post above, just creating a vg instead of a pg. Below is the new stack trace; the error printed is the following:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_TUvUqS/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Oddly enough, all the chromosomes failed except the last one. I'm not really sure why, since it is an array job that repeats every step for every chromosome in the same way.

Running vg validate on the different subgraphs didn't return any errors. The software proceeds, generating the augmented GAM files, but then fails when augmenting the vg.

Let me know what I can try next

stacktrace_15012020.txt

glennhickey commented 4 years ago

I see. This stack trace is pretty clear about it being vg augment crashing. In a recent PR (#2599) I fixed some bugs in the function it seems to be crashing in, so it may work with the latest master branch of vg. If you can share the input and command line for this vg augment crash with me, I'd be happy to debug it.

These issues have been helpful, so thanks for your patience. We've mostly been using vg call for genotyping -- i.e. without augmenting. I'm in the midst of running some whole-genome benchmarks for de novo calling with augmentation, but there's clearly a little work left to be done. Any further feedback is appreciated.

RenzoTale88 commented 4 years ago

Thank you very much for your help! I might have some problems sharing the data, since they are quite large (~40X coverage of mammalian-sized genomes), sorry about that. I'll try downloading and compiling the latest vg branch and then retry running it. In the meantime, I'm trying to run it on the whole graph instead of on single chromosomes. To get the latest branch, do I simply git clone, or should I specify a particular branch name?

glennhickey commented 4 years ago

Yeah, just a fresh clone with no branch specified. But keep in mind that in this version you can no longer combine graphs with cat (use vg combine instead -- it will be like this in the next release). And there's a bug (to be fixed shortly) where vg ids -s doesn't work.

RenzoTale88 commented 4 years ago

Ok, that should be fine. Is the graph resulting from vg combine compatible with previous versions of vg? I mean, if I combine the graphs with vg combine and feed the result to vg 1.20.0, will it work fine?

glennhickey commented 4 years ago

As far as I know, yes. Good question though. @adamnovak can confirm for sure.
One thing that is not compatible is the output of vg pack. So you need to make sure, in this case, that you're using the same vg version for vg call and vg pack.

adamnovak commented 4 years ago

Yes, the vg combine output is in vg's Protobuf format and should be readable by v1.20.0.

glennhickey commented 4 years ago

@RenzoTale88 I'm getting a crash with a similar stack trace on my own test. So I suspect that trying with the latest master will not fix it. I do have all the data I need to debug, and will let you know when it's fixed.

RenzoTale88 commented 4 years ago

Ok, great, then I'll wait for your go-ahead before upgrading my vg version. In the meantime, I've tried again to run vg chunk, vg augment and vg call. With the exception of some vg snarls core dumps, which disappeared on the next run (I'm not really sure why; I made no changes to the script between runs), everything worked fine. I'll wait for your fixes anyway; maybe they will make things more stable and easier for my next run.

Below the command I'm using to produce the different analyses:

sample=Sample1
reads=Sample1.filtered.gam

# Chunking
vg chunk -x GRAPH_CACTUS/all.xg -a $reads -g -P ./LISTS/PATHS.txt -C -O vg -b FILTER/${sample}/${sample}_call_chunk -t ${NSLOTS} -E FILTER/${sample}/${sample}.chunklist 
# For every chunk
tgtpath=genome1.1
bname=FILTER/${sample}/${sample}_call_chunk_genome1.1
NSLOTS=4

vg augment ./FILTER/$sample/${bname}.vg ./FILTER/$sample/${bname}.gam -m 4 -s -t $NSLOTS -A ./FILTER/$sample/${bname}.aug.gam > ./FILTER/$sample/${bname}.aug.vg
vg index ./FILTER/$sample/${bname}.aug.vg -x ./FILTER/$sample/${bname}.aug.xg
vg pack -x ./FILTER/$sample/${bname}.aug.xg -g ./FILTER/$sample/${bname}.aug.gam -Q 5 -o ./FILTER/$sample/${bname}.aug.pack #&& rm ./FILTER/$sample/${bname}.aug.gam
vg mod -r $tgtpath ./FILTER/$sample/${bname}.aug.vg | vg snarls - > ./FILTER/$sample/${bname}.aug.snarls

# Call the variants for the chunk
vg call ./FILTER/$sample/${bname}.aug.xg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t $NSLOTS -p $tgtpath | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz

One question: when I use vg snarls alone I get several core dumps. This gets better when I use vg mod -r $tgtpath ./FILTER/$sample/${bname}.aug.vg | vg snarls -. However, I'm not sure whether dropping the other paths while computing the snarls will affect the variant calling. You probably already answered this in a previous issue of mine, but I'd rather confirm before proceeding. Thanks again!

glennhickey commented 4 years ago

I think your crash was due to a bug in the -s option of vg augment. It should be fixed in the master branch now.

vg chunk -C is also fixed to name broken paths better in the master branch. This should stop the vg snarls crash. Using mod -r (which is now paths -r) is fine as a fix, and won't cause any trouble.

vg snarls uses a fair amount of memory. If it's still crashing with the fixed paths, that could be an issue. There was also an instability in vg snarls when running in parallel (which shouldn't be an issue here) that's fixed in the latest vg.

I'm continuing to test on my end.

RenzoTale88 commented 4 years ago

Ok, I can confirm that for one sample it keeps crashing on chromosome 1, which is the largest in the genome, even after fixing the paths with vg paths -r. I proceeded as follows:

vg paths -r genome1.1 $myfile.aug.vg > tmp.aug.vg
vg snarls tmp.aug.vg > $myfile.aug.snarls

Below the stacktrace:

Crash report for vg v1.21.0-114-g9f3eb752d "Fanano"
Stack trace (most recent call last):
#6    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x4daf89, in _start
#5    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x1b44f68, in __libc_start_main
#4    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x40ada2, in main
      Source "src/main.cpp", line 75, in main [0x40ada2]
#3    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x9b6007, in vg::subcommand::Subcommand::operator()(int, char**) const
    | Source "src/subcommand/subcommand.cpp", line 72, in operator()
      Source "/usr/include/c++/7/bits/std_function.h", line 706, in operator() [0x9b6007]
#2    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x93db96, in main_snarl(int, char**)
      Source "src/subcommand/snarls_main.cpp", line 224, in main_snarl [0x93db96]
#1    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb32fb3, in vg::CactusSnarlFinder::find_snarls_parallel()
    | Source "src/snarls.cpp", line 114, in finish
      Source "src/snarls.cpp", line 947, in find_snarls_parallel [0xb32fb3]
#0    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb321a5, in vg::SnarlManager::build_indexes()
    | Source "src/snarls.cpp", line 993, in size
    | Source "/usr/include/c++/7/bits/stl_deque.h", line 1272, in operator-<vg::SnarlManager::SnarlRecord, vg::SnarlManager::SnarlRecord&, vg::SnarlManager::SnarlRecord*>
      Source "/usr/include/c++/7/bits/stl_deque.h", line 356, in build_indexes [0xb321a5]
glennhickey commented 4 years ago

That looks like it's running out of memory. Is that what you're seeing? Any idea how much memory it's using? The only way I can think of to reduce the memory is more stringent coverage (-m) and/or quality filtering (-Q) in vg augment.

RenzoTale88 commented 4 years ago

No, the script never reaches the memory limit. I can try more stringent coverage filtering to see if it makes a difference (some samples are 30X, others almost 40X, so using the same threshold might have caused some problems).
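If I go that route, one option could be scaling the cutoff with sample depth. A sketch only; the 1/8-of-depth heuristic and the floor of 2 below are illustrative assumptions, not vg recommendations:

```shell
# Illustrative sketch: derive a vg augment coverage cutoff (-m) from
# sample depth, so 30X and 40X samples use comparable relative
# thresholds. The depth/8 scaling and minimum of 2 are assumptions.
augment_cutoff() {
  awk -v depth="$1" 'BEGIN { m = int(depth / 8); if (m < 2) m = 2; print m }'
}
augment_cutoff 32   # prints 4
```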

adamnovak commented 4 years ago

It looks like it isn't the reserve() call at src/snarls.cpp line 993 that is failing, but the size() call, like something has upset the std::deque internals.

I'd run the offending snarls command under gdb, and get a stack trace that way to see if it really is stopping us inside the deque size calculation, and, if so, what is wrong with the deque.

You could also try it with -t 1 to use a single thread. I think snarl finding is parallel, and it could be that a signal from one thread is ending up received by another. I thought our stack trace code was supposed to work around that, but maybe it doesn't always succeed?

glennhickey commented 4 years ago

I can take a look if you can share the graph.

glennhickey commented 4 years ago

For what it's worth, I finally have this pipeline running reliably, with and without GAM chunking, via toil-vg call. Until recently, I was getting bogged down in a lot of the same crashes you've described.

That said, I'm running on a graph made with vg construct, which is no doubt much less complex than yours from cactus, so perhaps there are still some issues. I will try to run a cactus graph soon.

For the vg snarls crash one thing to try if you haven't already is to run with the latest vg master.

RenzoTale88 commented 4 years ago

Hi, sorry for my late reply, I missed the email notifications. I'm testing the latest version right now, starting from the chunking onward. I'll keep you updated on how it develops. Thanks again for the support.

RenzoTale88 commented 4 years ago

Hi, I've tried vg v1.21.0-160 and it seems to be working. The only odd thing is that when I submit an array of jobs to process each chromosome separately, some of them generate an empty VCF without any error message or failure signal. I can simply rerun them and they'll work fine, but I find it a puzzling situation.

Thanks again for the support. I'll now rerun the whole pipeline with the new version of the software and see whether every step runs correctly.

Andrea

RenzoTale88 commented 4 years ago

Hi, sorry to come back to this, but I have an update. I'm running the analyses using vg 1.21.0-160 as written above. When running the pipeline with vg chunk -C, for some reason, the software crashes during the snarls step on some chromosomes. However, when I chunk the dataset using:

vg chunk -x all.xg -a FILTER/${sample}/${sample}.filtered.gam -P ./LISTS/PATHS.txt -c 50 -g -s 999999999 -o 100000 -b CHUNK/${sample}/${sample}_call_chunk -t 4 -E CHUNK/${sample}/${sample}.chunklist -f

This produces a series of chromosome-sized chunks (given the large chunk size), and they work just fine in all the subsequent stages, including calling the variants. My question is: is this a good workaround? I mean, does it produce results similar to the -C chunking approach?

Thanks for your help again Andrea

glennhickey commented 4 years ago

The snarls do seem to be a real issue for these cactus alignments. It's ironic, as the snarls code actually uses the cactus graph internally. Anyway, not using -C in this case limits the size of the variation in your chunks. If you're not interested in large SVs, it probably won't make much difference.

I will take a shot at fixing snarls to be more robust, but won't be able to start for another 2 weeks, unfortunately.

RenzoTale88 commented 4 years ago

So, after calling the variants I actually get variants several thousand bases long. What do you mean by large SVs? Is there a size limit imposed by the above command?