kunalkathuria / SVXplorer

Structural Variant Caller
MIT License

Memory and ALT #96

Closed waschf closed 4 years ago

waschf commented 4 years ago

Hi there,

I am trying to use SVXplorer to call SVs in plant genomes, and I have a few issues. First, when I run it on my full data set (around 60x coverage), it ends up using more than 256 GB of RAM, at which point the cluster I am working on terminates the program. It also takes around 10 days to reach that point.

I then tried --subsample, and with it the program at least finishes. The results are quite good for deletions: there is a lot of overlap between the SVX results and, e.g., manta results. But with BNDs the results are weird, and the REF field is always N. All BNDs look something like this:

Bvchr5_un.sca003 859302 11 N N[Bvchr5_un.sca003:859738[ . PASS SVTYPE=BND;CIPOS=0,228;CIEND=-228,0;PROBTYPE=Translocation;MATEID=12;GROUPID=G4;SUPPORT=20;PE=20;SR=0;IMPRECISE;CINFO=0.141950716132 GT:SU:PE:SR ./.:20:20:0

and the overlap between BNDs called by SVX and manta is minimal.

kunalkathuria commented 4 years ago

Hi,

Glad --subsample worked. This could be an indication that the coverage in your BAM file is inordinately high at some point (even a small segment of the reference alignment), causing a bottleneck at some stage. If so inclined, you could run SVXplorer in debug mode (-d), then look at the tail end of the run.log file to identify where the high coverage became an issue.
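As a complementary check outside SVXplorer, the regions of inordinately high coverage can be looked for directly. The following is a minimal sketch, assuming per-base depth has already been written in the three-column, tab-separated layout that `samtools depth` produces (contig, 1-based position, depth). The function name `high_depth_windows` and the parameters `window` and `factor` are hypothetical and are not part of SVXplorer.

```python
import statistics
from collections import defaultdict


def high_depth_windows(depth_lines, window=1000, factor=10.0):
    """Flag fixed-size windows whose mean depth exceeds factor x the median.

    depth_lines: iterable of "contig<TAB>pos<TAB>depth" strings (assumed layout).
    Returns a sorted list of (contig, start, end, mean_depth) tuples.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, pos, depth = line.split("\t")[:3]
        # Positions are 1-based, so position 1 falls in window index 0.
        totals[(contig, (int(pos) - 1) // window)] += int(depth)

    if not totals:
        return []

    # Dividing by the window size treats unreported positions as depth 0.
    # The last window of a contig may be shorter, so its mean is a lower bound.
    means = {key: total / window for key, total in totals.items()}
    cutoff = factor * statistics.median(means.values())

    return sorted(
        (contig, idx * window + 1, (idx + 1) * window, mean)
        for (contig, idx), mean in means.items()
        if mean > cutoff
    )
```

Windows returned by this sketch are candidates for the kind of local pile-up described above; whether they are what actually slows the run down would still have to be confirmed from the debug log.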

As far as BNDs are concerned, SVXplorer follows the specifications in https://github.com/samtools/hts-specs/blob/master/VCFv4.2.pdf (see section 5.4 on the format). Typically, BND calls may be very different for different structural variant callers. SVXplorer designates any event that does not qualify as DEL, TD (Tandem Duplication), INV, INS (cut or copy) after respective read-depth filtering as BND. The comment field in the VCF and BEDPE output will designate the likely type of event this may otherwise have been. SVXplorer does not currently support REF/ALT sequence identification and the seqs are thus designated as "unknown."
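To make the breakend notation concrete, here is a minimal sketch of how the ALT field of a BND record can be taken apart. It follows the four forms listed in section 5.4 of the VCF 4.2 specification (`t[p[`, `t]p]`, `]p]t`, `[p[t`). The function name `parse_bnd_alt` and the returned dictionary keys are hypothetical, and the pattern assumes contig names contain no colon or bracket characters.

```python
import re

# Breakend ALT forms from VCF 4.2 section 5.4: t[p[  t]p]  ]p]t  [p[t
# where t is the replacement base string and p is "contig:position".
BND_ALT = re.compile(
    r"^(?P<lead>[ACGTNacgtn.]*)"
    r"(?P<bracket>[\[\]])(?P<contig>[^:\[\]]+):(?P<pos>\d+)(?P=bracket)"
    r"(?P<trail>[ACGTNacgtn.]*)$"
)


def parse_bnd_alt(alt):
    """Split a breakend ALT string into mate location and orientation."""
    match = BND_ALT.match(alt)
    if match is None:
        raise ValueError(f"not a breakend ALT: {alt!r}")
    lead, trail = match.group("lead"), match.group("trail")
    # Exactly one side of the brackets carries the base string t.
    if bool(lead) == bool(trail):
        raise ValueError(f"ambiguous breakend ALT: {alt!r}")
    return {
        "mate_contig": match.group("contig"),
        "mate_pos": int(match.group("pos")),
        # "[" : the joined piece extends to the right of p
        # "]" : the joined piece extends to the left of p
        "bracket": match.group("bracket"),
        # True when the piece is joined after t (t[p[ or t]p])
        "joined_after": bool(lead),
        "bases": lead or trail,
    }
```

For the ALT value quoted in the question, `parse_bnd_alt("N[Bvchr5_un.sca003:859738[")` yields the mate contig `Bvchr5_un.sca003`, mate position 859738, and `bases` equal to `N`, which is the placeholder written when the actual base is not reported.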

Hope this helps.

waschf commented 4 years ago

Thanks for your reply. I'll try to run it in debug mode.

Regarding the second part, just to clarify two things:

1.) As I understand it now, it is expected that the REF field is always N and that the BND annotation is always something like " N[chr1:859530[ " instead of " A[chr1:859530[ " or " C[chr1:859530[ ". Is this correct?

2.) You mentioned that SVXplorer designates all variants that do not qualify as DEL, TD, INV, or INS as BND. Is this also true for cases with INFO fields like this:

SVTYPE=BND;CM=CopyPasteInsertion

So the whole group of the 4 corresponding BNDs has failed to classify as INS (copy)?

Thank you for your help!

kunalkathuria commented 4 years ago

Yes, 1) is correct. Regarding 2), yes, it is quite possible for it to be a copy-paste insertion but read depth could not confirm it. Each BND event has 2 entries arising from the same breakpoint pair (see VCF doc). There are other types of BNDs noted, e.g. inversions that were not supported at both ends will be written as BND with CM=inversion.

Regarding your previous question: although --subsample worked for this data set, if the problem happens again, try setting a mapping quality threshold higher than the default from the command line (e.g. 10).

kunalkathuria commented 4 years ago

To clarify, a pair of related breakpoints gives rise to a single BND entry in the VCF. A copy-paste insertion has 2 pairs of breakpoints (1 relating destination and source location 1, and the other relating destination and source location 2), giving rise to 4 BND entries in the VCF.
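To inspect how BND entries cluster in practice, the records can be grouped by their INFO fields. This is a minimal sketch, assuming that the `GROUPID` key seen in the example record ties together the entries of one event (an assumption, not confirmed in this thread) and that the likely event type is stored under `CM` or `PROBTYPE`, both of which appear above. The function names `parse_info` and `group_bnds` are hypothetical.

```python
from collections import defaultdict


def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def group_bnds(vcf_lines):
    """Group BND records by GROUPID (assumed to identify one event)."""
    groups = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])  # INFO is the 8th VCF column
        if info.get("SVTYPE") != "BND":
            continue
        # Fall back to the record ID when no GROUPID is present.
        group = info.get("GROUPID", cols[2])
        groups[group].append({
            "id": cols[2],
            "mate": info.get("MATEID"),
            "comment": info.get("CM", info.get("PROBTYPE")),
        })
    return dict(groups)
```

Under these assumptions, a group holding four entries with the comment `CopyPasteInsertion` would correspond to the two breakpoint pairs described above.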