Unusual large memory needed for very large sized coupled BNDs

jamigo commented 3 months ago

We are currenly studying the SVs contained in 9000+ samples and detected by DRAGEN. We are trying to join them altogether and to annotate them in a single step, an approach that has worked for us in other projects, but we are facing an issue that we don't know how to deal with. This is how we proceeded:

We tried annotating the whole bunch of SVs of those 9000+ samples and it demanded too much memory (64GB+).
We divided the problem by chromosome, and everything worked fine except chr2.
We then divided the variants in chr2 by type, and all types worked fine except BND.
We studied those BND variants by their same-chr-coupled size, and we detected overexpressed sizes of 20M-30M and 200M-210M, all related to a small 500bp chr2 region.

We are therefore studying what is happening on this 500bp chr2 region, but at the same time we would like AnnotSV to help us, and it just can't: AnnotSV breaks when dealing with these several thousands of very large coupled BNDs.

We can process all SVs from all chromosomes plus chr2's except the BNDs on that critical region with only 8GB/chromosome, but we tried annotating these 11000+ coupled BNDs, where 4000+ of them are 20M-30M and 3000+ of them are 200M-210M, and 64GB was not enough.

input.annotated.tsv.tmp ends up being 4.3GB before AnnotSV dies. input.SV_RE_intersect.tmp ends up being 13.6GB before AnnotSV dies.

Before trying to solve this problem increasing the memory by trial and error, we would like to know if there's anything else we can do on our side to workaround this issue,

lgmgeo commented 3 months ago

Hi @jamigo,

It's nice to see that AnnotSV is useful and applied on thousands of samples. It’s very motivating for me!

And thank you for your detailed feedback. Your small 500 bp chr2 region looks very interesting. I'm curious if it is located in a segmental duplication region or in a repeatMasker (SINE, LINE...) region or near centromer / telomere, which could explained badly called BNDs.

In your VCF DRAGEN SV input file, I assume your BNDs are annotated with square-bracketed notations in ALT and that reciprocal BNDs (MATEID) are indicated. In your small 500 bp chr2 region, what types of SV do your BNDs correspond to? Is there a particular type (INV, INS...)? Do they have FILTER=PASS (I guess not if this region is indeed complex)?

Anyway, AnnotSV seems in difficulty because of these BND pairs, corresponding to large SVs. So, for now, my advice would be:

To extract the BNDs in your small 500bp region and set the ALT feature to <BND> (instead of the square-bracketed notations in ALT)
Do the same for the corresponding MATEIDs

=> By analyzing only the BNDs and not the SV in its entire width, this should avoid bugging AnnotSV.

In the futur, I'm thinking about integrating/using a database in AnnotSV code (https://github.com/lgmgeo/AnnotSV/issues/15). This should fix the bug. But it's a big job, and I have other implementations to do first. This will be done, I hope before the end of the year at the latest.

I'll keep you posted here.

Best,

Véronique

lgmgeo commented 3 months ago

Note for square-bracketed ALT notation:

The comprehension of the square-bracketed notations relies on the homogenization rules from the variantextractor tool (provided by Rodrigo Martin).

For duplication, inversion, deletion and insertion, AnnotSV returns one full annotation per SV (one full annotation per breakend pair).

jamigo commented 3 months ago

AFAIK, AnnotSV is handling square-bracketed SV notation perfectly fine. The problem we have only happens when AnnotSV deals with several thousands of very-large-same-chromosome-paired BNDs (some >20M, some even >200M), maybe because AnnotSV is not releasing memory when it has to (this one could be checked), or maybe because the underlying bedtools calls demand lots of memory (this one would be difficult to address).

We were in fact thinking about leaving out all these very-large-same-chromosome-paired BNDs found in this "complicated" region, since they are obviously derived from a reference genome feature rather than from each sample's features, but changing the square-bracketed SV notation to "<BND>" for these challenging variants only solved the problem, so we have decided to keep them and handle them with care.

For the record, the "complicated" region that seems to have high homology across the same chromosome and even on other chromosomes through the genome is chr2:32916100-32916600 (hg38).

lgmgeo commented 3 months ago

changing the square-bracketed SV notation to "<BND>" for these challenging variants only solved the problem, so we have decided to keep them and handle them with care.

Perfect!

For the record, the "complicated" region that seems to have high homology across the same chromosome and even on other chromosomes through the genome is chr2:32916100-32916600 (hg38).

This is a Repeat (G)n region
No RefSeq curated genes
The "Human Gene LINC00486 (ENST00000414054.5) from GENCODE V44" corresponds to a long intergenic non-protein coding RNA (long non-coding RNA from RefSeq NR_027098)

In my opinion It is quite normal for SV callers to be inaccurate for BNDs in this region (with short reads especially). The large SVs (>20M, >200M) found in this "complicated" region would appear to be false positive SVs.

lgmgeo / AnnotSV

Unusual large memory needed for very large sized coupled BNDs #232