madagiurgiu25 / decoil-pre

Reconstruct ecDNA from long-read data using Decoil tool
BSD 3-Clause "New" or "Revised" License

Stuck at "Combine simple circles" step while running Decoil #5

Closed: zhober closed this issue 3 months ago

zhober commented 6 months ago

Hi! I'm currently facing an issue while running Decoil. After going through various samples, I've hit a persistent problem with one specific sample: the process seems to be stuck at the "Combine simple circles" step, and there has been no new output for over a day.

Here are the details. Operating system: Windows 11 with Docker Desktop. Command used:

# run decoil with your input with standard parameters
$BAM_INPUT="./GBM5.md.bam"
$OUTPUT_FOLDER="./output"
$NAME="GBM5"
$GENOME="./reference.fa"
$ANNO="./anno.gtf"

# run decoil in `sv-reconstruct` mode
docker run -it --platform=linux/amd64 `
    -v ${PWD}/${BAM_INPUT}:/data/input.bam `
    -v ${PWD}/${BAM_INPUT}.bai:/data/input.bam.bai `
    -v ${PWD}/${GENOME}:/annotation/reference.fa `
    -v ${PWD}/${ANNO}:/annotation/anno.gtf `
    -v ${PWD}/${OUTPUT_FOLDER}:/output `
    -t madagiurgiu25/decoil:1.1.1-slim-test `
    decoil -f sv-reconstruct `
            -b /data/input.bam `
            -r /annotation/reference.fa `
            -g /annotation/anno.gtf `
            -o /output -n ${NAME}

Output on command: [screenshot]

Output file: [screenshot]

The issue arises only with this specific sample; all other samples are processed correctly without any hitches. Thank you for your time and assistance.

madagiurgiu25 commented 6 months ago

Dear @zhober,

What is your whole-genome mean coverage? I see the maximum coverage is 172615X. Could you try to:

Please let me know if the run still takes this long.

zhober commented 6 months ago

Hi, @madagiurgiu25. I don't think the stall is caused by the sample's mean coverage. The problematic sample is at 31X coverage, whereas my previously processed samples, with coverage exceeding 30X and even up to 40X, were handled without any issues.

I checked the output file coverage.bw and found that the difference in mean coverage between these samples is not large. However, the maximum coverage of the sample that failed is about 176722X, while the successful ones are around 90000X. The issue may lie there.

madagiurgiu25 commented 6 months ago

Hi @zhober,

Ok I see. So maybe you have a high complexity amplicon and lots of SVs. Could you check if this sample has way more SVs called compared to your other samples?

I still suggest downsampling it to 5X mean coverage, or alternatively you could filter the SVs more stringently, e.g. --fragment-min-cov 20 --fragment-min-size 1000 --min-vaf 0.2 --min-cov-alt 10.
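A minimal sketch of the downsampling suggestion, assuming samtools is on PATH and using the ~31X mean coverage and file name mentioned in this thread (`samtools view -s` takes a SEED.FRACTION argument):

```shell
# Sketch: downsample a ~31X BAM to ~5X before rerunning Decoil.
# MEAN_COV and the BAM name come from this thread; samtools is assumed installed.
MEAN_COV=31
TARGET_COV=5
# Subsampling fraction: 5/31, printed to three decimals.
FRAC=$(awk -v m="$MEAN_COV" -v t="$TARGET_COV" 'BEGIN { printf "%.3f", t/m }')
echo "$FRAC"
if command -v samtools >/dev/null; then
    # -s SEED.FRACTION keeps that fraction of reads (seed 42 here).
    samtools view -b -s "42${FRAC#0}" GBM5.md.bam > GBM5.5x.bam
    samtools index GBM5.5x.bam
fi
```

The downsampled BAM can then be mounted into the Docker container in place of the original input.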

madagiurgiu25 commented 6 months ago

Another question: where you see that extreme coverage, is it in the centromeric regions?

zhober commented 6 months ago

Hi @madagiurgiu25. There's no significant disparity in the number of SVs between the problematic sample and the others; all are at about 26k.

The maximum coverage in the problematic sample occurs at chr1:143215500, which is not in the centromeric regions. In comparison, three out of four successfully processed samples also exhibit their maximum coverage around this region (within a 1kb range). This region lacks repeat and gene elements. Only one sample identifies an ecDNA within this region, measuring 28kb, and that ecDNA does not contain repeat or gene elements.

In summary, the long runtime may simply be because the very high maximum coverage leads to a high-complexity amplicon. Additionally, I would like to know what an appropriate coverage for identifying ecDNA is. When the coverage is low, for example at the 5X you recommended, would the identification be less accurate?

madagiurgiu25 commented 6 months ago

Hi @zhober,

Thank you for your explanation!

> There's no significant disparity in the number of SVs between the problematic sample and the others; all are at about 26k.

> The maximum coverage in the problematic sample occurs at chr1:143215500, which is not in the centromeric regions.

On which genome assembly? For hg19 this position lies in a segmental duplication region (UCSC annotation).

> In comparison, three out of four successfully processed samples also exhibit their maximum coverage around this region (within a 1kb range). This region lacks repeat and gene elements. Only one sample identifies an ecDNA within this region, measuring 28kb, and that ecDNA does not contain repeat or gene elements.

> Above all, the long runtime may simply be because the very high maximum coverage leads to a high-complexity amplicon.

So your sample could not be processed even with --fragment-min-cov 20 --fragment-min-size 1000 --min-vaf 0.2 --min-cov-alt 10?

> Additionally, I would like to know what an appropriate coverage for identifying ecDNA is. When the coverage is low, for example at the 5X you recommended, would the identification be less accurate?

I did not perform a limit-of-detection analysis, so I cannot comment on that. However, the data included in the publication is shallow sequencing, i.e. ~5X. I hope it helps.

madagiurgiu25 commented 5 months ago

Hi @zhober,

I will close the issue due to inactivity. Please feel free to reopen in case you have further questions.

Best, Madalina

zhober commented 5 months ago

Hi @madagiurgiu25 ,

Sorry for the late reply.

> On which genome assembly? For hg19 this position lies in a segmental duplication region (UCSC annotation).

I used the hg38 assembly. I previously checked RepeatMasker in UCSC, which does not annotate this specific region. However, I found that the position does lie in a segmental duplication region in hg38.

> So your sample could not be processed even with --fragment-min-cov 20 --fragment-min-size 1000 --min-vaf 0.2 --min-cov-alt 10?

I attempted processing with the parameters above, but it still failed. I think the issue persists because those filtering criteria do not effectively screen out regions with excessively high coverage.

When I removed the reads overlapping the highest-coverage region with samtools view and reran Decoil, the ecDNA was successfully identified. So I propose adding a new parameter, --max-cov, to filter out regions with excessively high coverage. This would exclude regions composed of repeats, which often contribute to high coverage.
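For reference, one way to drop reads over a coverage spike before rerunning Decoil; this is a sketch, where the 1 Mb window padded around chr1:143215500 and the file names are illustrative choices, and samtools with `-L`/`-U` support is assumed:

```shell
# Sketch: remove reads overlapping the coverage spike, keep everything else.
# The padded window around chr1:143215500 is an illustrative choice.
printf 'chr1\t142715500\t143715500\n' > spike.bed
cat spike.bed
if command -v samtools >/dev/null; then
    # -L selects reads overlapping the BED; -U writes the reads NOT selected,
    # i.e. the spike-free BAM we actually want to keep.
    samtools view -b -L spike.bed -U GBM5.nospike.bam -o /dev/null GBM5.md.bam
    samtools index GBM5.nospike.bam
fi
```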

Additionally, I noticed the use of NanoFilt to filter Nanopore data in your pre-print. Is this step necessary?

madagiurgiu25 commented 3 months ago

Hi @zhober,

Thank you for pointing this out. I have now included a --fragment-max-cov parameter, which should filter those coverage spikes. You can install the latest version of Decoil, which has this change, directly from source (see README.md).

Let me know if this works for you.

zhober commented 3 months ago

I'm trying to install the latest version of the Decoil repository in a Python 3.8 environment. However, I encountered an issue with the pickle5 package, which is only compatible with Python 3.5, 3.6, and 3.7, as mentioned in https://pypi.org/project/pickle5/0.0.11/. Additionally, when attempting to install pybedtools, py2bit, and deeptoolsintervals using pip, I hit the error `python setup.py bdist_wheel did not run successfully`. These packages can, however, be installed using conda.

madagiurgiu25 commented 3 months ago

I have now removed pickle5 from the installation. Please let me know whether you can install from source, and whether Decoil terminates when using --fragment-max-cov (default 80000X). Thank you.

zhober commented 3 months ago

Now I can install from source, and using --fragment-max-cov solves the problem.

However, I encountered another issue: different versions of Sniffles result in different numbers of ecDNA calls. When I use sniffles1, the total number of ecDNA is about 50 to 100, with about 10 ecDNA larger than 100MB. But when I use sniffles2, the total number of ecDNA is around 10, with a few ecDNA larger than 100MB (most of these overlap with those identified by sniffles1). For sniffles1, I use Decoil's default parameters: --min_homo_af 0.7 --min_het_af 0.1 --min_length 50 --cluster --genotype --min_support 4 --report-seq. For sniffles2, I only use its own defaults: sniffles -i ${sample}.sorted.bam -v sv.sniffles.vcf.
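For clarity, the sniffles1 call with the defaults listed above, written out as one command. This is a sketch: the sample name and output file are placeholders, and it assumes a sniffles 1.x binary on PATH (which takes the BAM via -m rather than -i):

```shell
# Decoil's default sniffles1 parameters from this thread, as one invocation.
# "GBM5" and the VCF name are placeholders; sniffles 1.x takes the BAM via -m.
sample="GBM5"
SNIFFLES1_ARGS="--min_homo_af 0.7 --min_het_af 0.1 --min_length 50 --cluster --genotype --min_support 4 --report-seq"
echo "$SNIFFLES1_ARGS"
if command -v sniffles >/dev/null; then
    sniffles -m "${sample}.sorted.bam" -v sv.sniffles1.vcf $SNIFFLES1_ARGS
fi
```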

What parameters are recommended for sniffles2?

madagiurgiu25 commented 3 months ago

Hi @zhober,

I benchmarked only with sniffles1 and stuck with it, because for some reason sniffles2 missed DELs in my simulation dataset. I tried several configurations with sniffles2 but still could not call those DELs (maybe I am doing something wrong). I also raised an issue in the sniffles2 repo but did not get a response. So at the moment I cannot give you any advice for sniffles2.

This is why I recommend using the built-in pipeline for SV calling, because that is what was benchmarked on the simulated dataset.

Also, the calls that are <100MB are not ecDNA but rather simple circular calls. Many of those simple circular calls are artefacts and lie in low-complexity regions. I would only consider as "ecDNA candidates" calls that are >100MB with an estimated coverage of at least 10X and at least 3x the mean coverage.
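As a sketch of applying those thresholds programmatically: this assumes a tab-separated summary of circle calls with hypothetical columns (id, length in bp, estimated coverage), which is not necessarily Decoil's actual output schema, and uses the 31X mean coverage from this thread:

```shell
# Sketch: keep only "ecDNA candidates" per the thresholds above.
# The column layout (id, length_bp, est_cov) is a hypothetical schema.
MEAN_COV=31
printf 'circ1\t250000000\t120\ncirc2\t40000\t8\ncirc3\t150000000\t95\n' > circles.tsv
# Keep calls with length > 100MB, est_cov >= 10X and >= 3 x mean coverage.
awk -v m="$MEAN_COV" '$2 > 100000000 && $3 >= 10 && $3 >= 3*m' circles.tsv
```

With a 31X mean, the coverage cutoff is max(10, 93) = 93X, so circ2 (too short, too shallow) is dropped.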

zhober commented 3 months ago

Hi @madagiurgiu25. Thank you for the reminder; I will use the built-in pipeline.