madagiurgiu25 / decoil-pre

Reconstruct ecDNA from long-read data using Decoil tool
BSD 3-Clause "New" or "Revised" License
8 stars 0 forks source link

adjusting decoil reconstruct parameters #14

Open eesiribloom opened 5 months ago

eesiribloom commented 5 months ago

Similar to #11 I'd like to make my decoil ecDNA calls more conservative. But most of my samples have ~25-30x and I don't want to downsample them and loose information unnecessarily.

With default parameters decoil is calling ecDNA in all my samples, which is unlikely. Visually inspecting these regions confirms that some are high-confidence and "look" like ecDNA and others do not have strong evidence.

I have tried increasing --min-vaf to 0.1 which did reduce the number of ecDNA slightly

Given your samples were ~5-10x how should I adjust the parameters accordingly?

What is the difference between --min-covand --fragment-min-cov ?

does --min-cov-alt refer to the number of reads spanning junctions ?

By default does --filter-score remove cycles with estimated copy number /proportion < WGS coverage and should I consider increasing this number ?

madagiurgiu25 commented 5 months ago

Dear @eesiribloom,

I understand your point about not downsampling, however, currently, all similar methods do internally downsampling. Additionally, downsampling to 5-10X will be equivalent with setting all the thresholds higher.

To check the description of the parameters please visit https://github.com/madagiurgiu25/decoil-pre/blob/main/docs/decoil_reconstruct.md. --min-cov represents the minimal read depth DP (from the VCF file), whereas --fragment-min-cov represents the minimal mean coverage spanning a particular genome fragment. Yes, --min-cov-alt represents the reads spanning the junction.

I think in your case you should try to run your data using a setup like: --fragment-min-cov 40 --min-cov 40 --min-cov-alt 20 --min-vaf 0.01 --filter-score 40 --fragment-min-size 1000 --min-sv-len 5000. I would suggest to use the installation from source version to have access to the latest Decoil version.

Let me know if this works for you better.

eesiribloom commented 3 months ago

thanks I will try. What do you think about using variant calls from a somatic SV caller such as nanomonsv which seem to outperform germline callers in cancer data

madagiurgiu25 commented 3 months ago

Dear @eesiribloom,

Thank you for the suggestion. If you have a VCF file from nanomonsv you could try to run decoil (installed from source), but it depends if the VCF is standardized:

decoil reconstruct -b <bamfile> -i <vcffile> -c <coveragefile> --outputdir <outputdir> --name <sample> -r <reference_genome> --fragment-min-cov 40 --min-cov 40 --min-cov-alt 20 --min-vaf 0.01 --filter-score 40 --fragment-min-size 1000 --min-sv-len 5000

Install from source decoil:

git clone https://github.com/madagiurgiu25/decoil-pre.git
cd  decoil-pre

# create conda environment
mamba env create -f environment.yml
conda activate envdecoil

python -m pip install -r requirements.txt
python setup.py build install

# add decoil in $PATH
ROOT=`dirname $(which decoil)`
export PATH=$PATH:$ROOT

If you have a VCF file example from nanomonsv please send me and I can adjust the code to make Decoil compatible with the nanomonsv VCF output.

MJoseMo commented 2 months ago

Dear @madagiurgiu25, I would like to ask if it is necessary to downsample the bam file of example, I've seen that its coberture is ~138

samtools depth -a map.bam | awk '{sum+=$3} END {print "Total coverage: ",sum/NR}'
Total coverage:  0.038994

samtools depth map.bam | awk '{sum+=$3} END {print "Total coverage: ",sum/NR}'       Total coverage:  137.789
madagiurgiu25 commented 1 month ago

The map.bam is an example and contains simulated data, i.e. reads cover only the simulated ecDNA structure. There is no WGS coverage. The flag -a in samtools depth keeps all bins, inclusively those which are 0X covered. And this is how you also compute correct the WGS coverage. Without -a flag, the bins with are 0X covered will be ignored for the calculation. This means the amplified region is 137X, but this is very small. The WGS mean coverage is 0.03X. The map.bam does not require downsampling, this is a very small example.