kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
92 stars 11 forks source link

problems genotyping, dysgu run --sites #76

Closed warthmann closed 9 months ago

warthmann commented 10 months ago

Hello, I am having trouble getting the expected genotyping behaviour from dysgu run --sites. I am appreciating help in understanding the problem, which might be a misunderstanding of the expected behaviour on my part, in which case a clarification would be nice.

I called SVs separately in mutant and wildtype and filtered to get the unique ones. Up to here everything works fine.

dysgu filter \
--normal-vcf wildtype.vcf \
mutant.vcf \
wildtype.bam \
> mutant-vs-wildtype.vcf

as documented, the resulting vcf (mutant-vs-wildtype.vcf) contains the sites predicted to be unique in mutant and contains only one sample: "mutant". I would like to generate a vcf that has these positions but also contains a column for the wildtype sample. To this end I was thinking I can genotype the wildtype sample for the reported positions and then merge the vcfs, however, when I issue below genotyping command, the resulting vcf (wildtype.re_geno.vcf) contains many more variants than the input vcf specified with --sites (mutant-vs-wildtype.vcf). 18,000 vs 100. Is this expected, or am I using it wrongly?

dysgu run -x -p8 --sites  mutant-vs-wildtype.vcf \
reference-genome.fa \
./tmp \
wildtype.bam \
> wildtype.re_geno.vcf

Another observation is that the process took much longer than the original SV calling (double), despite specifying double the processors to use. Thank you very much for your help, the stdout/stderr is belwo.

2023-11-30 18:01:45,225 [INFO ] [dysgu-run] Version: 1.6.2 2023-11-30 18:01:45,225 [INFO ] run -x -p8 --sites mutant-vs-wildtype.vcf reference.fa ./tmp wildtype.bam 2023-11-30 18:01:45,225 [INFO ] Destination: ./tmp 2023-11-30 18:03:39,600 [INFO ] dysgu fetch wildtype.bam written to ./tmp/wildtype.dysgu_reads.bam, n=17391638, time=0:01:54 h:m:s 2023-11-30 18:03:39,601 [INFO ] Input file is: ./tmp/wildtype.dysgu_reads.bam 2023-11-30 18:03:39,662 [INFO ] Sample name: wildtype 2023-11-30 18:03:39,662 [INFO ] Writing vcf to stdout 2023-11-30 18:03:39,662 [INFO ] Running pipeline 2023-11-30 18:03:40,117 [INFO ] Calculating insert size. Removed 3 outliers with insert size >= 2187 2023-11-30 18:03:40,124 [INFO ] Inferred read length 151.0, insert median 99, insert stdev 205 2023-11-30 18:03:40,125 [INFO ] Max clustering dist 1124 2023-11-30 18:03:40,125 [INFO ] Reading --sites mutant-vs-wildtype.vcf 2023-11-30 18:03:40,148 [INFO ] Building graph with clustering 1124 bp 2023-11-30 18:13:31,832 [INFO ] Total input reads 17391638 2023-11-30 18:13:32,277 [INFO ] Added 99 variants from input sites 2023-11-30 18:13:32,650 [INFO ] Graph constructed 2023-11-30 18:13:32,651 [INFO ] Minimum support 3 2023-11-30 18:19:45,004 [INFO ] Number of components 12070167. N candidates 3391085 2023-11-30 18:19:45,194 [INFO ] Number of matching SVs from --sites 109 2023-11-30 18:33:38,783 [INFO ] Re-alignment of soft-clips done. N candidates 34699 2023-11-30 18:33:39,130 [INFO ] Number of candidate SVs merged: 2885 2023-11-30 18:33:39,130 [INFO ] Number of candidate SVs after merge: 31814 2023-11-30 18:33:39,169 [INFO ] Loaded n=9 chromosome coverage arrays from ./tmp 2023-11-30 18:33:48,265 [INFO ] Loading Model: /home/norman/miniconda3/envs/dysgu/lib/python3.10/site-packages/dysgu/dysgu_model.1.pkl.gz /home/norman/miniconda3/envs/dysgu/lib/python3.10/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator LabelEncoder from version 0.23.2 when using version 1.3.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations warnings.warn( 2023-11-30 18:33:48,486 [INFO ] Model config: pe, diploid: True, contig features: True. N features: 43 2023-11-30 18:33:54,382 [INFO ] dysgu call ./tmp/wildtype.dysgu_reads.bam complete, n=17929, time=0:30:14 h:m:s 2023-11-30 18:33:54,594 [INFO ] dysgu run wildtype.bam complete, time=0:32:09 h:m:s

kcleal commented 10 months ago

Hi @warthmann,

The documentation is not very clear, apologies. When you filter against a normal sample, the calls returned should be unique to the mutant line. In theory this should mean all the sites in the normal are 0/0. In practise, some sites may have low support. When you supply a --sites file, the whole genome is analysed, not just the regions around the input sites. Any variant in the --sites file will make SV detection more sensitive around those sites, but all SVs from the genome will still be reported, which is consistent with your output. I think the behaviour you were expected was a bit different - using dysgu to only re-genotype a file (rather than discovery as well). If you only want re-genotyping around the input sites, you can provide a bed file of regions to search using the --search option. Hope that helps

warthmann commented 10 months ago

hello, thank you for this clarification. Regarding the --search flag in dysgu run. I can parse the required positions from the vcf. converting this into a .bed file, will I need to pad the positions in some way or can you recommend using the (same) position of the variant as BEGIN and END in the .bed? Alternatively, can I suggest including a feature similar to --all-sites for dysgu --filter? I.e., such that upon set flag dysgu --filter returns a vcf that also contains entries for a list of specified samples, or all samples for which a bam file is provided? thanks a lot

kcleal commented 10 months ago

That is a good suggestion, I can try and add that in the next release.

In the meantime, you will need to parse the dysgu vcf to extract the regions surrounding each breakpoint. I recommend using pysam for this. The pad should be similar to the insert size: Pseudo code would be along the lines

import pysam

vcf = pysam.VariantFile('dysgu.vcf')
outbed = open('out.bed', 'w')
insert = 200

for v in vcf.fetch():
    outbed.write(f"{v.chrom}\t{v.pos - insert}\t{v.pos + insert}\n")
    if r.info['SVTYPE'] != 'TRA':
        outbed.write(f"{v.chrom}\t{v.end - insert}\t{v.end + insert}\n")
    else:
        outbed.write(f"{v.info['CHR']}\t{v.info['CHR2_POS'] - insert}\t{v.info['CHR2_POS'] + insert}\n")

Dysgu will sort and merge intervals as needed.