stat-lab / MOPline

Detection and genotyping of structural variants
MIT License
15 stars 5 forks source link

supplementing missing calls(SMC) step #12

Closed jingydz closed 4 months ago

jingydz commented 6 months ago

Hi, when I running the smc step, I find there are some warnings in the 'smc_out.vcftool.log' file.

$ grep "arning" smc_out.vcftool.log Warning: Input is unsorted, results may not be complete. Warning: Input is unsorted, results may not be complete. Warning: Input is unsorted, results may not be complete. Warning: Input is unsorted, results may not be complete. Warning: Input is unsorted, results may not be complete. ...

The error message indicates that my joint/jointcall.All-samples.vcf file is not sorted, but upon inspection, the file appears to be sorted. I also used the vcf-sort software to sort the file again, resulting in the joint/jointcall.All-samples.sort.vcf file, which is identical to the original. However, when running the command mopline smc -v joint/jointcall.All-samples.sort.vcf -ts 7tools -od smc -p smc_out --build 38 -n 6, I still receive these warnings. Why is this happening?

jingydz commented 6 months ago

I understand that during the filtering step, some structural variation (SV) sites are removed (approximately 3,000), but why do I find that a small number of SV sites (about 1,000) are also deleted after running smc?

jingydz commented 6 months ago

joint/jointcall.All-samples.sort.vcf chr1 1565631 . . . PASS SVTYPE=DUP;SVLEN=97;END=1565727;SC=233;AF=0.2126;AC=233;AN=1096;DPR=1.56;SR=0.37 GT:GQ:VP:VL:DR:DS:SR ./.:0:1565630:97:1.35:0:0.39 0/0:0:0:0:0:0:0

smc/smc_out.vcf chr1 1565631 . . . PASS SVTYPE=DUP;SVLEN=97;END=1565727;SC=503;AF=0.6387;AC=700;AN=1096;DPR=1.39;SR=0.38;DPS=0.14 GT:GQ:VP:VL:DR:DS:SR:MC 1/1:39:1565630:97:1.35:0:0.39:0 1/1:0:1565631:97:1.25:0.5:0.36:3

I am also puzzled about the process of smc. For the ./. type, I can understand why genotypes exist, but why does 0/0 also become 1/1? Or why does 0/0 become 0/1?

stat-lab commented 6 months ago

In the joint call step, the SV genotype information is derived from the selected confident genotypes of SV/genotype calling tools used in the initial SV detection step. So, the genotypes of many SVs are undefined as './.'. In the SMC step, all SV genotypes are evaluated using the information for DPR, DR, DS, and previously defined genotypes. So, '0/0' and './.' genotypes defined in the joint call step could be '0/1' or '1/1' in the SMC step. SV sites with '0/0' genotype for all samples or with unbalanced 5'- and 3'-split ends for > 50% samples are deleted in the SMC step.

jingydz commented 6 months ago

Hi, w hen I run smc to recover missing values, I always run out of memory (the running node has 2TB of memory). Can I split the samples or chromosomes in some way?

mopline smc -v joint/jointcall.All-samples.sort.vcf -ts 7tools -od smc -p smc_out --build 38 -n 20

29G Mar 8 17:41 jointcall.All-samples.sort.vcf

command: smc Args: -v joint/jointcall.All-samples.sort.vcf -ts 7tools -od smc -p smc_out --build 38 -n 20 mopline smc: /xxx/software/MOPline/scripts/genotype_SV_SMC_7.4.pl -v joint/jointcall.All-samples.sort.vcf -ts 7tools -od smc -p smc_out --build 38 -n 20 Total sample: 8431

Total length of repeats: 295768357

1st step completed: 17: Sample re-searching completed: ... 2nd step completed: Missing calls (MC): 1851419577 MC recovered from the original SV calls: 26847495 3rd step completed: Out of memory!

May I ask why this is happening?

stat-lab commented 6 months ago

Your input vcf file in the SMC step seems very large. So, you should conduct the SMC step for each chromosome by specifying the -c option. It would be better to run the jobs for each cromosome in batch or with a job management system.

jingydz commented 6 months ago

fileformat=VCFv4.0

fileDate=20240202

source=MOPline-1.7

reference=hs38

INFO=

INFO=

INFO=

INFO=

ALT=

ALT=

ALT=

ALT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT= 0.8 for DEL, < 1.1 for DUP) in DEL and DUP">

FORMAT=

FORMAT=

FORMAT=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

The content above is the header from the file smc/annotate.AS.annot.filter.vcf that I obtained after running mopline. I would like to ask why the header does not include information for 'END', 'SC', 'AF', 'DPR', 'SR', etc., which is causing me to be unable to process it with VCF processing software. Can I add such parameters to the header myself? for example:

INFO=

INFO=

INFO=

INFO=

INFO=

stat-lab commented 6 months ago

Sorry for your trouble. A bug in the merge_SV_calls_ALLtype.pl script was fixed. This ploblem can be avoided using the updated scripts, but runs from step-1 or SMC-step are required. So, you may add the following header lines to the header of your final vcf file.

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=