statgen / Minimac4

GNU General Public License v3.0
54 stars 17 forks source link

Inconsistent ploidy and segmentation fault when creating 1000GP chrX.PAR[1-2].msav #74

Open youngchanpark opened 3 months ago

youngchanpark commented 3 months ago

Hi,

I split the 1000GP chrX VCF into nonPAR, PAR1, and PAR2 regions then tried creating .msav files with these but I encountered some issues. Below is the code I used.

#/usr/bin/env bash
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz.tbi

# Borders for nonPAR checked manually
# PAR1 - nonPAR border variant
# chrX    2781514
# nonPAR PAR2 border variant
# chrX    155700882
# The above two variants are nonPAR variants. So the nonPAR vcf should start and end with these two variants included.

echo -e "chrX\t2531036\t2531036\tHGSV_248479\nchrX\t2777488\t2777488\tHGSV_249395" > exclude_ids.bed
# these two <DUP> and <DEL> variants in PAR1 sneaks into the nonPAR if we don't specifically exclude
bcftools view -r chrX:2781514-155700882 -T ^exclude_ids.bed 1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz -Oz -o chrX.nonPAR.vcf.gz
tabix 1kGP_high_coverage_Illumina.chrX.nonPAR.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

bcftools view -r chrX:1-2781513 1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz -Oz -o chrX.PAR1.vcf.gz
tabix chrX.PAR1.vcf.gz

bcftools view -r chrX:155700883- 1kGP_high_coverage_Illumina.chrX.filtered.SNV_INDEL_SV_phased_panel.v2.vcf.gz -Oz -o chrX.PAR2.vcf.gz
tabix chrX.PAR2.vcf.gz

# chrX nonPAR (OK)
minimac4 --compress-reference chrX.nonPAR.vcf.gz > chrX.nonPAR.msav

# chrX PAR1 (Inconsistent ploidy)
minimac4 --compress-reference chrX.PAR1.vcf.gz > chrX.PAR1.msav

# chrX PAR2 (Segfault)
minimac4 --compress-reference chrX.PAR2.vcf.gz > chrX.PAR2.msav

The outputs of minimac4 were:

minimac v4.1.6

Error: sample ploidy is not consistent
Error: compressing variant failed
minimac v4.1.6

Mean Compression Ratio: 0.0399954
Min Compression Ratio: 0.0240718
Max Compression Ratio: 0.0915209
minimac v4.1.6

./create_xchr_msavs.sh: line 32: 3385400 Segmentation fault      (core dumped) 

msav for the nonPAR region was created as expected.

All samples in the PAR1 and PAR2 vcfs are either haploid or diploid for all variants. I assumed this would qualify as consistent ploidy, but PAR1 failed with inconsistent ploidy message. Is my understanding wrong? Does all samples within the VCF need to be either haploid or diploid?

I'm also getting a Segmentation fault for PAR2.

jonathonl commented 3 months ago

Thanks for this thorough report. When you extract nonPAR and PAR2 from this VCF, you are getting some structural variants that overlap the PAR boundaries. This is causing mixed ploidy. The PAR1 compression seems to be running fine. You will need to add -i 'POS>=2781514 to the nonPAR bcftools command and -i 'POS>=155700883' to the PAR2 bcftools command.

I will need to fix the segfault so that it instead prints an error message.