rwdavies / QUILT

GNU General Public License v3.0
45 stars 10 forks source link

sorted and duplicated position #23

Open fengxiuming opened 1 year ago

fengxiuming commented 1 year ago

Hello, I have problem about the sorted and duplicated position when conduct QUILT_prepare_reference.R My code is: (according to UILT_hla_reference_panel_construction.Md) oneKG_vcf_name=CCDG_14151_B01_GRM_WGS_2020-08-05_chr6.filtered.shapeit2-duohmm-phased.vcf.gz

bcftools view --output-file oneKG234.temp.vcf.gz --output-type z --min-alleles 2 --max-alleles 2 --types snps ${oneKG_vcf_name} chr6:25000000-34000000

bcftools convert --haplegendsample oneKG oneKG234.temp.vcf.gz

~/QUILT/project/QUILT/QUILT_prepare_reference.R \ --outputdir=quilt_output \ --chr=chr6 \ --nGen=100 \ --reference_haplotype_file=oneKG.hap.gz \ --reference_legend_file=oneKG.legend.gz \ --genetic_map_file=~/QUILT/database/lifeover/CEU/CEU-chr6-final.b38.txt.gz \ --regionStart=25000000 \ --regionEnd=34000000 \ --buffer=10000

Error in validate_pos(decoy_legend, chr = chr, stop_file_name = paste0("The reference legend file ", : The reference legend file oneKG.legend.gz column 2 needs to be sorted on position with increasing positions between rows but row number 75 has position 25003153 and row number 76 has position 25003153

And I also tried to sort and remove duplicated position according to bcftool.html

bcftools sort oneKG.temp.vcf.gz --output sorted.vcf.gz bcftools norm sorted.vcf.gz -d none --output filter.vcf.gz

And the QUILT_prepare_reference.R still failed to prepare reference because: column 2 needs to be sorted on position with increasing positions between rows but row number 75 has position 25003153 and row number 76 has position 25003153.

What should I do? I would be very grateful if you could help me.

Of course, I have downloaded the reference data from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html which have available data about the reference data about hap, legend, sample file. But the error still exist.

fengxiuming commented 1 year ago

the code is: < ~/QUILT/project/QUILT/QUILT_prepare_reference.R \ --outputdir=quilt_output \ --chr=chr6 \ --nGen=100 \ --reference_haplotype_file=oneKG.hap.gz \ --reference_legend_file=oneKG.legend.gz \ --genetic_map_file=~/QUILT/database/lifeover/CEU/CEU-chr6-final.b38.txt.gz \ --regionStart=25000000 \ --regionEnd=34000000 \ --buffer=10000>

fengxiuming commented 1 year ago

Sorry, I do not why the code was crossed

atrigila commented 6 months ago

Hi @rwdavies, I have encountered a similar error:

The reference legend file test.legend.gz  column 2 needs to be sorted on position with increasing positions between rows but row number 231 has position 176121 and row number 232 has position 176121

In my case, the error is related to the reference legend file (test.legend.gz), specifically mentioning an issue with a duplicate position for a multi-allelic SNP.

The legend file contains entries like:

11:176121_A_G 176121 A G
11:176121_AC_A 176121 AC A

It seems that the tri-allelic SNPs have been recoded as bi-allelic SNPs, causing a duplicate position in the legend file. What would you suggest is the best approach for these cases? I was thinking of simply removing all duplicated coordinates in the VCF, but perhaps there is a better solution.

Thank you!

rwdavies commented 6 months ago

Yes apologies, at this moment there is no better way to handle it than either remote the site entirely, or choose only a single variant to consider. It would be possible to re-write the code to handle this case, but I sadly don't have the bandwidth to make that change

atrigila commented 6 months ago

No worries! Thank you for your prompt reply!