samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

bcftools sort problem? #2103

Closed SZ-qing closed 5 months ago

SZ-qing commented 5 months ago

I have a VCF file in which the chromosomes are not in order, so I want to sort it before building an index with tabix. However, after running bcftools sort, the POS column appears to start counting from 1 again, which puts it out of step with the RSPOS information. Command line:
bcftools sort --temp-dir ./tmp/ GCF_000001405.39.renamed.vcf -o GCF_000001405.39.renamed.sorted.vcf

Before sort data: image

After sort data: image

My bcftools version is: 1.18-15-g21755519 (using htslib 1.18)
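
For reference, tabix can only index bgzip-compressed files, so a sort that is meant to feed an index usually writes compressed output. Below is a minimal sketch, assuming bcftools is on the PATH; the function name `sort_and_index` and the output file name are hypothetical and not taken from the thread.

```shell
# Sketch: sort a VCF, write bgzip-compressed output, and build a tabix (.tbi) index.
# Assumes bcftools is installed; `sort_and_index` is an illustrative name.
sort_and_index() {
    local in_vcf="$1" out_vcf_gz="$2" tmp_dir="${3:-./tmp}"
    # Make sure the temporary directory exists before sorting
    mkdir -p "$tmp_dir" || return 1
    # -O z writes bgzip-compressed VCF; --temp-dir holds intermediate chunks
    bcftools sort --temp-dir "$tmp_dir" -O z -o "$out_vcf_gz" "$in_vcf" || return 1
    # -t builds a .tbi (tabix) index instead of the default .csi
    bcftools index -t "$out_vcf_gz"
}

# Usage (file names follow the thread, with a .gz suffix added):
#   sort_and_index GCF_000001405.39.renamed.vcf GCF_000001405.39.renamed.sorted.vcf.gz
```

Running `tabix -p vcf` on the compressed file is an alternative to `bcftools index -t` for the last step.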

pd3 commented 5 months ago

Uh, that's very odd indeed. Can you share a test case to reproduce the problem, please?

SZ-qing commented 5 months ago

The VCF file is from the dbSNP database (build b155):
wget https://ftp.ncbi.nih.gov/snp/archive/b155/VCF/GCF_000001405.39.gz

The chromosome IDs in this file are in RefSeq format, and I need to convert them to the usual names such as 1, 2, 3, etc. I downloaded the corresponding conversion data from NCBI and used bcftools annotate to rename the chromosomes. At this stage all the POS information is normal:
bcftools annotate --rename-chrs processed_id_data.txt GCF_000001405.39.gz -o GCF_000001405.39.renamed.vcf

processed_id_data.txt: image

GCF_000001405.39.renamed.vcf: image
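
For readers following along: `--rename-chrs` takes a file with one "old_name new_name" pair per line, separated by whitespace. The sketch below shows one way such a file might be derived from an NCBI assembly report. The column positions used here (sequence name in column 1, role in column 2, RefSeq accession in column 7) are an assumption about that report's layout and should be checked against the downloaded file; the function name `make_rename_map` is hypothetical.

```shell
# Sketch: build a two-column rename map (RefSeq accession -> chromosome name).
# ASSUMPTION: tab-separated report with name in col 1, role in col 2,
# RefSeq accession in col 7; verify against your own file.
make_rename_map() {
    awk -F'\t' '
        { sub(/\r$/, "") }   # drop a trailing carriage return, if present
        # keep only full chromosomes; scaffolds keep their original names
        !/^#/ && $2 == "assembled-molecule" && $7 != "na" { print $7 "\t" $1 }
    ' "$1"
}

# Usage (the report file name is a placeholder):
#   make_rename_map assembly_report.txt > processed_id_data.txt
```

Contigs that are not listed in the map are left unchanged by the rename step.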

The next step is to sort using bcftools sort:
bcftools sort --temp-dir ./tmp/ GCF_000001405.39.renamed.vcf -o GCF_000001405.39.renamed.sorted.vcf

GCF_000001405.39.renamed.sorted.vcf: image

At this point the POS values appear to restart from 1.
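
One way to test whether sorting really changed any POS value is to compare the set of (CHROM, POS, ID) triples before and after the sort, ignoring order. This is a minimal sketch using standard Unix tools on uncompressed VCFs; the function name `records_match` is hypothetical.

```shell
# Sketch: succeed if two uncompressed VCFs contain the same
# CHROM/POS/ID triples, regardless of record order.
records_match() {
    local t1 t2 rc
    t1=$(mktemp) && t2=$(mktemp) || return 2
    # Drop header lines, keep columns 1-3, sort bytewise for a stable comparison
    grep -v '^#' "$1" | cut -f1-3 | LC_ALL=C sort > "$t1"
    grep -v '^#' "$2" | cut -f1-3 | LC_ALL=C sort > "$t2"
    cmp -s "$t1" "$t2"
    rc=$?
    rm -f "$t1" "$t2"
    return "$rc"
}

# Usage:
#   records_match GCF_000001405.39.renamed.vcf GCF_000001405.39.renamed.sorted.vcf \
#       && echo "same records" || echo "records differ"
```

If the triples match, the sort only reordered the records, and any apparent jump in POS comes from which records now sit next to each other.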

SZ-qing commented 5 months ago

I'm very sorry. I found that in these dbSNP data the same RS ID can correspond to multiple POS values, so bcftools is fine. Thanks!
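
For anyone hitting the same confusion, the check can be sketched as follows: list every identifier that occurs at more than one distinct CHROM/POS. It assumes the rs identifier is in the ID column (column 3) of an uncompressed VCF; the function name `multi_pos_ids` is hypothetical.

```shell
# Sketch: print each ID (column 3) that appears at more than one
# distinct CHROM/POS in an uncompressed VCF.
multi_pos_ids() {
    grep -v '^#' "$1" \
        | cut -f1-3 \
        | LC_ALL=C sort -u \
        | cut -f3 \
        | grep -v '^\.$' \
        | LC_ALL=C sort \
        | uniq -d
}

# Usage:
#   multi_pos_ids GCF_000001405.39.renamed.vcf | head
```

The pipeline first collapses exact duplicate CHROM/POS/ID rows, then drops records with a missing ID, so only identifiers placed at genuinely different positions are reported. It relies on `sort` rather than an in-memory table, which should scale better on a file the size of dbSNP.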