pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

Potential Bug: Miss Assigned Contig name for variants by VariantFile #1210

Open gudeqing opened 1 year ago

gudeqing commented 1 year ago

Hi, I am using the latest pysam with version 0.21.0, and encountered a bug in the following story:

  1. I got the Vcf output from Vardict(1.8.3), and the contigs information are missed.
  2. I wrote a simple python script to add contig information into the vcf:
    # update header
    for line in contig_info:
        if type(line) == list:
            header.contigs.add(line[0], length=line[1],)
        else:
            header.add_line(line)
    # write out new vcf
    vcf_out = pysam.VariantFile(out, "w", header=header)
    vcf_out.write(record)
    for r in vcf:
        vcf_out.write(r)
    vcf_out.close()
  3. Bctools reported errors such as b"Reference allele mismatch at 1:25469502 .. REF_SEQ:'G' vs VCF:'C'\n"
  4. Thus I checked the vcf and found that some contig names were wrongly assigned by pysam.VariantFile, such as "2 -> 1" or "10 ->12".
  5. Howerver, I never met this problem using the same code days ago.
  6. I realized that this is a special case, that is my vcf is from a small panel which only involves a few contigs.
  7. Potential Bug here: My vcf did not involve all contigs (such as contig "6" is not contained) and when I added all the human contigs into the vcf, pysam may wrongly update the contig names of records and output the wrong result.

Best regards!