hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License
51 stars 16 forks source link

Trouble with VCFs #10

Open valflanza opened 3 years ago

valflanza commented 3 years ago

Using GSAlign with several genomes against the same reference I've found that some SNPs (same contig, same position) have different reference sequence.

Example:

GCA_002119955.1_KP024.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE GCA_002271285.1_NU-CRE265.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE GCA_002850885.1_COL-Kpn34.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . G T 100 TYPE=SUBSTITUTE GCA_002850935.1_COL-Kpn33.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . G T 100 TYPE=SUBSTITUTE GCA_002851435.1_COL-Kpn35.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE GCA_002851555.1_COL-Kpn21.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . G T 100 TYPE=SUBSTITUTE GCA_002854015.1_COL-Kpn110.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE GCA_002854575.1_COL-Kpn58.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE GCA_002855255.1_COL-Kpn62.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE GCA_002855295.1_COL-Kpn59.fna.output.vcf:gnl|N|EuSCAPE_TR282_1 2576 . C A 100 TYPE=SUBSTITUTE

hsinnan75 commented 3 years ago

Hi, I think it is because they are due to alignments with different orientation. Some are with forward strand, and some with reverse. I'll fix this problem. Thank you!

erya-song commented 2 years ago

Hi, Dr. Hsin-Nan Lin, have you fixed this problem yet? The same problem to me.

Joseph-Vineland commented 2 years ago

I used the following script to correct said error:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use fixRefVCF.pl <GENOME_FASTA> <VCF> <FIX_VCF>\n";

my $genome = shift @ARGV;
my $orig_vcf = shift @ARGV;
my $new_vcf = shift @ARGV;

my %seq = ();
my ($chr, $pos, $ref, $alt, $obs);

warn "loading genome sequence from $genome\n";
open (my $fh, "<", $genome) or die "cannot read $genome\n";
while (<$fh>) {
    chomp;
    if (/>(\w+)/) {
        $chr = $1;
    } else {
       $seq{$chr} .= $_;
    }
}
close $fh;

open (my $vh, "<", $orig_vcf) or die "cannot read $orig_vcf\n";
open (my $oh, ">", $new_vcf) or die "cannot write $new_vcf\n";
while (<$vh>) {
    if (/^#/) {
         print $oh $_;
    } else {
        chomp;
        my @var = split (/\t/, $_);
        $chr = $var[0];
        $pos = $var[1]; # VCF is 1-based, Perl string is 0-based
        $ref = uc $var[3];
        $alt = uc $var[4];
        $obs = uc substr($seq{$chr}, $pos - 1, length $ref);
        if ($ref ne $obs) {
            warn "editing $chr:$pos $ref->$obs\n";
            $var[3] = $alt;
            $var[4] = $ref;
        }
        print $oh join "\t", @var;
        print $oh "\n";
    }
}
close $vh;
close $oh;