keylabivdc / VIP


"gi" headers deprecated by NCBI #14

Open tony-travis opened 4 years ago

The current NCBI FASTA format uses "gb" (GenBank) headers rather than "gi" headers. vip_db_format.pl only matches "gi" headers, so it finds no records and produces an empty /work/VIP/FAST/vip_fast.fa.formatted file. The following patch extracts the accessions instead of "gi" numbers:

diff -Naur vip_db_format.pl.dist vip_db_format.pl
--- vip_db_format.pl.dist   2020-01-29 12:45:01.243718605 +0100
+++ vip_db_format.pl    2020-01-30 23:04:16.116854880 +0100
@@ -1,4 +1,5 @@
 #!/usr/bin/perl -w
+#@(#)vip_db_format.pl  2020-10-30  last modified by A.J.Travis
 #
 #  vip_db_format.pl
 #
@@ -27,7 +28,7 @@

 while (<FL>) {
    chomp;
-   if (/.*(gi\|[0-9]*\|).*?\n(.*)/si) {
+   if (/.*(gb\|[^|]*\|).*?\n(.*)/si) {
    #if (/.*?(gi\|[0-9]*\|).*?\n(.*)/si) {
        my $gi = lc($1);
        my $seq = $2;
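
To check what the patched pattern captures, here is a minimal standalone sketch. It assumes each record reaches the match as a header line followed by the sequence lines, which is what the `/s` pattern suggests. The helper name `extract_gb_tag`, the whitespace stripping, and the sample record are hypothetical and are not part of vip_db_format.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in vip_db_format.pl): split one FASTA record
# into a lower-cased "gb|ACCESSION|" tag and its sequence text.
sub extract_gb_tag {
    my ($record) = @_;

    # Same pattern as the patched line: capture "gb|...|" from the header,
    # then everything after the first following newline as the sequence.
    if ($record =~ /.*(gb\|[^|]*\|).*?\n(.*)/si) {
        my ($tag, $seq) = (lc($1), $2);
        $seq =~ s/\s+//g;    # assumption for this sketch: join wrapped lines
        return ($tag, $seq);
    }
    return;                  # no "gb" tag found: empty list
}

# Made-up record, for illustration only.
my $record = "gb|AB000001.1| Example virus\nACGT\nTTGA\n";
my ($tag, $seq) = extract_gb_tag($record);
print "$tag\t$seq\n";        # prints "gb|ab000001.1|", a tab, then "ACGTTTGA"
```

Note that `lc($1)` lower-cases the accession as well as the "gb" prefix, so anything that later looks up the tag would need to compare case-insensitively.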