malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License
470 stars 47 forks source link

`updategff` produces wrong output if parts of the scaffolds are used in the AGP file #188

Open taprs opened 1 month ago

taprs commented 1 month ago

Hi!

Not sure if that was the indended use of the updategff command (or even the AGP format), but I was trying to update annotation based on an AGP file where I introduced a break into the chromosome and got corrupted GFF file. Minimal example follows:

echo $'## gff-version 3.1\nbibaboba\ttest\tgene\t5\t6\t.\t+\t.\tID=sas' > example.gff
echo $'## agp-version 2.1\nbiba\t1\t4\t1\tW\tbibaboba\t1\t4\t+\nboba\t1\t4\t1\tW\tbibaboba\t5\t8\t+' > example.agp

$cat  example.gff

## gff-version 3.1
bibaboba        test    gene    5       6       .       +       .       ID=sas

$cat example.agp

## agp-version 2.1
biba    1       4       1       W       bibaboba        1       4       +
boba    1       4       1       W       bibaboba        5       8       +

Obviously, positions 5 and 6 of bibaboba should then be converted into positions 1 and 2 of boba. But what I get is

$ ragtag.py updategff example.gff example.agp
Fri Sep 13 11:10:44 2024 --- VERSION: RagTag v2.1.0
Fri Sep 13 11:10:44 2024 --- CMD: ragtag.py updategff example.gff example.agp
## gff-version 3.1
boba    test    gene    5       6       .       +       .       ID=sas
Fri Sep 13 11:10:44 2024 --- INFO: Goodbye

The coordinates do not get transformed.

For completeness, I attach the solution that leaves me with the correct coordinates (for BED files though):

#!/usr/bin/gawk -f
 BEGIN{
 help="\
    This script updates a BED file using the transformation defined by an AGP file. \n\
    That is, BED in coordinates of contigs gets transformed to coordinates of scaffolds. \n\
    Pleaze bgzip and tabix the BED file before operation. \n\
    EXAMPLE: updategff.awk file.agp file.bed.gz > file_transform.bed"
   if (ARGC < 2) {print help; exit 1}
   OFS="\t"
 }

 # Exploit tabix to query bed files

 NR==FNR && !/^#/ && $5=="W" {
   cmd="tabix " ARGV[2] " " $6":"$7"-"$8
   while ( cmd | getline bedline > 0 ) {
    split(bedline, bed)
    if (bed[2] < $7 || bed[3] > $8) { next }
    if ($9=="+"){
     printf "%s\t%s\t%s", $1, bed[2]-$7+$2, bed[3]-$7+$2
    } else {
     printf "%s\t%s\t%s", $1, $8-bed[3]+$2, $8-bed[2]+$2                                                                                                                           }
   if (length(bed) > 3) {
     for (i=4;i<=length(bed);i++) {
      printf "\t%s", bed[i]
      print ""
     }
    }
   }
   close(cmd)
 }