gmbecker / genbankr

http://bioconductor.org/packages/devel/bioc/html/genbankr.html

Join issue #20

Open fpietluch opened 1 year ago

fpietluch commented 1 year ago

Hi, I am parsing many GenBank files and need to trim sequences by their coordinates (START, END). When the location looks like 3447..3578, complement(3102..3323), join(3447..3578,3447..3578), or complement(join(3102..3323,3500..3722)), everything is fine. The problem begins when the parts of a gene lie on both strands, which is written in the file like this:

CDS join(61784..61897,complement(99364..99595),
complement(98797..98822))
/gene="rps12"
/locus_tag="AZ333_gp051"
/trans_splicing
/note="trans splicing of 5'rps12 exon and 3'rps 12 exon"

The + strand part (61784..61897) is lost, the first complement is read correctly, and the second complement (which is on a new line in the file) has no start coordinate. R returns a warning: Warning message: In FUN(X[[i]], ...) : NAs introduced by coercion

How can I deal with this? Is there a way to edit the GenBank file so that the coordinates are read properly?

fpietluch commented 1 year ago

This location is also read incorrectly: join(complement(74020..74133),145788..146019, 146556..146581)

All three ranges are marked as "-", although only the first one is on the minus strand. This is a serious mistake, and it happens silently, without any error.
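
For reference, the per-range strands expected for the locations in this issue can be worked out with a small recursive parser. What follows is a minimal Python sketch, independent of genbankr and not its actual parsing code. The function names `split_top_level` and `parse_location` are hypothetical, and the sketch assumes only plain `N..M` ranges (or a single position `N`), `join(...)` and `complement(...)`, with no fuzzy ends (`<`, `>`) and no remote references. Whitespace is removed first, so a location wrapped onto a second line parses the same as one on a single line.

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def parse_location(loc, strand="+"):
    """Return (start, end, strand) tuples in the order they are written.

    Hypothetical helper for illustration only; it does not cover the
    full GenBank location grammar.
    """
    # Wrapped location lines leave whitespace after commas; drop it all.
    loc = re.sub(r"\s+", "", loc)

    if loc.startswith("complement(") and loc.endswith(")"):
        # complement() flips the strand of everything nested inside it.
        flipped = "-" if strand == "+" else "+"
        return parse_location(loc[len("complement("):-1], flipped)

    if loc.startswith("join(") and loc.endswith(")"):
        # Each part of a join is parsed on its own, so parts keep their
        # own strand instead of inheriting one from a sibling.
        ranges = []
        for part in split_top_level(loc[len("join("):-1]):
            ranges.extend(parse_location(part, strand))
        return ranges

    match = re.fullmatch(r"(\d+)(?:\.\.(\d+))?", loc)
    if match is None:
        raise ValueError(f"unsupported location: {loc!r}")
    start = int(match.group(1))
    end = int(match.group(2) or start)
    return [(start, end, strand)]


if __name__ == "__main__":
    examples = [
        "join(61784..61897,complement(99364..99595), complement(98797..98822))",
        "join(complement(74020..74133),145788..146019, 146556..146581)",
        "complement(join(3102..3323,3500..3722))",
    ]
    for example in examples:
        print(example)
        for start, end, strand in parse_location(example):
            print(f"  {start}..{end} ({strand})")
```

Under these assumptions, the first rps12 location gives one "+" range followed by two "-" ranges, and the second location gives one "-" range followed by two "+" ranges. The ranges are returned in the order written, so the order in which the pieces should be concatenated when extracting sequence (in particular for `complement(join(...))`) still has to be handled separately.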