nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Identical CDS features and Splice Variant Locus Tag Assignment Question #985

Closed StephenHarding-USDA closed 7 months ago

StephenHarding-USDA commented 7 months ago

Hello,

When we include RNA data via the train module, we get, as expected, multiple CDS features per locus, one for each splice variant. However, they all share the same locus tag (see image below). This becomes a problem when we pass those data to antiSMASH via the funannotate remote module, because antiSMASH will not complete a prediction on a dataset with multiple CDS features that have the same locus tag. In the image below, the protein_ids are unique, but the locus_tags are not.

I have searched the funannotate manual and may have missed it, but is there a way to configure funannotate to generate unique locus tag IDs for splice variants and assign them in the output?

Screenshot 2023-11-29 153420

Thanks in advance.

nextgenusfs commented 7 months ago

Hi @StephenHarding-USDA. The real issue here is that GenBank format has no mechanism to link CDS features to mRNA features. In GFF3, CDS and exon features are children of an mRNA feature, which is in turn a child of a gene feature (locus_tag). In GenBank format, however, they are all children of the gene feature (locus_tag). So the output here is correct: all alternative transcripts should come from the same locus_tag. The parsing problem with GenBank format is that you have no way of matching a CDS feature to its proper mRNA feature. Even more frustrating, when tbl2asn creates GenBank files it does not write the features in a reproducible order, i.e. you cannot rely on the first mRNA feature matching up with the first CDS feature from a given locus.
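To illustrate the GFF3 side of this, here is a minimal Python sketch (standard library only) that groups CDS rows by their `Parent` attribute, which is the link GenBank format lacks. The example records, the `FUN_000001` locus name, and the helper names `parse_attributes` and `cds_by_transcript` are hypothetical and only there for illustration; this is not code from funannotate.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def cds_by_transcript(gff3_lines):
    """Map each parent transcript ID to the (start, end) pairs of its CDS rows."""
    groups = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # A GFF3 feature may list several parents, separated by commas.
        for parent in parse_attributes(cols[8]).get("Parent", "").split(","):
            if parent:
                groups[parent].append((int(cols[3]), int(cols[4])))
    return dict(groups)


# Hypothetical records: one gene (locus) with two alternative transcripts.
EXAMPLE_GFF3 = [
    "##gff-version 3",
    "scaffold_1\tfunannotate\tgene\t100\t898\t.\t+\t.\tID=FUN_000001;",
    "scaffold_1\tfunannotate\tmRNA\t100\t898\t.\t+\t.\tID=FUN_000001-T1;Parent=FUN_000001;",
    "scaffold_1\tfunannotate\tCDS\t100\t399\t.\t+\t0\tID=FUN_000001-T1.cds;Parent=FUN_000001-T1;",
    "scaffold_1\tfunannotate\tCDS\t500\t898\t.\t+\t0\tID=FUN_000001-T1.cds;Parent=FUN_000001-T1;",
    "scaffold_1\tfunannotate\tmRNA\t100\t898\t.\t+\t.\tID=FUN_000001-T2;Parent=FUN_000001;",
    "scaffold_1\tfunannotate\tCDS\t100\t399\t.\t+\t0\tID=FUN_000001-T2.cds;Parent=FUN_000001-T2;",
    "scaffold_1\tfunannotate\tCDS\t599\t898\t.\t+\t0\tID=FUN_000001-T2.cds;Parent=FUN_000001-T2;",
]

CDS_GROUPS = cds_by_transcript(EXAMPLE_GFF3)
```

Each CDS row names its transcript explicitly, so the grouping does not depend on the order of the rows. A GenBank feature table offers only the shared locus_tag, which is why the same grouping cannot be recovered reliably there.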

Okay, so how do you get this to work with antiSMASH? First, I'd try submitting the FASTA + GFF3 from funannotate if possible; I think you can test this on their web server. I haven't touched the funannotate remote script in a long time, so I'm not sure whether it is dying because of that issue or something else. (I assume you have tried funannotate remote with some GBK files that contain no alternative transcripts, and those worked?) I do run genomes through antiSMASH v6 and v7 on our local "cluster", and while I get a few warnings about loci that it can't parse, it does complete successfully.

You could also probably generate a GenBank file with only the -T1 transcripts for submission to antiSMASH. The -T1 transcripts are the most abundant at each locus and would be more than sufficient for antiSMASH.
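One possible way to get a -T1-only annotation is to filter the GFF3 before submitting or converting it. The Python sketch below (standard library only) is an illustration and not a funannotate feature. It assumes that transcript IDs end in `-T1`, `-T2`, and so on, and that child features point at them through `Parent`. The function name `keep_primary_transcripts` and the sample records are hypothetical, and converting the filtered GFF3 to GenBank is not shown.

```python
def _attributes(column9):
    """Parse a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def keep_primary_transcripts(gff3_lines, suffix="-T1"):
    """Return the lines with every transcript other than `suffix` removed.

    Kept: comments and directives, features without a Parent (e.g. genes),
    features whose ID ends with `suffix` (the transcript itself), and
    features whose Parent ends with `suffix` (its exons and CDS rows).
    """
    kept = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            kept.append(line)  # pass through anything that is not a feature row
            continue
        attrs = _attributes(cols[8])
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        is_top_level = not parents
        is_primary = attrs.get("ID", "").endswith(suffix)
        is_child_of_primary = any(p.endswith(suffix) for p in parents)
        if is_top_level or is_primary or is_child_of_primary:
            kept.append(line)
    return kept


# Hypothetical records: one locus with two alternative transcripts.
SAMPLE_GFF3 = [
    "##gff-version 3",
    "scaffold_1\tfunannotate\tgene\t100\t898\t.\t+\t.\tID=FUN_000001;",
    "scaffold_1\tfunannotate\tmRNA\t100\t898\t.\t+\t.\tID=FUN_000001-T1;Parent=FUN_000001;",
    "scaffold_1\tfunannotate\tCDS\t100\t399\t.\t+\t0\tID=FUN_000001-T1.cds;Parent=FUN_000001-T1;",
    "scaffold_1\tfunannotate\tmRNA\t100\t898\t.\t+\t.\tID=FUN_000001-T2;Parent=FUN_000001;",
    "scaffold_1\tfunannotate\tCDS\t599\t898\t.\t+\t0\tID=FUN_000001-T2.cds;Parent=FUN_000001-T2;",
]

FILTERED_GFF3 = keep_primary_transcripts(SAMPLE_GFF3)
```

After filtering, each locus carries a single mRNA and a single CDS set, so every locus_tag appears on only one CDS once the annotation is written out in GenBank format.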

StephenHarding-USDA commented 7 months ago

I feared that this would be the case and almost did not post my issue. But I am glad that I did because your answer validated my suspicion and helped me to understand the issue better. Thank you for the timely response.