NCBI error: CDS not contained within cross-referenced mRNA

lixx3627 commented 5 years ago

Describe the bug Hi Jon, when I'm doing NCBI genome submission, Validation error picked up an error like so: For example, there are ~400 cases of "SEQ_FEAT.CDSmRNAXrefLocationProblem" in one -500 cases in my genome submissions, which are almost all gene models that have alternative transcripts:

ERROR: valid [SEQ_FEAT.CDSmRNAXrefLocationProblem] CDS not contained within cross-referenced mRNA FEATURE: CDS: <234> [(lcl|tig00001981:1169051-1169221, 1169300-1169575)] [lcl|tig00001981: raw, dna len= 3019403] -> [gnl|ncbi|XXXXX_033258-T2]

ERROR: valid [SEQ_FEAT.CDSmRNAXrefLocationProblem] CDS not contained within cross-referenced mRNA FEATURE: CDS: <208> [(lcl|tig00001981:c1062723-1062571, c1062463-1062376, c1062291-1062199, c1062082-1061962, c1061800-1061370, c1061273-1061035, c1060954-1060858, c1060665-1060358)] [lcl|tig00001981: raw, dna len= 3019403] -> [gnl|ncbi|XXXXX_032286-T1]

.......etc (400 - 500 scenarios like so)

When I was debugging the issue, I found it was actually the coordinates of CDs and exons of the alternative transcripts were swapped: i.e., XXX-T1 exons coordinates actually covers the XXX-T2 CDS, XXX-T2 exons coordinates actually covers the XXX-T1 CDS. So wherever CDs happened to be assigned to a shorter exon which alternative transcript gave an error "CDS not contained within cross-referenced mRNA ". It looks like so in GenomeView:

I looked back where this issue could happen, and found this issue existed since funannotate update. I wonder if you know about this issue and could help.

Thanks, Feng

nextgenusfs commented 5 years ago

Hi Feng, I've mentioned this to NCBI staff, the fundamental problem is that GenBank format does not have a transcript_id in their mRNA field and tbl2asn does not seem to write transcripts in an order that is reproducible, this means that with multiple transcript/alt transcript gene models, there is no way to properly parse them from the GenBank file. The files should be correct in the GFF3 output as well as the tbl output from funannotate update, but there is no guarantee that in the GBK output that the transcripts are properly paired with their protein CDS features. In older versions of funannotate, I was parsing the GenBank file to generate the input files for funannotate annotate, it should now be grabbing the TBL file to use for adding functional annotation. Alternatively you could pass a GFF3 and FASTA file to funannotate annotate, but the problem is the GBK format.

Code is here: https://github.com/nextgenusfs/funannotate/blob/master/bin/funannotate-functional.py#L510-L553

This means that if you pass only a GBK file to funannotate annotate that has multiple transcripts per locus, it is most likely going to be incorrect. However, if you pass a funannotate output folder to funannotate annotate, it will properly determine that you have run funannotate update and pull the TBL file from the proper place and reuse it. The TBL file is where the functional annotation gets added and also is the input to tbl2asn, i.e. ultimately the genbank submission.

You can make sure this is the error by looking at the GFF3 output from funannotate update in your genome browser, you should see this same locus has properly paired transcripts/CDS features.

Let me know if you have already run funannotate annotate as I describe above and you still see this behavior.

lixx3627 commented 5 years ago

Hi Jon, Actually, the GFF3 output from funannotate update has the same "swap" for the alternative transcripts as I described above. Here's how I ran funannotate update: funannotate update -i ./fun_folder --cpus 20 --pasa_config /scratch.global/username/funannotate/sqlite_tmp/210_pseudomolecule/alignAssembly.txt

So my guess is somehow the funannotate update part went wrong..

nextgenusfs commented 5 years ago

Okay, what version are you running?

lixx3627 commented 5 years ago

It was funannotate v1.5.0.

nextgenusfs commented 5 years ago

Okay, I'll look to see if any changes happened in update since v1.5.0. In the meantime, can you pull a few examples in GFF3 format? Would need the input, this would be the GFF3 models from funannotate predict and then the matching loci from funannotate update with the alternative transcripts. It would also help to get the intermediate models as well (i.e. output from PASA), this may be a little tricker, but I think you could use something like bedtools intersect to pull out same regions from the 3 sets of GFF3 files. I need to determine where in the pipeline the error is occurring.

lixx3627 commented 5 years ago

Sure I can pull out several examples. What are the 3 sets of GFF3 files? GFF3 from funannotate update and funannotate annotate, and which one? Thanks.

nextgenusfs commented 5 years ago

The 3 GFF3 files needed are:

Result from funannotate predict: output_dir/predict_results/species_genus_strain.gff3
Intermediate result: output_dir/update_misc/bestmodels.gff3
Result from funannotate update: output_dir/update_results/species_genus_strain.gff3

The raw output of PASA compare is in output_dir/update_misc/pasa_final.gff3 -- however, this is in an atypical format. So if the error is in the bestmodels.gff3 then I might need you to find the corresponding records in the PASA output file to see why they were getting parsed incorrectly.

nextgenusfs commented 5 years ago

And as I looked at the code, I don't see any changes specifically in the function that converts PASA GFF3 output, so if this is a bug it has been there for awhile.

But I did notice that I fixed a rather important bug for RNA-seq and funannotate update in v1.5.1 -- which was a mistake in the bam2gff3 function, the result was that crick (negative) stranded alignments were off resulting in bad alignments getting passed to PASA. I wrote this in the release notes. But I would definitely upgrade versions if you can. If this is a bug and I can reproduce with the current code, then I'll fix it first and then you can update versions.

lixx3627 commented 5 years ago

Thanks Jon for the info. I'll try to pull out of those gff3 file.

I don't think we'll be able to upgrade and use v1.5.1 at the moment as we are already in the last step of our paper submission, and so far our downstream analysis didn't really involve this alternative transcripts CDs/exon issue. I can manually modify the naming of T1, T2 for the CDs and exons to match but apparently it'll be a painful process... Thanks!

nextgenusfs commented 5 years ago

Okay, well if it is a bug I think should be able to just fix that on your system locally, re-run (which will just re-use the data), and hopefully output proper paired transcripts/CDSs.

lixx3627 commented 5 years ago

Hi Jon, Sorry for the late reply. I'm trying to figure out where could go wrong by comparing those gff3 files but didn't find specific patterns. Here I'm attaching two examples (two genes with alternative transcripts) that had this "CDS not contained within cross-referenced mRNA". I took two examples from "intersect3gff.gff3" file which was generated by bedtools intersect -wa -wb -a update_results/update.gff3 -b predict_results/predict.gff3 update_misc/bestmodels.gff3 -names predict bestmodels > intersect3gff.gff3

For example, in example 1: the mRNA, exon, CDS of XXXXX_033258-T1 from funannotate update seems sharing the same region of CDS, exon, UTR from both XXXXX_033258-T1 and XXXXX_033258-T1.1.5c425f17 from the bestmodels.gff3(I suppose T1.1.5c425f17 from bestmodels was named as T2 in 'update' ).

I'm not sure how to better compare them to figure out where went wrong. Could you please give a bit more instructions to debug this? Please let me know if you need more info.

Thank you so much for your help!

Feng

lixx3627 commented 5 years ago

Maybe this is more intuitive in Genome Browser. Please find the genome browser images for these two examples, and we can see it looks like bestmodels.gff3 was still good, with longer exons matching the longer mRNA, which was later wrong in funannotate update gff3 result.

nextgenusfs commented 5 years ago

Okay, I'm a little bit confused by the output as well, I guess what I meant by bedtools intersect would be something more like this:

echo -e "tig00001981\t1169027\t1170662" > region1.bed
bedtools intersect -a predict.gff3 -b region1.bed -wa > fun_predict_region1.gff3
bedtools intersect -a bestmodels.gff3 -b region1.bed -wa > bestmodels_region1.gff3
bedtools intersect -a update.gff3 -b region1.bed -wa > fun_update_region1.gff3'

I want to see what the coordinates/naming is for each step to see where the error is. It looks like in my bleeding edge code I've updated the tbl writing function for some additional functionality, so perhaps that is where the "error" is.

lixx3627 commented 5 years ago

Thanks for the info. Here is the output of those gff3 files. I think we can see in funannotate update result, it has swapped T1 and T2 but it was good in "bestmodels", right?

nextgenusfs commented 5 years ago

Okay, I think I have an idea of where it is happening, although I won't know for sure until I test it. Would you be able to send the data (can be via email or a link)? I will need the bestmodels GFF3 file, the genome FASTA, and the the trnascan GFF file (should be update_misc/genome.trna.gff3). Basically the error seems to likely be in the conversion of the GFF file to the tbl file in this function: https://github.com/nextgenusfs/funannotate/blob/master/bin/funannotate-update.py#L1024

So assuming I can fix this bug, are you going to be able to upgrade the version of funannotate on the cluster? Since funannotate annotate uses only protein models and those aren't effected with this bug, only the pairing of transcripts to those protein models, I think you should be able to just run the same commands over the top of existing data (with the fixed function) and it should correct it (I hope).

lixx3627 commented 5 years ago

Hi Jon,

I'll send you the data through email as they are still confidential. Thank you so much for your help! "upgrade the version of funannotate on the cluster"? You mean after you fix the bug of "funannotate update", I run "funannotate annotate" in funannotate v1.5.1 using the result from funannotate 1.5.0? Can I stick to v1.5.0 though after we fix this issue in "funannotate annotate"?

Thanks!

Feng

On Tue, May 21, 2019 at 11:55 AM Jon Palmer notifications@github.com wrote:

Okay, I think I have an idea of where it is happening, although I won't know for sure until I test it. Would you be able to send the data (can be via email or a link)? I will need the bestmodels GFF3 file, the genome FASTA, and the the trnascan GFF file (should be update_misc/genome.trna.gff3). Basically the error seems to likely be in the conversion of the GFF file to the tbl file in this function: https://github.com/nextgenusfs/funannotate/blob/master/bin/funannotate-update.py#L1024

So assuming I can fix this bug, are you going to be able to upgrade the version of funannotate on the cluster? Since funannotate annotate uses only protein models and those aren't effected with this bug, only the pairing of transcripts to those protein models, I think you should be able to just run the same commands over the top of existing data (with the fixed function) and it should correct it (I hope).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/290?email_source=notifications&email_token=AHLCXW3TQ7EY4QDZQMCPJQTPWQSO5A5CNFSM4HN3JBI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODV4Q32Y#issuecomment-494472683, or mute the thread https://github.com/notifications/unsubscribe-auth/AHLCXW52TCM3QMLPSQR54ZTPWQSO5ANCNFSM4HN3JBIQ .

nextgenusfs commented 5 years ago

Let me get the issue fixed first and then see about how to get your project done without changing anything. Maybe I can just get you a script that will reformat the funannotate update ouptut and then you can run v1.5.0 annotate so nothing changes [except for the proper phasing of transcripts/proteins]. But after I get this fixed, then next release will be 1.5.4, so would be pertinent to upgrade the cluster copy at that time.

nextgenusfs commented 5 years ago

Can send to nextgenusfs at gmail dot com.

nextgenusfs commented 5 years ago

Okay, this is fixed with https://github.com/nextgenusfs/funannotate/commit/e1184313545e361b15d9b68a37fdcddffe1d3c4f. A few more fixes and I'm going to push v1.6.0 -- will be good to upgrade the copy on the cluster at that time.

lixx3627 commented 5 years ago

Great! Thanks for helping Jon. Now the .tbl and .sqn files have correct CDS/mRNA/exon matching.

On Sun, May 26, 2019 at 7:13 PM Jon Palmer notifications@github.com wrote:

Okay, this is fixed with e118431 https://github.com/nextgenusfs/funannotate/commit/e1184313545e361b15d9b68a37fdcddffe1d3c4f. A few more fixes and I'm going to push v1.6.0 -- will be good to upgrade the copy on the cluster at that time.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/290?email_source=notifications&email_token=AHLCXW2FBLAZXTO72YKNWXDPXMRSVA5CNFSM4HN3JBI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWIQFYI#issuecomment-496042721, or mute the thread https://github.com/notifications/unsubscribe-auth/AHLCXW2G5EPJQKZRX6TPNPTPXMRSVANCNFSM4HN3JBIQ .

lixx3627 commented 5 years ago

Hi Jon,

As the .gff3 files were still wrong but you mentioned you "now have code to generate the gff3 files directly from tbl files", I wonder if by any chance I can find this code online or from you.

The reason is that now I have to separate one of my .sqn files into three sets, because for one isolate we have generated phased haplotypes and NCBI suggests to provide each pseudohaplotype as a separate genome (each as one "set"). I have contig names for each set of pseudohaplotype, and I feel it's easier to parse and separate it in .gff3 files than .sqn file?

Thanks a lot for your help!

Feng

On Mon, May 27, 2019 at 12:52 PM Feng LI lixx3627@umn.edu wrote:

Great! Thanks for helping Jon. Now the .tbl and .sqn files have correct CDS/mRNA/exon matching.

On Sun, May 26, 2019 at 7:13 PM Jon Palmer notifications@github.com wrote:

Okay, this is fixed with e118431 https://github.com/nextgenusfs/funannotate/commit/e1184313545e361b15d9b68a37fdcddffe1d3c4f. A few more fixes and I'm going to push v1.6.0 -- will be good to upgrade the copy on the cluster at that time.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/290?email_source=notifications&email_token=AHLCXW2FBLAZXTO72YKNWXDPXMRSVA5CNFSM4HN3JBI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWIQFYI#issuecomment-496042721, or mute the thread https://github.com/notifications/unsubscribe-auth/AHLCXW2G5EPJQKZRX6TPNPTPXMRSVANCNFSM4HN3JBIQ .

MinjieHu commented 5 years ago

Hi Jon, The same problem happens to me. And I was using an even old version, 1.3.3. Do you have any instruction to solve the problem while not changing the gene ID and gene symbol when I run an updated funannotate version?

MinjieHu commented 5 years ago

I updated to funannoate 1.5.3 and just run funannotate annoate --fix fix.txt. It will update new sqn files. However, when I look into the gff file, I found some of the gene names were not updated according to the fix.txt file. Here's one exsample: Xe_019101 Granulin2 Granulin like (this is in the fix.txt file) while in the gff file, the annotation shows as: HiC_scaffold_24 GenBank gene 2790216 2792255 . + . ID=Xe_019101; HiC_scaffold_24 GenBank mRNA 2790216 2792255 . + . ID=Xe_019101-T1;Parent=Xe_019101;product=hypothetical protein;

nextgenusfs / funannotate

NCBI error: CDS not contained within cross-referenced mRNA #290