Open Nicholas-Kron opened 9 months ago
I'm not entirely sure where the issue is, but I suspect it is the update step. It might be a bug or, as you mentioned, an artifact of how you ran it. I assume the log files from predict when it ran tbl2asn did not have these errors, so it could be that the code parsing the PASA GFF3 output didn't work as expected.
Hi Jon,
Yes, you are correct: the predict step did not produce these errors. I ended up rerunning the whole pipeline from scratch, using input short and long reads, from train all the way to update. Unfortunately, even in a clean run without mixing database types, inputs, or computers, I still get the same problems (I suppose on the bright side this suggests that your pipeline is robust to my shenanigans).
Running the full pipeline from predict to update cleanly did not fix the UTR annotation issue; UTR annotation is still really low despite 90% having 5' and 3' UTRs in the PASA db. As you mentioned, the predict tbl2asn output lacks any remarks on FEATURE_LOCATION_CONFLICT and has only 6 for OVERLAPPING_GENES or FIND_OVERLAPPED_GENES, whereas the update tbl2asn output has hundreds of these. Looking more at the update tbl2asn output, there are a few other errors, like BAD_LOCUS_TAG_FORMAT, which further points to something happening during update, like you say.
When I actually looked at the FEATURE_LOCATION_CONFLICT genes, none of the mRNA isoforms or exons are actually outside the bounds of the gene. As for the OVERLAPPING_GENES, looking at those reveals a bunch of weird genes that are up to a few megabases long and overlap several smaller genes that downstream receive similar annotations. So it seems like update is inserting lots of new overlapping genes? The update step also gives me a new file for gene models that need fixing, which now has 9 genes with multiple internal stops, none of which correspond to the problematic overlapping genes.
Not really sure how to proceed. I suppose I could forgo the update step, but I don't like the idea of having incomplete annotations. Thanks for your time on this!
Here are the errors for the new full run of the pipeline at update:
Opsanus_beta_Bic.stats.json Opsanus_beta_Bic.pasa-reannotation.changes.txt Opsanus_beta_Bic.models-need-fixing.txt Opsanus_beta_Bic.error.summary.txt Opsanus_beta_Bic.discrepency.report.txt funannotate-update.log
Okay, I think probably this function is the issue: https://github.com/nextgenusfs/funannotate/blob/4d8e196295f24535a3e5fb0149aadcd66e5c032a/funannotate/update.py#L1517
It tries to parse the strange IDs/format from PASA and generate proper gene models. But it seems that at least one locus_tag slipped through, and likely some other errors as well. The overlapping genes are unlikely to be a "fatal" error. I would need access to the intermediate results in order to try to fix this.
Makes sense. Would a tarball of the update_misc folder be sufficient?
Yeah that should probably work.
Here is the tarball of the update_misc folder (uploaded to google drive). I didn't include the tbl2asn files since I figured those would be regenerated anyway, I hope that is OK. If not I can re-compress and upload again. Thanks again for your help!
Are you using the latest release?
Describe the bug
Not a bug per se, but in the discrepancy report from tbl2asn I am getting a large number of feature location conflicts. 4822 genes out of 41,076 genes (38,994 mRNAs), or almost 12%. I reached out to the NCBI and they suggested the only real solution was to manually extend the gene boundaries to match the mRNAs. I can write a script to parse the discrepancy report and fix those boundaries, but that seems like quite a lot of discrepancies? How does this sort of mismatch happen? Could some of the inputs or my process also contribute?
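The boundary fix mentioned above (extending each gene to match its mRNAs) can be sketched in Python. This is a minimal illustration under stated assumptions, not funannotate code: it works directly on GFF3 lines rather than on the tbl2asn discrepancy report (whose format is not shown here), it assumes mRNA features point to their gene through a single `Parent=` attribute, and the function names are hypothetical.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column ("ID=x;Parent=y;") into a dict."""
    return dict(
        pair.split("=", 1) for pair in field.strip().split(";") if "=" in pair
    )


def extend_gene_bounds(gff3_lines):
    """Return GFF3 lines in which every gene spans all of its mRNA children.

    Genes are only ever widened, never shrunk. Comment and directive
    lines pass through unchanged.
    """
    rows = [line.rstrip("\n").split("\t") for line in gff3_lines]

    # First pass: smallest start and largest end over each gene's mRNAs.
    spans = {}
    for cols in rows:
        if len(cols) == 9 and cols[2] == "mRNA":
            parent = parse_attrs(cols[8]).get("Parent")
            if parent is None:
                continue
            start, end = int(cols[3]), int(cols[4])
            lo, hi = spans.get(parent, (start, end))
            spans[parent] = (min(lo, start), max(hi, end))

    # Second pass: widen gene coordinates where an mRNA sticks out.
    out = []
    for cols in rows:
        if len(cols) == 9 and cols[2] == "gene":
            gene_id = parse_attrs(cols[8]).get("ID")
            if gene_id in spans:
                lo, hi = spans[gene_id]
                cols[3] = str(min(int(cols[3]), lo))
                cols[4] = str(max(int(cols[4]), hi))
        out.append("\t".join(cols))
    return out
```

Comparing the input and output lines also gives a quick count of how many genes really have an mRNA extending past their boundaries, which is a useful sanity check before changing anything.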
I ran predict using a gff3 generated by an external SQLite PASA run that only used IsoSeq long read data from the same individual the genome assembly was built from. I then ran update with the same IsoSeq transcripts and mRNA short reads from the same species available at the SRA, using a MySQL PASA DB built with the same inputs as the SQLite one. I ran predict on an HPC that doesn't have MySQL and update on a local machine that does have MySQL to save time (it took 2 weeks for a colleague's update to run single threaded on the HPC). Could that be part of the problem? I assumed this wouldn't be a problem since, in my understanding, update builds a new transcriptome and compares it to the one provided to predict.
I wonder if this is also perhaps related to the fact that while PASA marks 40,806 out of 48,305 CDS as complete, after funannotate update only 17,200 out of 38,994 CDS had both 5' and 3' UTRs and 20,928 had no UTR annotation at all? The IsoSeq reads have a 96% alignment rate to the assembly, while the short reads only 88%. I did notice that running update with long reads and short reads results in only ~650 of the long reads aligning, whereas with long reads alone all ~140,000 align in the Trinity step. I figured this was a RAM issue in Trinity.
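UTR counts like the ones above can be tallied from a GFF3 file with a short script. The sketch below is illustrative only: it assumes UTRs are written as `five_prime_UTR` and `three_prime_UTR` features whose `Parent=` attribute names the mRNA, which may not hold for every GFF3 producer, and the function name is hypothetical.

```python
from collections import Counter


def utr_summary(gff3_lines):
    """Count mRNAs with both UTRs, exactly one UTR type, or no UTR."""
    utr_types = ("five_prime_UTR", "three_prime_UTR")
    seen = {}  # mRNA ID -> set of UTR feature types attached to it

    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip comments, directives, blank lines
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        if cols[2] == "mRNA" and "ID" in attrs:
            seen.setdefault(attrs["ID"], set())
        elif cols[2] in utr_types:
            # A feature may list several parents separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    seen.setdefault(parent, set()).add(cols[2])

    counts = Counter()
    for kinds in seen.values():
        if len(kinds) == 2:
            counts["both"] += 1
        elif kinds:
            counts["one"] += 1
        else:
            counts["none"] += 1
    return counts
```

Running the same tally on the PASA GFF3 and on the update GFF3 would show whether the UTRs are lost during update or were never passed along.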
I am rerunning update with more RAM and predict with the new PASA gff3 from the mysql run to rule out those as contributing. I appreciate any advice you may have. I am reluctant to go all the way back and run train locally due to time considerations, but if it needs to happen I will try it.
What command did you issue?
for predict:
for update:
Logfiles
Currently rerunning the update with more RAM; will provide the log files when it finishes.
OS/Install Information
funannotate check --show-versions
For Predict (HPC)
For Update (local)
Both installs of funannotate were done the same way, at the same time, and parameterized the same.