nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
300 stars 82 forks source link

>10% of genes have RNA-Gene location discrepancies in DiscRep #963

Open Nicholas-Kron opened 9 months ago

Nicholas-Kron commented 9 months ago

Are you using the latest release?

funannotate --version
funannotate v1.8.15

Describe the bug

grep -ce "DiscRep_SUB:FEATURE_LOCATION_CONFLICT::RNA feature location does not match gene location" Opsanus_beta_Bic.discrepency.report.txt 
4822

Not a bug per se, but in the discrepancy report from tbl2asn I am getting a large number of feature location conflicts. 4822 genes out of 41,076 genes (38,994 mRNAs), or almost 12%. I reached out to the NCBI and they suggested the only real solution was to manually extend the gene boundaries to match the mRNAs. I can write a script to parse the discrepancy report and fix those boundaries, but that seems like quite a lot of discrepancies? How does this sort of mismatch happen? Could some of the inputs or my process also contribute?

I ran predict using a gff3 generated by an external SQLite PASA run that only used IsoSeq long read data from the same individual the genome assembly was built. I then ran update with the same IsoSeq transcripts and mRNA short reads from the same species available at the SRA using a MySQL PASA DB built with the same inputs as the SQLite one. I ran predict on an HPC that doesn't have MySQL and update on a local machine that does have MySQL to save time (took 2 weeks for a colleague update to run single threaded on the HPC). Could that be part of the problem? I assumed this wouldn't be a problem since in my understanding that in update a new transcriptome is built and compared to the one provided to predict.

I wonder if this is also perhaps related to the fact that while PASA marks 40,806 out of 48,305 CDS as complete, after funannotate update only 17,200 out of 38,994 CDS had both 5' and 3' UTRs and 20,928 had no UTR annotation at all? The IsoSeq reads have a 96% alignment rate to the assembly, while the short reads only 88%. I did notice that running update with long reads and short reads leads to only ~650 of the long reads to align, whereas with long reads alone all ~140,000 align in the trinity step. I figured this was a RAM issue in Trinity.

I am rerunning update with more RAM and predict with the new PASA gff3 from the mysql run to rule out those as contributing. I appreciate any advice you may have. I am reluctant to go all the way back and run train locally due to time considerations, but if it needs to happen I will try it.

What command did you issue?

for predict:

funannotate predict \
-i fOpsBet2.1_genomic.full_mask.soft.fa \
-o funannotate/predict \
--SeqCenter "UC Davis DNA Technologies & Expression Analysis Core" \
--species "Opsanus beta" \
--isolate "Bic" \
--name "P3L16" \
--transcript_evidence "pasa/Obeta_hq_transcripts_fixed.fasta.clean" \
--rna_bam "pasa/Obeta_hq_transcripts_fixed.fasta.clean.mm2.bam" \
--pasa_gff "pasa/fOpsBet2.1_pasa_db.assemblies.fasta.transdecoder.genome.gff3" \
--organism "other" \
--repeats2evm \
--keep_evm \
--optimize_augustus \
--cpus 10

for update:

funannotate update \
-i ../funannotate/predict \
--cpus 15 \
--left SRR3192986_1_fix.fastq SRR3193033_1_fix.fastq SRR3193034_1_fix.fastq SRR3193035_1_fix.fastq \
--right SRR3192986_2_fix.fastq SRR3193033_2_fix.fastq SRR3193034_2_fix.fastq SRR3193035_2_fix.fastq \
--memory 110G \
--pacbio_isoseq Obeta_hq_transcripts_fixed.fasta \
--species "Opsanus beta" \
--isolate "Bic" \
--name "P3L16" \
--sbt ../funannotate/Obeta_Bic.sbt \
--pasa_config $PASAHOME/pasa_conf/conf.txt \
--pasa_db mysql

Logfiles currently rerunning the update with more RAM, will provide when finishes.

OS/Install Information

For Predict (HPC)

OS: CentOS Linux 7, 16 cores, ~ 264 GB RAM. Python: 3.8.15

For Update (local)

OS: Pop!_OS 22.04, 20 cores, ~ 132 GB RAM. Python: 3.8.15

both installs of funannotate were done the same way, at the same time, and parameterized them same.

-------------------------------------------------------
Checking dependencies for 1.8.15
-------------------------------------------------------
You are running Python v 3.8.15. Now checking python packages...
biopython: 1.81
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.3.1
numpy: 1.24.3
pandas: 1.5.3
psutil: 5.9.5
requests: 2.31.0
scikit-learn: 1.2.2
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.855
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/home/extradisk2/databases/funannotate
$PASAHOME=/home/nicholaskron/miniconda3/envs/funannotate/opt/pasa-2.5.2
$TRINITYHOME=/home/nicholaskron/miniconda3/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/home/nicholaskron/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/home/nicholaskron/miniconda3/envs/funannotate/config/
$GENEMARK_PATH=/home/nicholaskron/miniconda3/envs/funannotate/opt/gmes_linux_64_4
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.0
blat: BLAT v35
diamond: 2.1.7
emapper.py: 2.1.10
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2021-08-25
gmes_petap.pl: 4.71_lic
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.0+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.2.3
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.16.1
signalp: 6.0
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.11 (Oct 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.14.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 37 external dependencies are installed
nextgenusfs commented 7 months ago

I'm not entirely sure where the issue is. But I suspect it is certainly the update step. It might be a bug or as you mentioned an artifact of how you ran it. But I assume the log files from predict when it ran tbl2asn did not have these errors. So it could be that the code parsing the PASA GFF3 output didn't work as expected.

Nicholas-Kron commented 7 months ago

Hi Jon,

Yes, you are correct the predict step did not produce these errors. I ended up rerunning the whole pipeline from scratch, using input short and long reads from train all the way to update. Unfortunately, even in a clean run without mixing database types, inputs, or computers I still get the same problems (I suppose on the bright side this suggests that your pipeline is robust to my shenanigans).

Running the full pipeline from predict to update cleanly did not fix the UTR annotation issue, it is still really low despite 90% having 5 and 3 prime UTRs in the PASA db. As you mentioned, the predict tabl2asn lacks any remarks on FEATURE_LOCATION_CONFLICT , only 6 OVERLAPPING_GENES, or FIND_OVERLAPPED_GENES; of which the update tabl2asn out has hundreds. Looking more at the update table2asn another few errors, like BAD_LOCUS_TAG_FORMAT, which further points to something happening during update like you say.

When I actually looked at the FEATURE_LOCATION_CONFLICT genes, none of the mRNA isoforms or exons are actually outside the bounds of the gene. As far as the OVERLAPPING_GENES, looking at those reveals a bunch of weird genes that are up to a few megabases and overlap several smaller genes that downstream receive similar annotations. So it seems like update is inserting lots of new overlapping genes? update also gives me a new file for gene models that need fixing that now has 9 genes with multiple internal stops, none of which correspond to the problematic overlapping genes.

Not really sure how to proceed. I suppose I could forgo the update step but I don't like the idea of having incomplete annotations. Thanks for your time on this!

Here are the errors for the new full run of the pipeline at update:

Opsanus_beta_Bic.stats.json Opsanus_beta_Bic.pasa-reannotation.changes.txt Opsanus_beta_Bic.models-need-fixing.txt Opsanus_beta_Bic.error.summary.txt Opsanus_beta_Bic.discrepency.report.txt funannotate-update.log

nextgenusfs commented 7 months ago

Okay, I think probably this function is the issue: https://github.com/nextgenusfs/funannotate/blob/4d8e196295f24535a3e5fb0149aadcd66e5c032a/funannotate/update.py#L1517

It tries to parse the strange IDS/format from PASA and generate proper gene models. But it seems that at least one locus_tag slipped through and likely perhaps some other errors. The overlapping genes is unlikely a "fatal" error. I would like need to access to the intermediate results in order to try to fix.

Nicholas-Kron commented 7 months ago

Makes sense. Would a tarball of the update_misc folder be sufficient?

nextgenusfs commented 7 months ago

Yeah that should probably work.

Nicholas-Kron commented 7 months ago

Here is the tarball of the update_misc folder (uploaded to google drive). I didn't include the tbl2asn files since I figured those would be regenerated anyway, I hope that is OK. If not I can re-compress and upload again. Thanks again for your help!