nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

funannotate predict did not incorporate UTRs from PASA #948

Closed Nicholas-Kron closed 10 months ago

Nicholas-Kron commented 10 months ago

Are you using the latest release? version used: Funannotate 1.8.15 (latest) via mamba

Describe the bug funannotate predict did not incorporate UTR information from an externally generated PASA gff3 file. PASA v2.5.2, run externally, annotated 38,552 transcripts as having complete UTRs, but the funannotate output reports 0 genes with 5', 3', or complete UTRs. The prediction otherwise seems to have worked fine: lots of predicted genes, just no UTRs. A colleague running the same scripts with an external PASA transcriptome on the same HPC cluster, but with a different organism/genome, did not encounter this issue. I have to rerun anyway because the GeneMark install broke for some reason, but I don't think that should have affected incorporating the PASA UTR annotation, right? I have been debugging Perl, which in my experience is often the source of problems on our HPC, and rerunning, but wanted to see whether something else could have been the cause. Thanks for your help!

   "annotation": {
        "genes": 42383,
        "common_name": 0,
        "mRNA": 36058,
        "tRNA": 6325,
        "ncRNA": 0,
        "rRNA": 0,
        "avg_gene_length": 16166.35,
        "transcript-level": {
            "CDS_transcripts": 36058,
            "CDS_five_utr": 0,
            "CDS_three_utr": 0,
            "CDS_no_utr": 36058,
            "CDS_five_three_utr": 0,
            "CDS_complete": 29903,
            "CDS_no-start": 2865,
            "CDS_no-stop": 2129,
            "CDS_no-start_no-stop": 1161,
            "total_exons": 237440,
            "total_cds_exons": 237440,
            "multiple_exon_transcript": 32207,
            "single_exon_transcript": 3851,
            "avg_exon_length": 182.68,
            "avg_protein_length": 414.7,
            "functional": {
                "go_terms": 0,
                "interproscan": 0,
                "eggnog": 0,
                "pfam": 0,
                "cazyme": 0,
                "merops": 0,
                "busco": 0,
                "secretion": 0

The UTRs are in the PASA gff3 as:

scaffold_13     transdecoder    three_prime_UTR 76859755        76860001        .       -       .       ID=asmbl_10001.p1.utr3p1;Parent=asmbl_10001.p1
...
scaffold_13     transdecoder    five_prime_UTR  76879823        76880274        .       -       .       ID=asmbl_10002.p1.utr5p1;Parent=asmbl_10002.p1
...

In pasa.training.tmp.gtf they are listed as:

scaffold_1      pasa    3UTR    32246   35593   0       -       0       gene_id "GENE.asmbl_1~~asmbl_1.p1^scaffold_1^-"; transcript_id "asmbl_1.p1";
...
scaffold_1      pasa    5UTR    72909   73053   0       -       0       gene_id "GENE.asmbl_2~~asmbl_2.p4^scaffold_1^-"; transcript_id "asmbl_2.p4";
...
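To confirm how many UTR features each of these files actually carries, something like the following could be used. This is only a minimal sketch: the function name count_utr_features and the set of feature labels are assumptions based on the lines quoted above, not part of funannotate or PASA.

```python
from collections import Counter

# Feature labels taken from the two excerpts above (GFF3 and GTF).
# Other tools may label UTRs differently; extend this set as needed.
UTR_TYPES = {"five_prime_UTR", "three_prime_UTR", "5UTR", "3UTR"}


def count_utr_features(lines):
    """Count UTR features by the value of column 3 in GFF3/GTF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # Fall back to whitespace splitting for text pasted without tabs
            fields = line.split(None, 8)
        if len(fields) >= 9 and fields[2] in UTR_TYPES:
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_utrs.py some_file.gff3
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for feature, n in sorted(count_utr_features(handle).items()):
                print(feature, n)
```

Running it on both the PASA gff3 and pasa.training.tmp.gtf would show whether the UTR records survive the conversion, independent of what the final annotation reports.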

What command did you issue?

funannotate predict \
-i fOpsBet2.1_genomic.full_mask.soft.fa \
-o funannotate/predict \
--SeqCenter "UC Davis DNA Technologies & Expression Analysis Core" \
--species "Opsanus beta" \
--isolate "Bic" \
--name "P3L16" \
--transcript_evidence "pasa/Obeta_hq_transcripts_fixed.fasta.clean" \
--rna_bam "pasa/Obeta_hq_transcripts_fixed.fasta.clean.mm2.bam" \
--pasa_gff "pasa/fOpsBet2.1_pasa_db.assemblies.fasta.transdecoder.genome.gff3" \
--organism "other" \
--repeats2evm \
--keep_evm \
--optimize_augustus \
--cpus 10

Logfiles Please provide relevant log files of the error.

funannotate-predict.log

OS/Install Information

OS: CentOS Linux 7, 16 cores, ~ 264 GB RAM. Python: 3.8.15
-------------------------------------------------------
Checking dependencies for 1.8.15
-------------------------------------------------------
You are running Python v 3.8.15. Now checking python packages...
biopython: 1.81
goatools: 1.2.3
matplotlib: 3.4.3
natsort: 8.3.1
numpy: 1.24.3
pandas: 1.5.3
psutil: 5.9.5
requests: 2.31.0
scikit-learn: 1.2.2
scipy: 1.10.1
seaborn: 0.12.2
All 11 python packages installed

You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.38
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.050
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.84
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 2.03
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 0.98
URI::Escape: 5.12
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/nethome/n.kron/coral_omics/databases/funannotate_db
$PASAHOME=/nethome/n.kron/mambaforge/envs/funannotate/opt/pasa-2.5.2
$TRINITYHOME=/nethome/n.kron/mambaforge/envs/funannotate/opt/trinity-2.8.5
$EVM_HOME=/nethome/n.kron/mambaforge/envs/funannotate/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/nethome/n.kron/mambaforge/envs/funannotate/config/
$GENEMARK_PATH=//nethome/n.kron/mambaforge/envs/funannotate/opt/gmes_linux_64
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.0
blat: BLAT v37x1
diamond: 2.1.7
emapper.py: 2.1.11
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-04-28
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 15.0.1
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.0+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.2.3
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.16.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.11 (Oct 2022)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.14.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
    ERROR: gmes_petap.pl not installed
    ERROR: signalp not installed

hyphaltip commented 10 months ago

I'm confused about your goals. Generally the 'update' command is where UTRs are added and incorporated into the annotation and analysis. Have you tried your approach with the update command?

Nicholas-Kron commented 10 months ago

So I should not expect any UTR information to be carried over from the external PASA run during predict? For my colleague, the initial predict run incorporated all UTRs predicted by PASA, and these were further refined by the update run afterward (e.g. his predict run resulted in ~30k genes with UTRs, and update improved ~4k of them). Based on his experience I assumed something must have gone wrong, since my UTR information was not incorporated during the predict step. If not incorporating UTRs during predict is normal behavior, then that is my mistake and I will proceed to update.

nextgenusfs commented 10 months ago

Correct, predict will not have any UTR information. https://funannotate.readthedocs.io/en/latest/update.html

To add UTRs using PASA's compare-annotations methodology, run funannotate update after running predict. It will use the existing PASA alignments from the database and modify existing gene models, including adding UTR information if it is present.
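For anyone wanting to check the result after update, here is a rough sketch of one way to tally UTR-bearing transcripts in a GFF3 file by comparing exon and CDS extents. It assumes exon and CDS features carry a Parent attribute pointing at the transcript; the function name utr_summary and the category keys are made up for illustration and are not funannotate's own stats code.

```python
from collections import defaultdict


def utr_summary(gff3_lines):
    """Classify transcripts by whether exons extend beyond the CDS.

    A transcript is treated as having a UTR on one side if its outermost
    exon coordinate lies outside the CDS span on that side.
    """
    exons = defaultdict(list)
    cds = defaultdict(list)
    strand_of = {}
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] not in ("exon", "CDS"):
            continue
        attrs = dict(kv.split("=", 1) for kv in f[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if not parent:
                continue
            target = exons if f[2] == "exon" else cds
            target[parent].append((int(f[3]), int(f[4])))
            strand_of[parent] = f[6]

    # Hypothetical, mutually exclusive categories
    summary = {"five_utr": 0, "three_utr": 0, "five_three_utr": 0, "no_utr": 0}
    for tid, cds_parts in cds.items():
        if tid not in exons:
            continue  # cannot compare without exon records
        left = min(s for s, _ in exons[tid]) < min(s for s, _ in cds_parts)
        right = max(e for _, e in exons[tid]) > max(e for _, e in cds_parts)
        # On the minus strand the 5' end is at the higher coordinate
        five, three = (right, left) if strand_of[tid] == "-" else (left, right)
        if five and three:
            summary["five_three_utr"] += 1
        elif five:
            summary["five_utr"] += 1
        elif three:
            summary["three_utr"] += 1
        else:
            summary["no_utr"] += 1
    return summary
```

Calling utr_summary(open(path)) on the GFF3 written by predict and again on the one written by update should make the difference visible. All-zero UTR categories after predict would be consistent with the behavior described above.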

Nicholas-Kron commented 10 months ago

Ah I see, my mistake. Thank you for clearing that up. Must have been some miscommunication on our part. I will proceed to update then. Thank you for your assistance!