nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

funannotate update, kallisto error #1025

Open schraderL opened 2 months ago

schraderL commented 2 months ago

Hi, I am running into an error with funannotate update while trying to add UTRs to existing CDS annotations. The problem appears to be that kallisto finds duplicate FASTA header names in update_misc/getBestModel/transcripts.fa. Any idea how to fix this?

Are you using the latest release?

I am using the Docker container for funannotate 1.8.17, run through Singularity with:

singularity pull funannotate.sif docker://nextgenusfs/funannotate
singularity shell --bind ~/projects/UTRs:/pasaUTRs,~/fq/A006200344_207613_S1/:/fq/ funannotate.sif

Describe the bug

When running funannotate update to update UTRs in a gff3 not produced by funannotate, I get the following error and the process dies:

-------------------------------------------------------
[Apr 10 09:11 AM]: OS: Debian GNU/Linux 10, 36 cores, ~ 98 GB RAM. Python: 3.8.12
[Apr 10 09:11 AM]: Running 1.8.17
[Apr 10 09:11 AM]: No NCBI SBT file given, will use default, for NCBI submissions pass one here '--sbt'
[Apr 10 09:11 AM]: Previous annotation consists of: 0 protein coding gene models and 0 non-coding gene models
[Apr 10 09:11 AM]: Existing annotation: locustag=COBS genenumber=20966
[Apr 10 09:11 AM]: Trimmomatic will be skipped
[Apr 10 09:11 AM]: Existing Trinity results found: CARD-0001/update_misc/trinity.fasta
[Apr 10 09:11 AM]: Existing BAM alignments found: CARD-0001/update_misc/trinity.alignments.bam, CARD-0001/update_misc/transcript.alignments.bam
[Apr 10 09:11 AM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus
[Apr 10 09:11 AM]: Building Kallisto index
[Apr 10 09:11 AM]: CMD ERROR: kallisto index -i CARD-0001/update_misc/getBestModel/bestModel CARD-0001/update_misc/getBestModel/transcripts.fa
[Apr 10 09:11 AM]:
[build] loading fasta file CARD-0001/update_misc/getBestModel/transcripts.fa
[build] k-mer length: 31
Error: repeated name in FASTA file CARD-0001/update_misc/getBestModel/transcripts.fa
novel_model_3648_661583a6

Run with --make-unique to replace repeated names with unique names

CARD-0001/update_misc/getBestModel/transcripts.fa contains one duplicate entry. Running

cat CARD-0001/update_misc/getBestModel/transcripts.fa|grep ">" |sort|cut -f 1 -d " "|uniq -d

returns:

>novel_model_3648_661583a6

And

grep ">novel_model_3648_661583a6" CARD-0001/update_misc/getBestModel/transcripts.fa

returns

>novel_model_3648_661583a6 novel_gene_1684_661583a6 NO NAME ASSIGNED LG13:1345282-1352116(-)
>novel_model_3648_661583a6 novel_gene_1683_661583a6 NO NAME ASSIGNED LG6:5920601-5922436(-)
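For files with more than a handful of repeats, the same check can be done in one pass. The following is a minimal sketch, not part of funannotate: the function name `find_duplicate_ids` is hypothetical, and it assumes (as the `cut -f 1 -d " "` step above does) that the record ID is the first whitespace-delimited token of each header.

```python
from collections import defaultdict


def find_duplicate_ids(fasta_path):
    """Return {record_id: [full header lines]} for every ID seen more than once.

    Hypothetical helper: the ID is taken to be the first whitespace-delimited
    token after '>' on each header line.
    """
    headers = defaultdict(list)
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence line, nothing to record
            header = line[1:].rstrip("\n")
            parts = header.split(None, 1)
            record_id = parts[0] if parts else ""
            headers[record_id].append(header)
    # Keep only the IDs that occur on two or more header lines.
    return {rid: found for rid, found in headers.items() if len(found) > 1}
```

Calling `find_duplicate_ids("CARD-0001/update_misc/getBestModel/transcripts.fa")` should then report each repeated ID together with all of its headers, which shows directly that the two records come from different loci.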

What command did you issue?

funannotate update -f CARD-0001.fa -g CARD-0001.gff3 --species CARD-0001 --out CARD-0001 -l /fq/CARD-0001-val_1.fq.gz -r /fq/CARD-0001-val_2.fq.gz --no_trimmomatic  --cpus 36

Logfiles

OS/Install Information

Checking dependencies for 1.8.17

You are running Python v 3.8.12. Now checking python packages...
biopython: 1.79
goatools: 1.3.11
matplotlib: 3.7.0
natsort: 8.4.0
numpy: 1.23.0
pandas: 2.0.3
psutil: 5.9.1
requests: 2.31.0
scikit-learn: 0.24.2
scipy: 1.5.3
seaborn: 0.13.0
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
local::lib: 2.000024
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/opt/databases
$PASAHOME=/venv/opt/pasa-2.4.1
$TRINITYHOME=/venv/opt/trinity-2.8.5
$EVM_HOME=/venv/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/usr/share/augustus/config
        ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir
-------------------------------------------------------
Checking external dependencies...
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.2
bamtools: bamtools 2.5.2
bedtools: bedtools v2.31.1
blat: BLAT v35
diamond: 2.1.8
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2017-11-15
hisat2: 2.2.1
hmmscan: HMMER 3.4 (Aug 2023)
hmmsearch: HMMER 3.4 (Aug 2023)
java: 11.0.8-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.26-r1175
pigz: 2.8
proteinortho: 6.0.16
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.12
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.9 (July 2021)
tantan: tantan 49
tbl2asn: 25.8
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
        ERROR: emapper.py not installed
        ERROR: gmes_petap.pl not installed
        ERROR: signalp not installed

schraderL commented 2 months ago

A quick addition: I can run kallisto index and kallisto quant manually without problems when I add the --make-unique option to the indexing step:

kallisto index -i CARD-0001/update_misc/getBestModel/bestModel CARD-0001/update_misc/getBestModel/transcripts.fa --make-unique
kallisto quant -i CARD-0001/update_misc/getBestModel/bestModel -o CARD-0001/update_misc/kallisto --plaintext -t 36 /fq/A006200344_207613_S1_L000_R1_001_val_1.fq.gz /fq/CARD-0001-val_1.fq.gz -r /fq/CARD-0001-val_2.fq.gz
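An alternative to `--make-unique` is to rename the repeated IDs in a copy of the FASTA file before indexing. The sketch below is only an illustration under stated assumptions: `make_ids_unique` and the `_dup` suffix are hypothetical choices, not kallisto's or funannotate's naming scheme. Renaming also changes the target names in kallisto's output, so anything that maps abundances back to gene models by name would need to account for it.

```python
def make_ids_unique(in_path, out_path, sep="_dup"):
    """Copy a FASTA file, renaming repeated record IDs so that every ID is unique.

    Hypothetical helper: the first occurrence keeps its ID, later occurrences
    get '<id><sep><n>'. Returns the number of records that were renamed.
    """
    used = set()   # every ID already written to the output
    counts = {}    # per-ID counter used to build suffixes
    renamed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].rstrip("\n")
                parts = header.split(None, 1)
                record_id = parts[0] if parts else ""
                rest = parts[1] if len(parts) > 1 else ""
                new_id = record_id
                # Bump the suffix until the candidate ID has not been used yet.
                while new_id in used:
                    counts[record_id] = counts.get(record_id, 0) + 1
                    new_id = f"{record_id}{sep}{counts[record_id]}"
                used.add(new_id)
                if new_id != record_id:
                    renamed += 1
                line = ">" + new_id + (" " + rest if rest else "") + "\n"
            dst.write(line)
    return renamed
```

The description after the ID is preserved, so the locus coordinates in each header remain available for telling the renamed records apart.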

However, I can't resume the funannotate update command after that, because it complains that the getBestModel folder already exists:

FileExistsError: [Errno 17] File exists: 'CARD-0001/update_misc/getBestModel'
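`FileExistsError: [Errno 17]` is what Python's `os.mkdir` and `os.makedirs` (without `exist_ok=True`) raise when the target already exists. Assuming the update step simply tries to create `getBestModel` afresh (not confirmed in this thread), one way past this particular error is to move the existing folder aside before rerunning. The helper below is a hypothetical sketch; it only clears the `FileExistsError`, and a rerun may well regenerate transcripts.fa and hit the repeated name again.

```python
import os
import time


def move_aside(path):
    """Rename an existing directory to '<path>.bak-<timestamp>'.

    Hypothetical helper: returns the backup path, or None if 'path' does not
    exist and there is nothing to move.
    """
    if not os.path.exists(path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = f"{path.rstrip(os.sep)}.bak-{stamp}"
    os.rename(path, backup)  # keeps the manual kallisto index for inspection
    return backup
```

For example, `move_aside("CARD-0001/update_misc/getBestModel")` keeps the manually built index under a new name instead of deleting it.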