nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
314 stars 83 forks source link

funannotate annotate dbCAN error: ValueError: The ID or alternative IDs of Hit * exists in this QueryResult. #699

Closed ferninfm closed 2 years ago

ferninfm commented 2 years ago

Hi, I have run of ideas on how to solve this.

Problem: funannotate annotate breaks in the dbCAN stage.

ValueError: The ID or alternative IDs of Hit 'CE10' exists in this QueryResult.

For each run it breaks at a different ID.

What I did:

0) Upgraded to latest github version. 1) I went bottom up and tested multiple things (See below) which were not useful. 2) The analyses do run, dbCAN.txt has been populated (--cpu 1) but dbCAN.filtered.txt is not (has a header) 3) The partitioning and runs, Chunk_*.dbcan.txt have been populated (--cpu 32), also dbCAN.txt but dbCAN.filtered.txt is not (has a header) 4) The problem has to do with parsing using the SearchIO.parse module of biopython (Line 120) 5) Head of dbCAN.txt

#                                                                                                      --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name                                     accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- -----                           -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
GT69                 -            239 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -            556   7.9e-66  217.3   0.0   1   1   1.6e-68     1e-65  217.0   0.0     1   238   185   463   185   464 0.97 -
GT2_Chitin_synth_2   PF03142.14   527 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -           1109  1.8e-256  847.1   0.0   1   1  4.7e-259    3e-256  846.3   0.0    13   526   389   895   380   896 0.98 Chitin synthase
GH128                -            224 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -            290   3.2e-51  169.5   0.1   1   1   6.2e-54     4e-51  169.2   0.1     5   224    44   284    38   284 0.91 -
AA4                  -            522 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -            575   5.9e-26   86.4   0.0   1   2   5.1e-25   1.6e-22   75.1   0.0    12   261    95   337    90   359 0.79 -
AA4                  -            522 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -            575   5.9e-26   86.4   0.0   2   2   5.7e-05     0.018    8.9   0.0   486   509   530   553   498   558 0.89 -
AA7                  -            458 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -            575   1.5e-23   79.5   0.5   1   2   1.4e-25   4.6e-23   77.8   0.5     4   203   125   334   122   370 0.88 -
AA7                  -            458 GCA001599015.1ApiotrichumdomesticumJCM9580_007 -            575   1.5e-23   79.5   0.5   2   2      0.21        68   -1.8   0.0   435   454   532   551   518   553 0.82 -

6) Head of a different analysis that worked, format has not changed

#                                                                                        --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name                       accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- -----             -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
GT1                  -            382 AlectoriasarmentosaGT1_008627-T1 -           1431   1.8e-57  190.9   0.5   1   1   5.2e-60   3.3e-57  190.1   0.5     5   379   951  1364   948  1367 0.99 -
GT34                 -            246 AlectoriasarmentosaGT1_008728-T1 -            317   5.3e-71  234.3   0.0   1   1   9.9e-74   6.3e-71  234.1   0.0     2   241    63   293    62   300 0.95 -
GT2_Chitin_synth_1   PF01644.16   163 AlectoriasarmentosaGT1_008756-T1 -            946     2e-65  215.0   0.0   1   1   5.3e-68   3.4e-65  214.3   0.0     1   163   185   348   185   348 0.96 Chitin synthase
AA2                  -            255 AlectoriasarmentosaGT1_008761-T1 -            377   1.6e-60  200.0   0.0   1   2      0.29   1.9e+02   -3.4   0.0   164   191    12    39     8    56 0.60 -
AA2                  -            255 AlectoriasarmentosaGT1_008761-T1 -            377   1.6e-60  200.0   0.0   2   2   4.8e-63   3.1e-60  199.1   0.0    14   254   120   360   106   361 0.94 -

7) Upgrading to biopython 1.79 did not change the behaviour and gave a compatibility warning with eggnog mapper (requires biopython v1.76) 8) Downgrading to 1.76 also did not change things


I also checked other possibilities 1) I updated the dbCAN database using funannotate 2) I checked dbCAN.hmm and had no duplicated hmms 3) I erased the files produced by hmmpress and rebuilt them (The realised this has no influence whatsoever) 4) I run "hmmscan --domtblout eraseme_outfiles2 --cpu 12 -E 1e-17 $FUNANNOTATE_DB/dbCAN.hmm $FILE" as found in line 87 of hmmer_parallel.py. It runs without a problem 5) I thouth the problem was in line 193 when splitProts is generated (to parallelize?), but when I rerun annotate toggling --cpus 1 the problem persists


Complete error output

 funannotate annotate    -i ${FILE}    -o ${FILE}    -s ${FILE}    --sbt /cl_tmp/fernandf/CD/Cerodens.sbt --iprscan ${FILE}/*_interpro.xml    --antismash ${AS}    --phobius ${FILE}/*.phobius    --busco_db pezizomycotina    --force --cpus 32 --signalp ${FILE}/*_summary.signalp5 --eggnog ${FILE}/eggnog_*.emapper.annotations
-------------------------------------------------------
[Mar 01 03:11 PM]: OS: CentOS Linux 7, 64 cores, ~ 528 GB RAM. Python: 3.7.10
[Mar 01 03:11 PM]: Running 1.8.10
[Mar 01 03:11 PM]: Found existing output directory Apiotrichum_domesticum_JCM_9580_GCA_001599015.1. Warning, will re-use any intermediate files found.
[Mar 01 03:11 PM]: Parsing input files
[Mar 01 03:11 PM]: Existing tbl found: Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/predict_results/GCA_001599015.1_Apiotrichum_domesticum_JCM_9580.tbl
[Mar 01 03:11 PM]: Adding Functional Annotation to Apiotrichum_domesticum_JCM_9580_GCA_001599015.1, NCBI accession: None
[Mar 01 03:11 PM]: Annotation consists of: 7,737 gene models
[Mar 01 03:11 PM]: 7,363 protein records loaded
[Mar 01 03:11 PM]: Existing Pfam-A results found: Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/annotate_misc/annotations.pfam.txt
[Mar 01 03:11 PM]: 1,769 annotations added
[Mar 01 03:11 PM]: Running Diamond blastp search of UniProt DB version 2021_03
[Mar 01 03:11 PM]: 483 valid gene/product annotations from 615 total
[Mar 01 03:11 PM]: Existing Eggnog-mapper results found: Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/annotate_misc/eggnog.emapper.annotations
[Mar 01 03:11 PM]: Parsing EggNog Annotations
[Mar 01 03:11 PM]: EggNog version parsed as 2.1.3
[Mar 01 03:11 PM]: 13,404 COG and EggNog annotations added
[Mar 01 03:11 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.72
[Mar 01 03:11 PM]: 2,178 gene name and product description annotations added
[Mar 01 03:11 PM]: Existing MEROPS results found: Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/annotate_misc/annotations.merops.txt
[Mar 01 03:11 PM]: 239 annotations added
[Mar 01 03:11 PM]: Annotating CAZYmes using HMMer search of dbCAN version 10.0
Traceback (most recent call last):
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/funannotate/aux_scripts/hmmer_parallel.py", line 193, in <module>
    dbCANsearch(splitProts, args.cpus, 1e-17, args.input, args.out)
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/funannotate/aux_scripts/hmmer_parallel.py", line 120, in dbCANsearch
    for qresult in SearchIO.parse(results, "hmmscan3-domtab"):
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/Bio/SearchIO/__init__.py", line 306, in parse
    yield from generator
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 33, in __iter__
    yield from self._parse_qresult()
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/Bio/SearchIO/HmmerIO/hmmer3_domtab.py", line 151, in _parse_qresult
    qresult = QueryResult(hit_list, prev_qid)
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/Bio/SearchIO/_model/query.py", line 206, in __init__
    self.append(hit)
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/Bio/SearchIO/_model/query.py", line 472, in append
    % hit_key
ValueError: The ID or alternative IDs of Hit 'CE10' exists in this QueryResult.
Traceback (most recent call last):
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/bin/funannotate", line 8, in <module>
    sys.exit(main())
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/funannotate/funannotate.py", line 711, in main
    mod.main(arguments)
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/funannotate/annotate.py", line 979, in main
    num_annotations = lib.line_count(dbCAN_out)
  File "/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/lib/python3.7/site-packages/funannotate/library.py", line 974, in line_count
    with open(fname) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/annotate_misc/annotations.dbCAN.txt'

funannotate check

-------------------------------------------------------
Checking dependencies for 1.8.10
-------------------------------------------------------
You are running Python v 3.7.10. Now checking python packages...
biopython: 1.78
goatools: 1.0.15
matplotlib: 3.3.4
natsort: 7.1.1
numpy: 1.20.1
pandas: 1.2.3
psutil: 5.8.0
requests: 2.25.1
scikit-learn: 0.24.1
scipy: 1.5.3
seaborn: 0.11.1
All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.855
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.39
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.29
threads: 2.15
threads::shared: 1.56
All 28 Perl modules installed

Checking Environmental Variables...
$FUNANNOTATE_DB=/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/database
$PASAHOME=/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/opt/pasa-2.4.1
$TRINITYHOME=/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/opt/trinity-2.8.5
$EVM_HOME=/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/opt/evidencemodeler-1.1.1/
$AUGUSTUS_CONFIG_PATH=/usr/people/EDVZ/fernandf/miniconda3/envs/funannotate_1.8.4/config
$GENEMARK_PATH=/usr/people/EDVZ/fernandf/1_Programs/00_funannotate/gm_et_linux_64/gmes_petap/
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.3
bamtools: bamtools 2.5.1
bedtools: bedtools v2.30.0
blat: BLAT v36
diamond: 2.0.7
emapper.py: 2.1.3
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
gmes_petap.pl: 4.33
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 11.0.8-internal
kallisto: 0.46.1
mafft: v7.475 (2020/Nov/23)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
pigz: pigz 2.6
proteinortho: 6.0.16
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.10
signalp: 5.0b
snap: 2006-07-28
stringtie: 2.1.5
tRNAscan-SE: 2.0.7 (Oct 2020)
tantan: tantan 26
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 37 external dependencies are installed
hyphaltip commented 2 years ago

Those are long protein names - I don't know if that is problem but usually the IDs will be just the GT1_008627-T1 or JCM9580_007 - shouldn't break the parser necessarily but that looks odd to me

ferninfm commented 2 years ago

Dear Jason, I see your point. I tried to batch reprocess some published genomes without loosing information and was obviously too generous with the names. I can imagine the point in GCA001599015.1 could be a source of problems Let's see if I can solve this today.

Fernando

ferninfm commented 2 years ago

So I renamed proteins using sed in all files then repeated annotate and the error remains.

Somehow funannotate annotate that degrades the sequence names while copying the fasta files from predict_results to annotate_misc. A space is included in within the fasta name string

predict_results


==> Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/predict_results/GCA_001599015.1_Apiotrichum_domesticum_JCM_9580.proteins.fa <==
>GCA001599015.1ApiotrichumdomesticumJCM9580_000001-T1 GCA001599015.1ApiotrichumdomesticumJCM9580_000001

annotate misc

==> Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/annotate_misc/genome.proteins.fasta <==
>GCA001599015.1ApiotrichumdomesticumJCM9580_000 001-T1 GCA001599015.1ApiotrichumdomesticumJCM9580_000 001

This happens irrespective of name size

==> predict_results/GCA_002973495.1_Apiotrichum_akiyoshidainum_HP2023.mrna-transcripts.fa <==
>GCA002973495_000001-T1 GCA002973495_000001
==> annotate_misc/genome.transcripts.fasta <==
>GCA002973495_ 000001-T1 GCA002973495_ 000001

This damages both the PFam and dbCan result files, although the first does not detect an error. I checked and never had it before this time, but I can reproduce it in two different installations of funannotate 1.7.4 and latest in two operating systems. So it may be my script (?)

I reconstituted the transcript and protein files using sed to replace 's/ //g' by hand and tried to run not using --force but the files are rewritten anyway. The spaces fallin the same place in each naming convention but in different places depending on the length of the name (long name 9 characters from the end, short name 7 characters...)

I am very puzzled with this

nextgenusfs commented 2 years ago

I'm not sure what is going on here -- but certainly the names are too long to be parsed by biopython HMM parser that is why it is failing. I would delete results from funannotate annotate and then re-run funannotate predict with something more logical in the --name field, ie if you wanted to use the strain isolate name that would work: --name HP2023 -- this should re-use the existing data you have for predict but generate better gene model names.

It seems you must have run --name GCA001599015.1ApiotrichumdomesticumJCM9580 on the first pass? I didn't have a length limit on the name variable as I never imagined it would be a problem -- its designed to use NCBI locus_tag identifier which is at least 3 alpha numeric characters (for genome submissions it is assigned by NCBI if you don't provide your own). The proposal is here

The locustag prefix can contain only alpha-numeric characters and it must be at least 3 characters long. It should start with a letter, but numerals can be in the 2nd position or later in the string. (ex. A1C). There should be no symbols, such as -* in the prefix. The locustag prefix is to be separated from the tag value by an underscore ‘’, eg A1C_00001.

Most commonly now it is 4 or 5 characters, AB01.

ferninfm commented 2 years ago

Hi Jon,

Thanks for clarifying the locus_tag ideas.

For the record the length of the locus tag is not the problem. I erased the annotate results and rerun, and still when it runs it inserts spaces within the names of the fasta mrna.transcript and protein files, which is wat causes the parsing error.

As for the long locus_tag, it is not mine, It was created by predict because i specified --species, --strain and --isolate, but forgot to specify the locus tag with --name, so they all get pasted as locus tag (without underscores).

I did rerun predict with --name and repeated annotate. Somehow the files are not altered that way, although the files in predict_results look identical....

Thanks

nextgenusfs commented 2 years ago

The default for the --name parameter in predict is FUN, so something else must have happened during the first run perhaps a mistake in a wrapper script or something of that nature.

The fasta headers are printed with a space as it is a common format, ">transcriptid genename". Biopython will parse this as rec.id == transcriptid and rec.description == genename.

ferninfm commented 2 years ago

Hi, My mistake I did in fact specify the long --name in the batch submission, you were roght all along... I mean this has become an absolutely irrelevant issue, more to do with me trying to batch submt to the cluster instead of specifying the parameters one by one... However

That the fasta headers are printed with a space is clear, I have eyes to see. The question is how and why are additional systematically placed spaces appearing within the transcript and genenames, not between them. Spaces that otherwise did not exist in the predict_results but are generated when copying/parsing those files into the annotate_misc input files.

I think that behaviour may be caused by having underscores in the name? Anyway. Irrelevant issue.

Thanks for the help I was a stuck trying to troubleshoot a simple mistake.

Take care

nextgenusfs commented 2 years ago

Yes I think you are right that the underscores are problematic for the locus tag. This is because funannotate assumes each locus rage will have one underscore between the locus tag prefix and the numerical portion -- this is defined in NCBIs docs so that's why it uses that convention.