huangnengCSU / compleasm

A genome completeness evaluation tool based on miniprot
Apache License 2.0

KeyError: 'Target_Species' #12

Open larsmoret opened 10 months ago

larsmoret commented 10 months ago

Dear all, I must say I am quite intrigued by compleasm compared to BUSCO.

However, I came across an error while trying to run it, and I have no idea where to look. While running compleasm, it suddenly stops and displays KeyError: 'Target_species'.

Has anyone had the same issue or any idea where the problem might be?

Thanks in advance, Lars Moret

P.S. This is my entire log. Please note that I installed compleasm using conda.

(checker) lmoret@ubuntudesktopc:/data/volume_2$ compleasm run -a finalassemblies/CBS1922.fasta -o compleasmoutput/CBS1922 -l fungi -t 14
Searching for miniprot in the path where compleasm.py is located
Searching for miniprot in the current execution path
Searching for hmmsearch in the path where compleasm.py is located
Searching for hmmsearch in the current execution path
miniprot execute command: /data/volume_2/compleasm_kit/miniprot
lineage: fungi_odb10
hmmsearch execute command: /data/volume_2/compleasm_kit/hmmsearch
Traceback (most recent call last):
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Target_species'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lmoret/miniconda3/envs/checker/bin/compleasm", line 10, in <module>
    sys.exit(main())
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 2534, in main
    args.func(args)
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 2426, in run
    mr.Run()
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 2142, in Run
    miniprot_alignment_parser.Run()
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 1158, in Run
    self.Run_busco_mode()
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 1234, in Run_busco_mode
    filtered_species = records_df["Target_species"].unique()
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'Target_species'
(checker) 1 lmoret@ubuntudesktopc:/data/volume_2$

huangnengCSU commented 10 months ago

Hi @larsmoret, could you list the files (with their sizes) under the "fungi_odb10" directory of the output folder?

larsmoret commented 10 months ago

(checker) 130 lmoret@ubuntudesktopc:~/data/volume_2$ ls compleasmoutput/CBS1922/fungi_odb10/
hmmer_output  hmmsearch.done  miniprot.done  miniprot_output.gff  translated_protein.fasta

Total file size is:
25M   compleasmoutput/CBS1922/fungi_odb10

Per file:
1.5M  compleasmoutput/CBS1922/fungi_odb10/hmmer_output/
0     compleasmoutput/CBS1922/fungi_odb10/hmmsearch.done
0     compleasmoutput/CBS1922/fungi_odb10/miniprot.done
24M   compleasmoutput/CBS1922/fungi_odb10/miniprot_output.gff
176K  compleasmoutput/CBS1922/fungi_odb10/translated_protein.fasta
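For anyone asked to provide the same information, the listing can also be produced with a few lines of Python. This is only a sketch: the function names `list_sizes` and `empty_files` are made up for illustration, and treating `*.done` files as harmless zero-byte markers is an assumption based on the listings in this thread, not on compleasm's documentation.

```python
from pathlib import Path


def list_sizes(lineage_dir):
    """Return sorted (relative path, size in bytes) pairs for every file under lineage_dir."""
    root = Path(lineage_dir)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


def empty_files(lineage_dir, ignore_suffixes=(".done",)):
    """Return names of zero-byte files, skipping marker files such as *.done (assumed to be empty by design)."""
    return [
        name
        for name, size in list_sizes(lineage_dir)
        if size == 0 and not name.endswith(ignore_suffixes)
    ]
```

Calling `empty_files("compleasmoutput/CBS1922/fungi_odb10")` would flag, for example, a zero-byte `translated_protein.fasta`, which is a quick hint that an earlier step produced no usable output.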

katiecdillon commented 10 months ago

Hello,

I am running into the same issue as @larsmoret. Attached is my submission script: SCRIPT_miniBUSCO_20231106_v1.txt

Here are the contents of the "arthropoda_odb10" directory:

-rw-r--r-- 1 kcd88651 tcglab 9676547 Nov 4 17:49 miniprot_output.gff
-rw-r--r-- 1 kcd88651 tcglab       0 Nov 4 17:49 miniprot.done
-rw-r--r-- 1 kcd88651 tcglab       0 Nov 4 17:49 hmmsearch.done
drwxr-xr-x 2 kcd88651 tcglab    4096 Nov 4 17:49 hmmer_output
-rw-r--r-- 1 kcd88651 tcglab       0 Nov 6 11:48 translated_protein.fasta

This is my error output:

Traceback (most recent call last):
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Target_species'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kcd88651/.conda/envs/compleasm/bin/compleasm", line 10, in <module>
    sys.exit(main())
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 2534, in main
    args.func(args)
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 2426, in run
    mr.Run()
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 2142, in Run
    miniprot_alignment_parser.Run()
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 1158, in Run
    self.Run_busco_mode()
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/compleasm.py", line 1234, in Run_busco_mode
    filtered_species = records_df["Target_species"].unique()
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'Target_species'

Traceback (most recent call last):
  File "/home/kcd88651/.conda/envs/compleasm/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Target_species'

huangnengCSU commented 10 months ago

Hi @katiecdillon

Thanks for providing the script. Could you specify a different output folder name for each input assembly, instead of using "$D2" for all the assemblies?
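One way to follow this suggestion when looping over many assemblies is to derive the output folder from each assembly's file name. The sketch below only builds the command lines and does not run them. The helper name `build_commands` and the "one subfolder per file stem" layout are illustrative assumptions, not part of compleasm; the flags are the ones used in the commands earlier in this thread.

```python
from pathlib import Path


def build_commands(assemblies, lineage, out_root, threads):
    """Build one `compleasm run` command per assembly, each with its own output folder."""
    commands = []
    seen = set()
    for asm in assemblies:
        # Assumption: the file stem (e.g. CBS1922 for CBS1922.fasta) is unique per assembly.
        out_dir = Path(out_root) / Path(asm).stem
        if out_dir in seen:
            raise ValueError(f"two assemblies map to the same output folder: {out_dir}")
        seen.add(out_dir)
        commands.append([
            "compleasm", "run",
            "-a", str(asm),
            "-o", str(out_dir),
            "-l", lineage,
            "-t", str(threads),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Failing early on a duplicate folder avoids the situation where several runs silently overwrite each other's intermediate files.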

huangnengCSU commented 10 months ago

Hi @larsmoret @katiecdillon ,

I have added some checks to the code to help diagnose what went wrong. The reason for the KeyError "Target_species" is that there are no candidate alignment hits satisfying the BUSCO threshold. Could you clone the source code and re-run the failed case in the existing compleasm env?

e.g.

git clone https://github.com/huangnengCSU/compleasm.git
cd compleasm
python compleasm.py run -a $input_asm -l $lineage -o $output_folder -t $threads

Thanks!
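For readers wondering how "no candidate hits" turns into a KeyError: in pandas, a DataFrame built from an empty list of records has no columns at all, so selecting a column by name fails. The sketch below reproduces that behaviour in isolation and shows one possible guard. It is an assumption about the mechanism, not a copy of compleasm's code; only the column name "Target_species" comes from the traceback, and `unique_targets` is a hypothetical helper.

```python
import pandas as pd


def unique_targets(records):
    """Return the unique Target_species values, or an empty list when there are no records."""
    df = pd.DataFrame(records)
    # With zero records the frame has no columns, so df["Target_species"]
    # would raise KeyError: 'Target_species'. Check before indexing.
    if df.empty or "Target_species" not in df.columns:
        return []
    return df["Target_species"].unique().tolist()
```

With this guard, the caller can print a clear "no reliable hits" warning instead of surfacing a pandas traceback.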

larsmoret commented 10 months ago

Hi @huangnengCSU

I've tried it, and now it loads fungi_odb10, but it cannot build the index.

Thanks in advance,

(checker) 2 lmoret@ubuntudesktopc:~/data/volume_2/compleasm$ compleasm run -a ~/finalassemblies/CBS1922.fasta -l fungi -o ~/compleasmoutput/ -t 14
Searching for miniprot in the path where compleasm.py is located
Searching for miniprot in the current execution path
Searching for miniprot in $PATH
Searching for hmmsearch in the path where compleasm.py is located
Searching for hmmsearch in the current execution path
Searching for hmmsearch in $PATH
miniprot execute command: /home/lmoret/miniconda3/envs/checker/bin/miniprot
Success download from https://busco-data.ezlab.org/v5/data/file_versions.tsv
Success download from https://busco-data.ezlab.org/v5/data/placement_files/list_of_reference_markers.eukaryota_odb10.2019-12-16.txt.tar.gz
Placement file extraction path: mb_downloads/placement_files/list_of_reference_markers.eukaryota_odb10.2019-12-16.txt
Success download from https://busco-data.ezlab.org/v5/data/placement_files/mapping_taxid-lineage.eukaryota_odb10.2019-12-16.txt.tar.gz
Placement file extraction path: mb_downloads/placement_files/mapping_taxid-lineage.eukaryota_odb10.2019-12-16.txt
Success download from https://busco-data.ezlab.org/v5/data/placement_files/mapping_taxids-busco_dataset_name.eukaryota_odb10.2019-12-16.txt.tar.gz
Placement file extraction path: mb_downloads/placement_files/mapping_taxids-busco_dataset_name.eukaryota_odb10.2019-12-16.txt
Success download from https://busco-data.ezlab.org/v5/data/placement_files/supermatrix.aln.eukaryota_odb10.2019-12-16.faa.tar.gz
Placement file extraction path: mb_downloads/placement_files/supermatrix.aln.eukaryota_odb10.2019-12-16.faa
Success download from https://busco-data.ezlab.org/v5/data/placement_files/tree.eukaryota_odb10.2019-12-16.nwk.tar.gz
Placement file extraction path: mb_downloads/placement_files/tree.eukaryota_odb10.2019-12-16.nwk
Success download from https://busco-data.ezlab.org/v5/data/placement_files/tree_metadata.eukaryota_odb10.2019-12-16.txt.tar.gz
Placement file extraction path: mb_downloads/placement_files/tree_metadata.eukaryota_odb10.2019-12-16.txt
Success download from https://busco-data.ezlab.org/v5/data/lineages/eukaryota_odb10.2020-09-10.tar.gz
Lineage file extraction path: mb_downloads/eukaryota_odb10
Success download from https://busco-data.ezlab.org/v5/data/lineages/fungi_odb10.2021-06-28.tar.gz
Lineage file extraction path: mb_downloads/fungi_odb10
lineage: fungi_odb10
[ERROR] failed to open/build the index
Traceback (most recent call last):
  File "/home/lmoret/miniconda3/envs/checker/bin/compleasm", line 10, in <module>
    sys.exit(main())
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 2534, in main
    args.func(args)
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 2426, in run
    mr.Run()
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 2120, in Run
    alignment_output_dir)
  File "/home/lmoret/miniconda3/envs/checker/lib/python3.7/site-packages/compleasm.py", line 304, in run_miniprot
    raise Exception("miniprot exited with non-zero exit code: {}".format(exitcode))
Exception: miniprot exited with non-zero exit code: 1

huangnengCSU commented 10 months ago

To @larsmoret

The error "failed to open/build the index" is reported by miniprot itself. You can test the alignment manually by running "miniprot --trans -u -I --outs=0.95 -t 20 --gff ~/finalassemblies/CBS1922.fasta mb_downloads/fungi_odb10/refseq_db.faa.gz > out.gff". I suspect the problem occurs while building the index of the genome.
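Before re-running miniprot by hand, it may be worth ruling out that the path given to -a simply does not resolve to a readable FASTA file. The sketch below is a generic pre-flight check written for this thread. `check_fasta` is a hypothetical helper, and it does not replicate whatever validation miniprot performs internally.

```python
import gzip
from pathlib import Path


def check_fasta(path):
    """Return a list of human-readable problems found with a FASTA path (empty list if none)."""
    p = Path(path).expanduser()  # expand "~" the way the shell would
    if not p.is_file():
        return [f"{p} does not exist or is not a regular file"]
    if p.stat().st_size == 0:
        return [f"{p} is empty"]
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", errors="replace") as handle:
        first_line = handle.readline()
    if not first_line.startswith(">"):
        return [f"{p} does not start with a FASTA header line ('>')"]
    return []
```

An empty result does not prove the index can be built, but a non-empty one points to a path or file problem rather than to the assembly's content.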

katiecdillon commented 10 months ago

Hello @huangnengCSU it looks like the output directory was in fact the issue. Thank you!

larsmoret commented 10 months ago

Hi @huangnengCSU, I've tried it again and manually downloaded the dependencies again; however, I'm still facing difficulties. The most interesting part of the log is shown below. Could it have to do with the quality of the assembly?

Kind regards, Lars Moret

[M::main] CMD: /data/volume_2/compleasm_kit/miniprot --trans -u -I --outs=0.95 -t 14 --gff finalassemblies/CBS.fasta mb_downloads/eukaryota_odb10/refseq_db.faa.gz
[M::main] Real time: 72.284 sec; CPU: 957.367 sec; Peak RSS: 0.219 GB
hmmsearch execute command: /data/volume_2/compleasm_kit/hmmsearch
Warning: no reliable mappings found. All candidates do not pass the cutoff of BUSCO gene.
Warning: No reliable hits found! Check the lineage file: eukaryota_odb10, alignment file: compleasmoutput/CBS/eukaryota_odb10/miniprot_output.gff, hmmsearch output folder: compleasmoutput/CBS/eukaryota_odb10/hmmer_output.

S:0.00%, 0
D:0.00%, 0
F:0.00%, 0
I:0.00%, 0
M:100.00%, 255
N:255

Download lineage: 0.00(s)

Run miniprot: 72.29(s)

Analyze miniprot: 46.34(s)

Total runtime: 118.63(s)
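When many assemblies are processed, a result like the one above (every gene missing) is easy to overlook. The following sketch parses the summary as it appears in this log and flags the all-missing case. The function names are illustrative, and the pattern is inferred from this one log, so it may need adjusting for other compleasm versions.

```python
import re


def parse_summary(text):
    """Parse 'S:0.00%, 0 ... M:100.00%, 255 N:255' into a dict of counts and percentages."""
    summary = {}
    for match in re.finditer(r"\b([SDFIM]):(\d+(?:\.\d+)?)%,\s*(\d+)", text):
        summary[match.group(1)] = {
            "percent": float(match.group(2)),
            "count": int(match.group(3)),
        }
    total = re.search(r"\bN:(\d+)", text)
    if total:
        summary["N"] = int(total.group(1))
    return summary


def all_missing(summary):
    """True when every gene in the lineage set (N) is counted as missing (M)."""
    return "M" in summary and "N" in summary and summary["M"]["count"] == summary["N"]
```

For the log above, `all_missing(parse_summary(...))` is true, which usually calls for checking the input path, the lineage choice, and the assembly itself before trusting the number.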

huangnengCSU commented 10 months ago

Hi @larsmoret,

All BUSCO genes are reported as missing because no gene could be aligned to the assembly and pass BUSCO's threshold, which means the reference genes are quite different from the assembled sequence. The cause may be the quality of the assembly or the choice of the wrong lineage. In addition, if the assembly is highly divergent from the reference proteins, miniprot may not align well. Did you try BUSCO, and if so, what was its assessment result?