idrblab / AnnoPRO

Feature map and function annotation of Proteins
MIT License

ValueError("All arrays must be of the same length") #12

Open smilenaderi opened 1 year ago

smilenaderi commented 1 year ago

Bug Description

I tried to run it on the following FASTA file, and it gives me the error shown in the logs below:

>seq-2
MKKKKKKKLKKLKKKLKKKLKKKKKLLLLLLLLKKKKKKK
>seq-9
MKKKIKKIKKKIEKKKKKKLKKLKKKKKKKKLLLLLLLLL
>seq-10
MSEKFSEIAEKYDEERILSRSAGELAELTRELGLKPGDRVLDVGCGTGYLTLPLAERVGPEGTVIGIDRSEEMLARARERAAAAGLSNVEFQVADAEALPFPDESFDLVTCRLVLHHLPDPAKALREMRRVLKPGGRFVVSDWDASSMAFPDEEAELAERLRRYAEARAAAGGERDALRRALEAAGFRDVTVRSLTAWRRRAGEAAAAAL
>seq-13
MKKKKKLKKKLKKKKKKKK

Runtime Environment

Fresh install of requirements

Logs

annopro -i test_proteins.fasta -o output-test
Download cafa4.dmnd...
100% [........................................................................] 46988123 / 46988123
Validate md5sum of cafa4.dmnd...
diamond v2.1.0.154 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 4
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: output-test
#Target sequences to report alignments for: 25
Opening the database...  [0.042s]
Database: /home/ubuntu/.annopro/data/cafa4.dmnd (type: Diamond database, sequences: 87514, letters: 44798577)
Block size = 2000000000
Opening the input file...  [0s]
Opening the output file...  [0s]
Loading query sequences...  [0s]
Masking queries...  [0.001s]
Algorithm: Double-indexed
Building query histograms...  [0s]
Loading reference sequences...  [0.055s]
Masking reference...  [0.588s]
Initializing temporary storage...  [0s]
Building reference histograms...  [0.493s]
Allocating buffers...  [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 1/4.
Building reference seed array...  [0.163s]
Building query seed array...  [0s]
Computing hash join...  [0.004s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 2/4.
Building reference seed array...  [0.192s]
Building query seed array...  [0s]
Computing hash join...  [0.002s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 3/4.
Building reference seed array...  [0.213s]
Building query seed array...  [0s]
Computing hash join...  [0.003s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 4/4.
Building reference seed array...  [0.154s]
Building query seed array...  [0s]
Computing hash join...  [0.003s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 1/4.
Building reference seed array...  [0.155s]
Building query seed array...  [0s]
Computing hash join...  [0.003s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 2/4.
Building reference seed array...  [0.19s]
Building query seed array...  [0s]
Computing hash join...  [0.003s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 3/4.
Building reference seed array...  [0.211s]
Building query seed array...  [0s]
Computing hash join...  [0.002s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 4/4.
Building reference seed array...  [0.154s]
Building query seed array...  [0s]
Computing hash join...  [0.004s]
Masking low complexity seeds...  [0s]
Searching alignments...  [0s]
Deallocating memory...  [0s]
Deallocating buffers...  [0.004s]
Clearing query masking...  [0s]
Computing alignments... Loading trace points...  [0.001s]
Sorting trace points...  [0s]
Computing alignments...  [0s]
Deallocating buffers...  [0s]
Loading trace points...  [0s]
 [0.002s]
Deallocating reference...  [0.002s]
Loading reference sequences...  [0s]
Deallocating buffers...  [0s]
Deallocating queries...  [0s]
Loading query sequences...  [0s]
Closing the input file...  [0s]
Closing the output file...  [0s]
Closing the database...  [0.002s]
Cleaning up...  [0s]
Total time = 2.766s
Reported 21 pairwise alignments, 21 HSPs.
1 queries aligned.
Invalid feature 0.6934-309 for seq-13 at line 596
Invalid feature 0.6934-309 for seq-13 at line 596
Invalid feature 0.5127-315 for seq-13 at line 596
Invalid feature 0.5127-315 for seq-13 at line 596
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/annopro/bin/annopro", line 8, in <module>
    sys.exit(console_main())
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/annopro/__init__.py", line 27, in console_main
    main(
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/annopro/__init__.py", line 71, in main
    process(
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/annopro/data_procession/__init__.py", line 8, in process
    data = Data_process(protein_file=profeat_file,
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/annopro/data_procession/data_predict.py", line 36, in __init__
    self.__data__()
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/annopro/data_procession/data_predict.py", line 39, in __data__
    proteins_f = profeat_to_df(self.protein_file)
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/profeat/__init__.py", line 69, in profeat_to_df
    return pd.DataFrame(feature_list).T
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/pandas/core/frame.py", line 636, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 120, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/home/ubuntu/anaconda3/envs/annopro/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 674, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

swallow-design commented 1 year ago

The error is likely due to a problem with profeat when calculating protein features, possibly because profeat cannot recognize your input sequence. If it is convenient for you, please provide us with the complete sequence file for analysis or use our website: https://idrblab.org/annopro

swallow-design commented 1 year ago

We recently reproduced the same error during testing and found that the input contained multiple protein sequences with the same ID. You may have run into the same problem, so it is worth checking your file for duplicate IDs.
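
For anyone who wants to check their own input for this, here is a minimal sketch in plain Python. It does not use any AnnoPRO or profeat API; `find_duplicate_ids` is a hypothetical helper name, and it assumes the sequence ID is the first whitespace-delimited token after `>` on each header line.

```python
from collections import Counter


def find_duplicate_ids(fasta_text):
    """Return the sequence IDs that appear more than once in FASTA text.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical example input with a repeated ID.
    example = ">seq-1\nMKKL\n>seq-2\nMKKI\n>seq-1 copy\nMKKL\n"
    print(find_duplicate_ids(example))
```

An empty list means every header ID is unique. The four IDs in the file posted at the top of this issue (seq-2, seq-9, seq-10, seq-13) are all distinct, so duplicate IDs alone would not explain that particular report.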

Jialeen commented 7 months ago

> The error is likely due to a problem with profeat when calculating protein features, possibly because profeat cannot recognize your input sequence. If it is convenient for you, please provide us with the complete sequence file for analysis or use our website: https://idrblab.org/annopro

I ran it with the data from https://idrblab.org/annopro, but the problem still exists: ValueError: All arrays must be of the same length

1813805349 commented 7 months ago

> > The error is likely due to a problem with profeat when calculating protein features, possibly because profeat cannot recognize your input sequence. If it is convenient for you, please provide us with the complete sequence file for analysis or use our website: https://idrblab.org/annopro
>
> I ran it with the data from https://idrblab.org/annopro, but the problem still exists: ValueError: All arrays must be of the same length

This problem is caused by amino acid sequences shorter than 30 residues, which fail during the profeat step.
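
One possible workaround is to drop short sequences before running annopro. The following is only a sketch under that assumption: the 30-residue cutoff is taken from the comment above rather than from any documented profeat constant, and `MIN_LENGTH`, `parse_fasta` and `split_by_length` are hypothetical names that are not part of AnnoPRO or profeat.

```python
MIN_LENGTH = 30  # assumed cutoff, taken from this thread, not from profeat docs


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_length(records, min_length=MIN_LENGTH):
    """Split records into (kept, dropped) by sequence length."""
    kept = [(h, s) for h, s in records if len(s) >= min_length]
    dropped = [(h, s) for h, s in records if len(s) < min_length]
    return kept, dropped


if __name__ == "__main__":
    # Hypothetical input: one sequence long enough, one too short.
    text = ">long\n" + "MK" * 20 + "\n>short\nMKKKKKLKKK\n"
    kept, dropped = split_by_length(parse_fasta(text))
    for header, seq in kept:
        print(f">{header}\n{seq}")
    print("dropped:", [h for h, _ in dropped])
```

In the original report, seq-13 is shorter than 30 residues, which is consistent with the "Invalid feature ... for seq-13" lines printed just before the traceback.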