jackievaleri / BioAutoMATED

Automated machine learning for analyzing, interpreting, and designing biological sequences
MIT License

IndexError: single positional indexer is out-of-bounds #8

Open shrutiOx opened 2 days ago

shrutiOx commented 2 days ago

Hello,

I hope you are well. I would like to ask about the cause of the error in the title, which occurs when I try to evaluate an independent test dataset of proteins (padded to the maximum length). The code details are below.

Thank you very much for your kind help and also for this great package!

CODE:

data_folder = './clean_data/clean/'
data_file = 'test_preacrs.csv'

input_col = 'sequence'
target_col = 'Labels'
pad_seqs = 'max'
augment_data = 'none'
sequence_type = 'protein'

model_folder = './exemplars/test/models/'
output_folder = './exemplars/test/outputs/'
model_type = 'autokeras'
task = 'binary_classification'
class_of_interest = 1 # 1 for binary classification typically

cutoff_true = 1
cutoff_pred = 0.5 # use 0.5 as predicted ys cut-off, since they will max out at 1

read_in_format_data_and_pred(task, data_folder, data_file, input_col, target_col, pad_seqs, augment_data, sequence_type, model_type, model_folder, output_folder, class_of_interest = class_of_interest, cutoff_true = cutoff_true, cutoff_pred = cutoff_pred);

ERROR:

Warning: Unknown letter(s) " " found in sequence
Example of bad letter : LKKTIEKLLNSDLNSNYIAKKTGVEQSTIYRLRTGERQLGKLGLDSAERLYNYQKEIE NMKSVKYISNMSKQEKGYRVYVNVVNEDTDKGFLFPSVPKEVIENDKIDELFNFEH HKPYVQKAKSRYDKNGIGYKIVQLDEGFQKFIELNKEKMKENLDY
Padding all sequences to a length of 348
Confirmed: No data augmentation requested
Confirmed: Scrambled control generated.

IndexError Traceback (most recent call last)

in
     22 cutoff_pred = 0.5 # use 0.5 as predicted ys cut-off, since they will max out at 1
     23
---> 24 read_in_format_data_and_pred(task, data_folder, data_file, input_col, target_col, pad_seqs, augment_data, sequence_type, model_type, model_folder, output_folder, class_of_interest = class_of_interest, cutoff_true = cutoff_true, cutoff_pred = cutoff_pred);

/BioAutoMATED/main_classes/transfer_learning_helpers.py in read_in_format_data_and_pred(task, data_folder, data_file, input_col, target_col, pad_seqs, augment_data, sequence_type, model_type, model_folder, output_folder, class_of_interest, cutoff_true, cutoff_pred, stats)
    244 preds = AutoMLBackend.generic_predict(oh_data_input, numerical_data_input, model_type, final_model_path, final_model_name)
    245 preddf = pd.DataFrame(preds)
--> 246 y_pred = preddf.iloc[:,class_of_interest]
    247 y_true = list(df_data_output.iloc[:,0])
    248 if stats:

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    871 # AttributeError for IntervalTree get_value
    872 pass
--> 873 return self._getitem_tuple(key)
    874 else:
    875 # we by definition only have the 0th axis

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1441 def _getitem_tuple(self, tup: Tuple):
   1442
-> 1443 self._has_valid_tuple(tup)
   1444 try:
   1445 return self._getitem_lowerdim(tup)

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    700 raise IndexingError("Too many indexers")
    701 try:
--> 702 self._validate_key(k, i)
    703 except ValueError as err:
    704 raise ValueError(

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1350 return
   1351 elif is_integer(key):
-> 1352 self._validate_integer(key, axis)
   1353 elif isinstance(key, tuple):
   1354 # a tuple should already have been caught by this point

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis)
   1435 len_axis = len(self.obj._get_axis(axis))
   1436 if key >= len_axis or key < -len_axis:
-> 1437 raise IndexError("single positional indexer is out-of-bounds")
   1438
   1439 # -------------------------------------------------------------------

IndexError: single positional indexer is out-of-bounds

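The warning at the top of this output reports a space character inside the protein sequences. A minimal sketch of removing whitespace from the sequence column before evaluation follows. It assumes the CSV is loaded with pandas and that the spaces are formatting residue rather than meaningful characters. The helper name `strip_sequence_whitespace` is hypothetical and not part of BioAutoMATED, and the output above does not establish whether the warning is related to the IndexError.

```python
import pandas as pd


def strip_sequence_whitespace(df, input_col):
    """Return a copy of df with all whitespace removed from the sequence column.

    Hypothetical helper for illustration; not a BioAutoMATED function.
    """
    cleaned = df.copy()
    cleaned[input_col] = (
        cleaned[input_col].astype(str).str.replace(r"\s+", "", regex=True)
    )
    return cleaned


# Toy data standing in for the real CSV; the column names match the issue.
df = pd.DataFrame({"sequence": ["LKKTIE NMKSV HKPY", "MKV"], "Labels": [1, 0]})
cleaned = strip_sequence_whitespace(df, "sequence")
print(cleaned["sequence"].tolist())
```

The cleaned frame could then be written back with `to_csv` and passed to the evaluation call in place of the original file.
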
shrutiOx commented 2 days ago

Adding a note:

The same error also occurs in the following case.

data_folder = './clean_data/clean/'
data_file = 'small_synthetic.csv'

pad_seqs = 'none'
augment_data = 'none'

input_col = 'seq'
target_col = 'positive_score'
sequence_type = 'nucleic_acid'

model_folder = './exemplars/test/models/'
output_folder = './exemplars/test/outputs/'
model_type = 'autokeras'
task = 'binary_classification'
class_of_interest = 1 # 1 for binary classification typically

cutoff_true = 1
cutoff_pred = 0.5 # use 0.5 as predicted ys cut-off, since they will max out at 1

read_in_format_data_and_pred(task, data_folder, data_file, input_col, target_col, pad_seqs, augment_data, sequence_type, model_type, model_folder, output_folder, class_of_interest = class_of_interest, cutoff_true = cutoff_true, cutoff_pred = cutoff_pred);

ERROR:

Confirmed: All sequence characters are in alphabet
Confirmed: No need to pad or truncate, all sequences same length
Confirmed: No data augmentation requested
Confirmed: Scrambled control generated.

IndexError Traceback (most recent call last)

in
     22 cutoff_true = 1
     23 cutoff_pred = 0.5 # use 0.5 as predicted ys cut-off, since they will max out at 1
---> 24 read_in_format_data_and_pred(task, data_folder, data_file, input_col, target_col, pad_seqs, augment_data, sequence_type, model_type, model_folder, output_folder, class_of_interest = class_of_interest, cutoff_true = cutoff_true, cutoff_pred = cutoff_pred);
     25
     26 # # Peptides - regression model example

/BioAutoMATED/main_classes/transfer_learning_helpers.py in read_in_format_data_and_pred(task, data_folder, data_file, input_col, target_col, pad_seqs, augment_data, sequence_type, model_type, model_folder, output_folder, class_of_interest, cutoff_true, cutoff_pred, stats)
    244 preds = AutoMLBackend.generic_predict(oh_data_input, numerical_data_input, model_type, final_model_path, final_model_name)
    245 preddf = pd.DataFrame(preds)
--> 246 y_pred = preddf.iloc[:,class_of_interest]
    247 y_true = list(df_data_output.iloc[:,0])
    248 if stats:

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    871 # AttributeError for IntervalTree get_value
    872 pass
--> 873 return self._getitem_tuple(key)
    874 else:
    875 # we by definition only have the 0th axis

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1441 def _getitem_tuple(self, tup: Tuple):
   1442
-> 1443 self._has_valid_tuple(tup)
   1444 try:
   1445 return self._getitem_lowerdim(tup)

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    700 raise IndexingError("Too many indexers")
    701 try:
--> 702 self._validate_key(k, i)
    703 except ValueError as err:
    704 raise ValueError(

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1350 return
   1351 elif is_integer(key):
-> 1352 self._validate_integer(key, axis)
   1353 elif isinstance(key, tuple):
   1354 # a tuple should already have been caught by this point

/miniconda/envs/automl_py37/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis)
   1435 len_axis = len(self.obj._get_axis(axis))
   1436 if key >= len_axis or key < -len_axis:
-> 1437 raise IndexError("single positional indexer is out-of-bounds")
   1438
   1439 # -------------------------------------------------------------------

IndexError: single positional indexer is out-of-bounds
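
Both tracebacks stop at `preddf.iloc[:,class_of_interest]` with `class_of_interest = 1`. pandas raises this IndexError when the integer key is at least the axis length, so in these runs `preddf` has at most one column. The sketch below reproduces the failure on toy data and shows one possible guard. It rests on an assumption the tracebacks do not confirm: that the model returns a single column holding the positive-class score. The helper name `select_class_column` is hypothetical and not part of BioAutoMATED; printing `preddf.shape` in the real run would confirm or rule out the assumption.

```python
import numpy as np
import pandas as pd


def select_class_column(preds, class_of_interest):
    """Pick the prediction column for class_of_interest, with a fallback.

    Hypothetical helper for illustration; not a BioAutoMATED function.
    """
    preddf = pd.DataFrame(preds)
    n_cols = preddf.shape[1]
    if 0 <= class_of_interest < n_cols:
        # Normal case: one column per class.
        return preddf.iloc[:, class_of_interest]
    if n_cols == 1:
        # ASSUMPTION: a single column is the score for the positive class.
        return preddf.iloc[:, 0]
    raise IndexError(
        f"class_of_interest={class_of_interest} is out of bounds "
        f"for predictions with {n_cols} column(s)"
    )


# Toy single-column predictions, standing in for the model output.
single_col = np.array([[0.2], [0.9]])

# Reproduce the reported failure: column 1 does not exist.
try:
    pd.DataFrame(single_col).iloc[:, 1]
    reproduced = False
except IndexError as err:
    reproduced = True
    print("IndexError:", err)

# The guarded version falls back to column 0 instead of failing.
y_pred = select_class_column(single_col, class_of_interest=1)
print(list(y_pred))
```

If the single-column assumption holds, setting `class_of_interest = 0` in the original call may be the simpler change, though that depends on how the model was trained and is not verified here.
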