ocean-data-factory-sweden / kso

Notebooks to upload/download marine footage, connect to a citizen science project, train machine learning models and publish marine biological observations.
GNU General Public License v3.0

KeyError: 'subject_type' in pp.process_zoo_classifications() #405

Closed by donkyjohn 1 month ago

donkyjohn commented 1 month ago

🐛 Bug

I'm trying to rerun notebook 8 so that I can reselect the test size for training a new model and reselect which classes are used. I get the following error when I try to process the Zooniverse classifications:

To Reproduce (REQUIRED)

Input:

Notebook 8, project KSO

Output:

INFO:root:127 Zooniverse classifications have been retrieved from 125 subjects
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'subject_type'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[8], line 1
----> 1 pp.process_zoo_classifications()

File ~/.local/lib/python3.10/site-packages/kso_utils/project.py:787, in ProjectProcessor.process_zoo_classifications(self, test)
    783     workflow_checks = self.workflow_widget.checks
    785 # Retrieve a subset of the subjects from the workflows of interest and
    786 # populate the sql subjects table and flatten the classifications provided the cit. scientists
--> 787 self.processed_zoo_classifications = zoo_utils.process_zoo_classifications(
    788     project=self.project,
    789     server_connection=self.server_connection,
    790     db_connection=self.db_connection,
    791     workflow_widget_checks=workflow_checks,
    792     workflows_df=self.zoo_info["workflows"],
    793     subjects_df=self.zoo_info["subjects"],
    794     csv_paths=self.csv_paths,
    795     classifications_data=self.zoo_info["classifications"],
    796     subject_type=workflow_checks["Subject type: #0"],
    797 )

File ~/.local/lib/python3.10/site-packages/kso_utils/zooniverse_utils.py:431, in process_zoo_classifications(project, server_connection, db_connection, workflow_widget_checks, workflows_df, subjects_df, csv_paths, classifications_data, subject_type)
    427 drop_table(conn=db_connection, table_name="subjects")
    429 if len(subjects_series) > 0:
    430     # Fill or re-fill subjects table
--> 431     populate_subjects(project, server_connection, db_connection, subjects_series)
    432 else:
    433     logging.error("No subjects to populate database from the workflows selected.")

File ~/.local/lib/python3.10/site-packages/kso_utils/zooniverse_utils.py:1143, in populate_subjects(project, server_connection, db_connection, subjects)
   1140 # Rename columns to match the db format
   1141 subjects = subjects.rename(columns=rename_cols)
-> 1143 if hasattr(subjects["subject_type"], "columns"):
   1144     # Avoid having two subject_type columns (one from Zoo one from the db)
   1145     subjects["subject_type0"] = subjects["subject_type"].iloc[:, 0]
   1146     subjects["subject_type1"] = subjects["subject_type"].iloc[:, 1]

File ~/.local/lib/python3.10/site-packages/pandas/core/frame.py:3506, in DataFrame.__getitem__(self, key)
   3504 if self.columns.nlevels > 1:
   3505     return self._getitem_multilevel(key)
-> 3506 indexer = self.columns.get_loc(key)
   3507 if is_integer(indexer):
   3508     indexer = [indexer]

File ~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'subject_type'
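
The last `kso_utils` frame in the traceback indexes `subjects["subject_type"]` directly, so pandas raises `KeyError` when the renamed table has no column by that name. Below is a minimal sketch of that failure and of an explicit guard. The sample DataFrame and the helper name `split_duplicate_subject_type` are hypothetical and not part of `kso_utils`, and the root cause of this particular report was never confirmed.

```python
import pandas as pd

# Hypothetical stand-in for the renamed subjects table; the real columns
# produced by kso_utils are not shown in this report.
subjects = pd.DataFrame({"subject_id": [1, 2], "filename": ["a.mp4", "b.mp4"]})

# Indexing a missing column reproduces the same exception class as the traceback.
try:
    subjects["subject_type"]
except KeyError as err:
    print(f"KeyError: {err}")


def split_duplicate_subject_type(df: pd.DataFrame) -> pd.DataFrame:
    """Guarded variant of the duplicate-column check (illustrative sketch only)."""
    if "subject_type" not in df.columns:
        # Fail early with a message that lists the columns actually present.
        raise ValueError(
            f"'subject_type' column missing; available columns: {list(df.columns)}"
        )
    selected = df["subject_type"]
    if hasattr(selected, "columns"):
        # Two columns share the name, so the lookup returned a DataFrame.
        df = df.copy()
        df["subject_type0"] = selected.iloc[:, 0]
        df["subject_type1"] = selected.iloc[:, 1]
    return df
```

Checking membership in `df.columns` before indexing turns the opaque `KeyError` into a message that names the columns that are present, which makes it easier to see whether the rename step or the upstream subject export dropped the field.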

jannesgg commented 1 month ago

@donkyjohn Not able to replicate this issue. Is it still happening after the latest git pull?

donkyjohn commented 1 month ago

It has been solved. Thanks for your time!


— Reply to this email directly, view it on GitHubhttps://github.com/ocean-data-factory-sweden/kso/issues/405#issuecomment-2114965776, or unsubscribehttps://github.com/notifications/unsubscribe-auth/BEEWU46HXLSRPPR4N733BELZCSJPBAVCNFSM6AAAAABHX3WSWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJUHE3DKNZXGY. You are receiving this because you were mentioned.Message ID: @.***>