ocean-data-factory-sweden / kso

Notebooks to upload/download marine footage, connect to a citizen science project, train machine learning models and publish marine biological observations.
GNU General Public License v3.0

Issues regarding initialising the project processor and receiving the annotations from Zooniverse #398

Closed donkyjohn closed 1 month ago

donkyjohn commented 2 months ago


🐛 Bug

When Koster_Seafloor_Obs is selected as the project, initialising the project processor gives the following error: ERROR:root:UNIQUE constraint failed: photos.filename.
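A UNIQUE constraint on photos.filename fails when the same filename is inserted twice, so one thing to check is whether the photos CSV lists any filename more than once. A minimal sketch of that check with pandas, assuming the CSV has a `filename` column (the inline sample data, the `site` column and the variable names are hypothetical, not taken from the project):

```python
import io

import pandas as pd

# Hypothetical stand-in for the project's photos CSV; the real file and
# its column names may differ.
csv_text = """filename,site
IMG_001.JPG,A
IMG_002.JPG,A
IMG_001.JPG,B
"""

photos = pd.read_csv(io.StringIO(csv_text))

# keep=False flags every occurrence of a repeated filename, not just the
# second one, so all conflicting rows can be inspected side by side.
duplicates = photos[photos["filename"].duplicated(keep=False)]

print(duplicates)
```

To run it against a real file, replace `io.StringIO(csv_text)` with the path to the CSV.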

Next:

The Zooniverse classifications are retrieved (INFO:root:127 Zooniverse classifications have been retrieved from 125 subjects), but I then receive the following error:

KeyError                                  Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/.local/lib/python3.10/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'subject_type'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[8], line 2
      1 # Get the classifications that were added manually
----> 2 pp.process_zoo_classifications()

File ~/kso/kso_utils/project.py:787, in ProjectProcessor.process_zoo_classifications(self, test)
    783 workflow_checks = self.workflow_widget.checks
    785 # Retrieve a subset of the subjects from the workflows of interest and
    786 # populate the sql subjects table and flatten the classifications provided the cit. scientists
--> 787 self.processed_zoo_classifications = zoo_utils.process_zoo_classifications(
    788     project=self.project,
    789     server_connection=self.server_connection,
    790     db_connection=self.db_connection,
    791     workflow_widget_checks=workflow_checks,
    792     workflows_df=self.zoo_info["workflows"],
    793     subjects_df=self.zoo_info["subjects"],
    794     csv_paths=self.csv_paths,
    795     classifications_data=self.zoo_info["classifications"],
    796     subject_type=workflow_checks["Subject type: #0"],
    797 )

File ~/kso/kso_utils/zooniverse_utils.py:431, in process_zoo_classifications(project, server_connection, db_connection, workflow_widget_checks, workflows_df, subjects_df, csv_paths, classifications_data, subject_type)
    427     drop_table(conn=db_connection, table_name="subjects")
    429 if len(subjects_series) > 0:
    430     # Fill or re-fill subjects table
--> 431     populate_subjects(project, server_connection, db_connection, subjects_series)
    432 else:
    433     logging.error("No subjects to populate database from the workflows selected.")

File ~/kso/kso_utils/zooniverse_utils.py:1143, in populate_subjects(project, server_connection, db_connection, subjects)
   1140 # Rename columns to match the db format
   1141 subjects = subjects.rename(columns=rename_cols)
-> 1143 if hasattr(subjects["subject_type"], "columns"):
   1144     # Avoid having two subject_type columns (one from Zoo one from the db)
   1145     subjects["subject_type0"] = subjects["subject_type"].iloc[:, 0]
   1146     subjects["subject_type1"] = subjects["subject_type"].iloc[:, 1]

File ~/.local/lib/python3.10/site-packages/pandas/core/frame.py:3506, in DataFrame.__getitem__(self, key)
   3504 if self.columns.nlevels > 1:
   3505     return self._getitem_multilevel(key)
-> 3506 indexer = self.columns.get_loc(key)
   3507 if is_integer(indexer):
   3508     indexer = [indexer]

File ~/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     # InvalidIndexError. Otherwise we fall through and re-raise
   3627     # the TypeError.
   3628     self._check_indexing_error(key)

KeyError: 'subject_type'
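The traceback ends at `subjects["subject_type"]`, which raises KeyError when the DataFrame has no column with that name after the rename step. The `hasattr(..., "columns")` test on line 1143 only handles the opposite case, where two columns share the name and the lookup returns a DataFrame instead of a Series. A minimal sketch of both situations, using made-up data and a hypothetical helper `subject_type_kind` that is not part of kso_utils:

```python
import pandas as pd


def subject_type_kind(df: pd.DataFrame) -> str:
    """Classify how the 'subject_type' column appears in df."""
    if "subject_type" not in df.columns:
        # This is the situation that raises KeyError in the traceback.
        return "missing"
    selected = df["subject_type"]
    # With duplicate column names, pandas returns a DataFrame here.
    return "duplicated" if isinstance(selected, pd.DataFrame) else "single"


# Hypothetical subjects tables, for illustration only.
no_column = pd.DataFrame({"subject_id": [1, 2]})
one_column = pd.DataFrame({"subject_id": [1], "subject_type": ["frame"]})
two_columns = pd.DataFrame(
    [[1, "frame", "clip"]],
    columns=["subject_id", "subject_type", "subject_type"],
)

for name, df in [("no_column", no_column),
                 ("one_column", one_column),
                 ("two_columns", two_columns)]:
    print(name, "->", subject_type_kind(df))
```

Printing `subjects.columns` just before the failing line would show which of these cases applies to the real data.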

To Reproduce (REQUIRED)

The notebook was run via cloudina.


donkyjohn commented 2 months ago

Manually removing the duplicates from the CSV file doesn't help. It changes the error to ERROR:root:NOT NULL constraint failed: photos.filename when I try to initialise the processor. The files are stored in Zooniverse with a .JPG extension. Could that be causing the issue?
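Both messages are standard SQLite integrity errors, so they can be reproduced outside the notebook. The sketch below uses an in-memory database with a hypothetical minimal `photos` table (the real schema has more columns and may differ). It suggests two things: a NOT NULL failure points to a row that reaches the database without a filename, for example a row left with an empty filename cell after editing the CSV, and with SQLite's default case-sensitive comparison a .JPG extension by itself does not collide with .jpg.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; only the constraints on filename matter here.
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, filename TEXT NOT NULL UNIQUE)"
)


def try_insert(filename):
    """Insert one filename; return the SQLite error message, or None on success."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO photos (filename) VALUES (?)", (filename,))
        return None
    except sqlite3.IntegrityError as err:
        return str(err)


first = try_insert("IMG_001.JPG")    # succeeds
repeat = try_insert("IMG_001.JPG")   # same name again: UNIQUE fails
lower = try_insert("IMG_001.jpg")    # differs only in case: accepted
empty = try_insert(None)             # no filename: NOT NULL fails

print(first, repeat, lower, empty, sep="\n")
```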

jannesgg commented 2 months ago

@donkyjohn Please check whether the latest changes to dev address these issues.