ocean-data-factory-sweden / kso

Notebooks to upload/download marine footage, connect to a citizen science project, train machine learning models and publish marine biological observations.
GNU General Public License v3.0

Tutorial 1 issue with movie data import #291

Closed: Bergylta closed this issue 9 months ago

Bergylta commented 9 months ago

🐛 Bug

Calling pp.get_movie_info() in Tutorial 1 fails with a TypeError while importing the movie data.

To Reproduce (REQUIRED)

In KSO-dev, with Project: Koster Seafloor Observatory.

Input:

pp.get_movie_info()

Output:

TypeError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 pp.get_movie_info()

File /usr/src/app/kso-dev/kso_utils/project.py:257, in ProjectProcessor.get_movie_info(self)
    247 def get_movie_info(self):
    248     """
    249     This function checks what movies from the movies csv are available and returns
    250     three df with those available in folder/server and movies.csv, only available
    251     in movies.csv and only available in folder/server
    252     """
    253     (
    254         self.available_movies_df,
    255         self.no_available_movies_df,
    256         self.no_info_movies_df,
--> 257     ) = movie_utils.retrieve_movie_info_from_server(
    258         project=self.project,
    259         db_connection=self.db_connection,
    260         server_connection=self.server_connection,
    261     )
    263     logging.info("Information of available movies has been retrieved")

File /usr/src/app/kso-dev/kso_utils/movie_utils.py:190, in retrieve_movie_info_from_server(project, db_connection, server_connection)
    187                 return s
    188         return None
--> 190     movies_df["fpath"] = movies_df["fpath"].apply(
    191         lambda x: get_match(x, mov_folder_df["fpath"].unique()),
    192         1,
    193     )
    195 # Merge the server path to the filepath
    196 all_movies_df = movies_df.merge(
    197     mov_folder_df,
    198     on=["fpath"],
    199     how="outer",
    200     indicator=True,
    201 )

File /usr/local/lib/python3.8/dist-packages/pandas/core/series.py:4430, in Series.apply(self, func, convert_dtype, args, **kwargs)
   4320 def apply(
   4321     self,
   4322     func: AggFuncType,
   (...)
   4325     **kwargs,
   4326 ) -> DataFrame | Series:
   4327     """
   4328     Invoke function on values of Series.
   4329 
   (...)
   4428     dtype: float64
   4429     """
-> 4430     return SeriesApply(self, func, convert_dtype, args, kwargs).apply()

File /usr/local/lib/python3.8/dist-packages/pandas/core/apply.py:1082, in SeriesApply.apply(self)
   1078 if isinstance(self.f, str):
   1079     # if we are a string, try to dispatch
   1080     return self.apply_str()
-> 1082 return self.apply_standard()

File /usr/local/lib/python3.8/dist-packages/pandas/core/apply.py:1137, in SeriesApply.apply_standard(self)
   1131         values = obj.astype(object)._values
   1132         # error: Argument 2 to "map_infer" has incompatible type
   1133         # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
   1134         # Dict[Hashable, Union[Union[Callable[..., Any], str],
   1135         # List[Union[Callable[..., Any], str]]]]]"; expected
   1136         # "Callable[[Any], Any]"
-> 1137         mapped = lib.map_infer(
   1138             values,
   1139             f,  # type: ignore[arg-type]
   1140             convert=self.convert_dtype,
   1141         )
   1143 if len(mapped) and isinstance(mapped[0], ABCSeries):
   1144     # GH#43986 Need to do list(mapped) in order to get treated as nested
   1145     #  See also GH#25959 regarding EA support
   1146     return obj._constructor_expanddim(list(mapped), index=obj.index)

File /usr/local/lib/python3.8/dist-packages/pandas/_libs/lib.pyx:2870, in pandas._libs.lib.map_infer()

File /usr/src/app/kso-dev/kso_utils/movie_utils.py:191, in retrieve_movie_info_from_server.<locals>.<lambda>(x)
    187                 return s
    188         return None
    190     movies_df["fpath"] = movies_df["fpath"].apply(
--> 191         lambda x: get_match(x, mov_folder_df["fpath"].unique()),
    192         1,
    193     )
    195 # Merge the server path to the filepath
    196 all_movies_df = movies_df.merge(
    197     mov_folder_df,
    198     on=["fpath"],
    199     how="outer",
    200     indicator=True,
    201 )

File /usr/src/app/kso-dev/kso_utils/movie_utils.py:183, in retrieve_movie_info_from_server.<locals>.get_match(string, string_options)
    182 def get_match(string, string_options):
--> 183     normalized_string = unicodedata.normalize("NFC", string)
    184     for s in string_options:
    185         normalized_s = unicodedata.normalize("NFC", s)

TypeError: normalize() argument 2 must be str, not None
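
The last frame shows that get_match received None instead of a string, and unicodedata.normalize only accepts str. The sketch below reproduces that error and shows one possible None-tolerant variant. The name get_match_safe is hypothetical and this is not the project's fix. The comparison line of the original helper is not visible in the traceback, so the substring test used here is an assumption for illustration only.

```python
import unicodedata


def get_match_safe(string, string_options):
    """Hypothetical None-tolerant variant of get_match (not the project's code)."""
    # Missing values (None, NaN, other non-strings) cannot be normalized,
    # so report "no match" instead of raising a TypeError.
    if not isinstance(string, str):
        return None
    normalized_string = unicodedata.normalize("NFC", string)
    for s in string_options:
        if not isinstance(s, str):
            continue
        normalized_s = unicodedata.normalize("NFC", s)
        # Assumption: the real comparison is not shown in the traceback;
        # a substring test is used here purely as an example.
        if normalized_string in normalized_s:
            return s
    return None


# Reproduce the failure from the traceback: a None value reaches normalize().
try:
    unicodedata.normalize("NFC", None)
except TypeError as err:
    error_message = str(err)
else:
    error_message = ""

print(error_message)
print(get_match_safe(None, ["/movies/movie_1.mp4"]))
print(get_match_safe("movie_1.mp4", ["/movies/movie_1.mp4"]))
```

A guard like this only hides the symptom. The rows with a missing value would still need a filename in the csv for the movies to be matched.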

Expected behavior

Expected the movie data to be downloaded from the server.

jannesgg commented 9 months ago

Remember to add the filename to the final column of the csv file to avoid this in the future.
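
A quick way to spot such rows before running the tutorial is to scan the csv for empty cells in the final column. This is a minimal sketch using only the standard library. The sample data, its column names and the helper rows_missing_last_column are all hypothetical, and the real movies csv may be laid out differently.

```python
import csv
import io

# Hypothetical stand-in for the movies csv; the real column names may differ.
csv_text = (
    "movie_id,created_on,fpath\n"
    "1,2020-01-01,movie_1.mp4\n"
    "2,2020-01-02,\n"
)


def rows_missing_last_column(csv_file):
    """Return the final column's name and the row numbers where it is empty."""
    reader = csv.reader(csv_file)
    header = next(reader)
    missing = []
    # Row numbers are counted from 1, with the header as row 1.
    for row_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip completely blank lines
        # A short row or an empty final cell both mean the value is missing.
        if len(row) < len(header) or not row[-1].strip():
            missing.append(row_number)
    return header[-1], missing


column_name, missing_rows = rows_missing_last_column(io.StringIO(csv_text))
print(f"Rows with no value in '{column_name}': {missing_rows}")
```

To check a real file, the io.StringIO object can be replaced by a file opened with open(path, newline="").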

Bergylta commented 9 months ago

# Add information about whether the movies are available in the movie_folder

df_toreview = df.merge(
    available_movies_df[["filename","fpath","exists"]],
    on=["filename"],
    how="left",
)
print(df_toreview)
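
The snippet above depends on df and available_movies_df already existing in the session. The self-contained sketch below runs the same left merge on small made-up frames, so the data is purely illustrative. With how="left", every row of df is kept, and movies that have no entry in available_movies_df get missing values in fpath and exists.

```python
import pandas as pd

# Hypothetical sample data standing in for the frames used in the thread.
df = pd.DataFrame({"filename": ["movie_1.mp4", "movie_2.mp4"]})
available_movies_df = pd.DataFrame(
    {
        "filename": ["movie_1.mp4"],
        "fpath": ["/movies/movie_1.mp4"],
        "exists": [True],
    }
)

# Same merge as in the comment above: keep every row of df and attach
# the path and availability flag where the filename matches.
df_toreview = df.merge(
    available_movies_df[["filename", "fpath", "exists"]],
    on=["filename"],
    how="left",
)

# Rows without a match have missing values in "fpath" and "exists".
not_found = df_toreview[df_toreview["exists"].isna()]
print(df_toreview)
print(not_found["filename"].tolist())
```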