hearbenchmark / hear-eval-kit

Evaluation kit for the HEAR Benchmark
https://hearbenchmark.com
Apache License 2.0

Bug in slugifying filenames with dashes #203

Closed: jorshi closed this issue 3 years ago

jorshi commented 3 years ago

In the luigi pipeline metadata dataframe we slugify the relative path of the filename. This is broken for filenames that contain a dash character. For example, test_1_ebr_-6_nec_4_poly_1.wav in dcase slugifies to test-1-ebr-6-nec-4-poly-1, which is the same slug that test_1_ebr_6_nec_4_poly_1.wav produces, so the two files collide.
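
The collision can be reproduced with a minimal sketch. The simple_slug helper below is a hypothetical stand-in that only mimics the relevant behaviour (every run of non-alphanumeric characters collapses to a single dash); it is not the python-slugify implementation.

```python
import re
from pathlib import Path


def simple_slug(text: str) -> str:
    """Hypothetical stand-in for slugify: lowercase the text, then collapse
    every run of non-alphanumeric characters into a single dash."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


# The first file name is the dcase example from this issue; the second
# differs only by the missing minus sign.
with_dash = Path("test_1_ebr_-6_nec_4_poly_1.wav").stem
without_dash = Path("test_1_ebr_6_nec_4_poly_1.wav").stem

# "_-" is a single run of separator characters, so the minus sign is lost
# and both stems end up with the same slug.
print(simple_slug(with_dash))
print(simple_slug(without_dash))
print(simple_slug(with_dash) == simple_slug(without_dash))
```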

jorshi commented 3 years ago

A potential solution is to use the replacements argument of slugify (we would override the slugify_file_name method in ExtractMetadata for the dcase task) to replace - with negative_. See https://github.com/un33k/python-slugify

For example:

slugify(str(Path(relative_path).stem), replacements=[["-", "negative_"]])
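
A rough sketch of what such an override might do. Both simple_slug and slugify_dcase_file_name are hypothetical names used for illustration only; the real override would call python-slugify, and the exact output of its replacements argument should be checked against the installed version.

```python
import re
from pathlib import Path


def simple_slug(text: str) -> str:
    """Hypothetical stand-in for slugify: lowercase the text, then collapse
    every run of non-alphanumeric characters into a single dash."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def slugify_dcase_file_name(relative_path: str) -> str:
    """Hypothetical dcase-specific variant: spell out the minus sign as
    'negative_' before slugifying, so it survives as a word in the slug."""
    stem = Path(relative_path).stem
    return simple_slug(stem.replace("-", "negative_"))


negative = slugify_dcase_file_name("test_1_ebr_-6_nec_4_poly_1.wav")
positive = slugify_dcase_file_name("test_1_ebr_6_nec_4_poly_1.wav")

# The two files now map to different slugs.
print(negative)
print(positive)
```

Note that this only helps when a dash always means a minus sign, which is why the suggestion above limits the override to the dcase task.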

jorshi commented 3 years ago

We should also add a sanity check in the ExtractMetadata run function to make sure that all slugs are unique, i.e. something like:

assert len(process_metadata["relpath"].unique()) == len(process_metadata["slug"].unique())
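
A bare assert only says that a collision happened, not which files are involved. Below is a sketch of a check that also reports the offending paths. It uses plain dictionaries rather than the pipeline's dataframe, and find_slug_collisions and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_slug_collisions(relpath_to_slug: Dict[str, str]) -> Dict[str, List[str]]:
    """Group relative paths by slug and return only the slugs that are
    shared by more than one path."""
    by_slug = defaultdict(set)
    for relpath, slug in relpath_to_slug.items():
        by_slug[slug].add(relpath)
    return {
        slug: sorted(paths) for slug, paths in by_slug.items() if len(paths) > 1
    }


# Illustrative metadata: two files that collapse to the same slug, plus one
# unrelated file.
metadata = {
    "test_1_ebr_-6_nec_4_poly_1.wav": "test-1-ebr-6-nec-4-poly-1",
    "test_1_ebr_6_nec_4_poly_1.wav": "test-1-ebr-6-nec-4-poly-1",
    "test_2_ebr_0_nec_1_poly_0.wav": "test-2-ebr-0-nec-1-poly-0",
}

collisions = find_slug_collisions(metadata)
for slug, paths in collisions.items():
    print(f"slug {slug!r} is shared by: {', '.join(paths)}")
```

In the pipeline, a non-empty result could be turned into an error so the task fails before any colliding files are written.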