vocalpy / crowsetta

A tool to work with any format for annotating vocalizations
https://crowsetta.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
49 stars 3 forks source link

refactor phn fixtures + unit tests to remove duplication (used to be required because of filenames) #107

Closed NickleDave closed 2 years ago

NickleDave commented 2 years ago

right now the tests and the data for the 'phn' format from the TIMIT dataset are duplicated

there's a good reason for the data to be duplicated: the .wav files from one version use a NIST format that can be parsed by soundfile but not by scipy.wavefile, as described in this README in the test data: https://github.com/NickleDave/crowsetta/blob/main/tests/data_for_tests/audio_WAV_annot_PHN/README.md

so we want to make sure we can parse both

but there's not a very good reason to duplicate the unit tests, literally the only thing that's different is the fixture we pass in

this is one of those cases where it would be convenient to be able to use fixtures to parametrize tests

there's a workaround here: https://docs.pytest.org/en/6.2.x/proposals/parametrize_with_fixtures.html but I have read this multiple times I am in this situation and still am not sure whether it actually works

and I end up just creating a factory function that accepts e.g. the extension as an argument (e.g., list_of_phns(format='PHN')) which works fine

NickleDave commented 2 years ago

This is also an issue on Mac, I just realized :frowning_face:
So basically trying to contribute blows up in your face when you clone the repo, if you're on any system besides linux.
Time to fix.

I do think it makes sense to just distinguish them "semantically"

NickleDave commented 2 years ago

So the original issue was about duplicating the tests.

Renaming fixes the clobbering but we still have duplicated tests, and we're not really using the different audio file formats even, as far as I can tell

NickleDave commented 2 years ago

Correction: we definitely use the different audio file formats; the .phn format reports onsets and offsets in terms of sample number (i.e., index), and we convert to seconds using the sampling rate from the file

NickleDave commented 2 years ago

I think the 'phn' format should have the same logic as 'birdsong-recognition-dataset', where we provide the option to specify a location where the .wav files can be found, but default to None, and in the case of that default, we look in the parent of the .phn path, i.e. we assume the file has the same name but with .ext .wav or .WAV instead of .phn

There's an extra step here because we have to check for either .wav or .WAV, but overall it's the same idea

Will make a separate issue for this

NickleDave commented 2 years ago

I can remove duplication of the tests by globbing all the PHN files together, and letting the annotation format itself check for the .wav as I just described in the previous comment

This is best done by having lists of paths declared outside of fixtures so that the globbing can happen programatically. This makes it possible later to assign subsets of the constants to parametrized fixtures, e.g. if I only want to check 'wav-nist' format PHNs.

I'm already in the middle of refactoring fixtures this way in the version-4.0 branch.

So I'm going to leave this open but make it an issue for that branch

NickleDave commented 2 years ago

Fixed in #161