vocalpy / crowsetta

A tool to work with any format for annotating vocalizations
https://crowsetta.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

BUG: function `crowsetta.formats.seq.generic.csv2annot` does not use validated DataFrame #257

Closed NickleDave closed 1 year ago

NickleDave commented 1 year ago


Describe the bug

I'm seeing a bug downstream in vak: we convert annotations to generic-seq, then load from the converted file, and all the labels come back as integers, even though validation with the Pandera schema should coerce them to strings.

This causes `vak prep` to skip every file in the dataset, since none of them appear to have labels in the labelset. vak then throws a confusing error, because no files are left after filtering by labelset.
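To illustrate why integer labels make every file get skipped, here is a minimal sketch of a labelset filter. This is not vak's actual code: the helper `all_labels_in_labelset`, the file names, and the label values are all hypothetical. It only assumes that the labelset is a set of strings and that a file is kept when all of its labels are in that set.

```python
def all_labels_in_labelset(labels, labelset):
    """Hypothetical filter: True only if every label is a member of labelset."""
    return set(labels).issubset(labelset)


# The labelset is a set of strings, even when the labels look like numbers.
labelset = {"1", "2", "3"}

# Labels as loaded with the bug: integers instead of strings (made-up data).
files = {
    "bird1.csv": [1, 2, 3],
    "bird2.csv": [2, 3, 1],
}

# The integer 1 is not equal to the string "1", so no file passes the filter.
kept = [name for name, labels in files.items()
        if all_labels_in_labelset(labels, labelset)]

# Once the labels are strings, the same files pass.
kept_fixed = [name for name, labels in files.items()
              if all_labels_in_labelset(map(str, labels), labelset)]

print(kept)
print(kept_fixed)
```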

Expected behavior

The labels for generic-seq should always be loaded as strings.

Additional context

The bug happens here, where we fail to re-assign the variable `df` to the validated / coerced DataFrame returned by Pandera:
https://github.com/vocalpy/crowsetta/blob/70295c7b19025657b5d5a3d5521ee278af205fb2/src/crowsetta/formats/seq/generic.py#L230

We should instead do the same thing we do in other format classes, that is, re-assign the variable:
https://github.com/vocalpy/crowsetta/blob/70295c7b19025657b5d5a3d5521ee278af205fb2/src/crowsetta/formats/seq/simple.py#L176