vak.util.split.train_test_dur_split throws error when crowsetta.Transcriber returns list

yardencsGitHub commented 4 years ago

I converted an older .ini to .toml following the shell script in https://github.com/yardencsGitHub/tweetynet/blob/63e4b5373589ece456b3738738c38d860a64f3b9/src/scripts/ini_to_toml.sh I'm attaching the resulting TOML file config-llb3-04232019.zip

The error message indicates issues with multiple parameters:

(tweetyenv_alpha) D:\YardenBirds\llb3_spring>vak prep config-llb3-04232019.toml Traceback (most recent call last): File "C:\Anaconda3\envs\tweetyenv_alpha\Scripts\vak-script.py", line 11, in load_entry_point('vak', 'console_scripts', 'vak')() File "c:\users\yarden cohen\repos\vak\src\vak__main__.py", line 43, in main config_file=args.configfile) File "c:\users\yarden cohen\repos\vak\src\vak\cli\cli.py", line 18, in cli prep(toml_path=config_file) File "c:\users\yarden cohen\repos\vak\src\vak\cli\prep.py", line 47, in prep cfg = config.parse.from_toml(toml_path, sections=['PREP', 'SPECTROGRAM', 'DATALOADER']) File "c:\users\yarden cohen\repos\vak\src\vak\config\parse.py", line 106, in from_toml are_options_valid(config_toml, section_name, toml_path) File "c:\users\yarden cohen\repos\vak\src\vak\config\validators.py", line 96, in are_options_valid f"the following options from {section} section in " ValueError: the following options from PREP section in the config file 'config-llb3-04232019.toml' are not valid: {'n_syllables', 'all_labels_are_int', 'silent_gap_label', 'mat_spect_files_path', 'freq_bins', 'mat_spect_files_annotation_file', 'skip_files_with_labels_not_in_labelset'}

NickleDave commented 4 years ago

Ah, sorry about the confusion, those conversion scripts work for [TRAIN] but not for [LEARNCURVE] .ini files.

I think some of the options here might be from an earlier version, too.

I rewrote your .ini file using the script in my head 🙄 😏

Can you try and let me know if vak prep works with it? I'm actually not sure if labelset will parse right with integers--it only just now dawned on me that might be an issue. Hopefully you will find the other options in the [PREP] section are now more intuitive.

Also not sure if vak will pick up the yarden format for crowsetta but it should if you pip installed both the TweetyNet package and vak into your environment.

config-llb3-04232019.toml.zip

NickleDave commented 4 years ago

just spotted a bug :( that I fixed in #100 I need to get a new version on pip but if you're working with an install from source like in #91 then you should be able to git pull the fix.

yardencsGitHub commented 4 years ago

Yes. I work from a cloned version of vak. If I 'git pull' do I also need to 'pip install -e .' again? (I'm not sure if the code is used as-is)

yardencsGitHub commented 4 years ago

The new .toml file worked (I also needed to add the output_dir to PREP but that was a quick fix)

vak prep did not create a directory for the output_dir (it crashed) so I created it myself.

Perhaps 'vak prep' should create directories if they are missing?

yardencsGitHub commented 4 years ago

Update, prep now stops with the following error:

(tweetyenv_alpha) D:\YardenBirds\llb3_spring>vak prep config-llb3-04232019.toml Traceback (most recent call last): File "C:\Anaconda3\envs\tweetyenv_alpha\Scripts\vak-script.py", line 11, in load_entry_point('vak', 'console_scripts', 'vak')() File "c:\users\yarden cohen\repos\vak\src\vak__main__.py", line 43, in main config_file=args.configfile) File "c:\users\yarden cohen\repos\vak\src\vak\cli\cli.py", line 18, in cli prep(toml_path=config_file) File "c:\users\yarden cohen\repos\vak\src\vak\cli\prep.py", line 143, in prep spect_params=cfg.spect_params) File "c:\users\yarden cohen\repos\vak\src\vak\io\dataframe.py", line 189, in from_files vak_df = spect.to_dataframe(**from_files_kwargs) File "c:\users\yarden cohen\repos\vak\src\vak\io\spect.py", line 139, in to_dataframe spect_annot_map = source_annot_map(spect_files, annot_list) File "c:\users\yarden cohen\repos\vak\src\vak\util\annotation.py", line 67, in source_annot_map source_files_stem = [_recursive_stem(sf) for sf in source_files_stem] File "c:\users\yarden cohen\repos\vak\src\vak\util\annotation.py", line 67, in source_files_stem = [_recursive_stem(sf) for sf in source_files_stem] File "c:\users\yarden cohen\repos\vak\src\vak\util\annotation.py", line 38, in _recursive_stem f'unable to compute stem of {path_str}' ValueError: unable to compute stem of D:\YardenBirds\llb3_spring\annotated\llb3_0002_2018_04_23_14_18_03.mat

I'll try to debug later but don't have time now

NickleDave commented 4 years ago

You shouldn't have to pip install again, the -e is for editable, so when git "edits" the files (by applying the commits), they are "installed" automatically

NickleDave commented 4 years ago

Agreed that it could be helpful for vak prep to make the output_dir if it doesn't exist -- let me think if there's any downside to that

NickleDave commented 4 years ago

Not sure why recursive_stem breaks, tho -- maybe because it's a Windows Path?

NickleDave commented 4 years ago

Was able to reproduce this bug on Ubuntu, so it's not a path issue

Must be some mistake with my logic in vak/io/dataframe -- digging into it now

NickleDave commented 4 years ago

@yardencsGitHub it's not a bug, although the current error message you got is not very useful.

The short answer is that it should work if you change all your file names to end in .wav.mat instead of just .mat

Let me try to explain why that is, how it happens, and then we can talk about possible changes / fixes:

You call vak prep my_config.toml
then vak.prep calls vak.io.dataframe.from_files
- this is the function that makes a Pandas DataFrame representing the "vocalization dataset" -- replacing the hard-to-maintain homemade data structure we had before
- prep will then save the returned DataFrame as a csv
vak.io.dataframe.from_files figures out that you are giving it a single annotation file and a bunch of spectrograms in .mat files
so it doesn't have to make any spectrograms, and instead it goes directly to calling the spect.to_dataframefunction, which is perhaps poorly named: this is the function that takes the final ingredients required to make a single 'vocalization dataset" for training a net--mainly, spect files and annotation--and organizes them into a DataFrame
to do this, vak.io.spect.to_dataframe has to figure out which spectrogram files go with which annotation files
- right now, the way it does that is by using the audio_file attribute of each Annotation instance, that Crowsetta makes by running the yarden2annot function on your annotation.mat files
- in other words vak.io.spect.to_dataframe matches each annotation with an audio or spectrogram file by assuming that either the audio or spectrogram file contains the name of the original audio file
- The way it matches the Annotations with the audio / spectrogram files, then, is by calling the source_annot_map from vak.util.annotation, which calls a helper function, vak.util.annotation._recursive_stem. This function removes the extension from a filename, leaving just the stem ( "stem" is the term that Python's pathlib uses to refer to the part of the filename before the extension). If this stem ends in the name of a valid audio format, e.g., wav, the function returns it.
- This is why the recursive_stem function raises an error. It can't find an audio file name in your .mat file names that it can pair with one of the Annotations. More precisely, it fails before it even looks for a matching audio file, because it can't find a valid audio format extension in the file name after "stem"ming it

This was the best way I could come up with to figure out which audio / spectrogram file goes with which annotation. Obviously, it forces the user to adopt a certain naming format for their spectrogram files. But what would the alternative be? Asking the user to provide their own list of spectrogram file-annotation file pairs? You can actually do this if you call vak.io.spect.to_dataframe as a function, by passing it a spect_annot_map (a Python dict where the keys are spectrogram files and the values are annotation files). But the idea is that would be for a user who really needs to do something outside the defaults and just can't live with our file naming convention. This way we keep the command-line interface as simple as possible and don't have to maintain any extra bells and whistles. As soon as someone needs those extra bells and whistles, we assume they have a certain level of literacy and can write a Pythons script to make their Dataframe / csv.

So again, short answer, please try it with your files renamed (obviously, please make a backup -- you have them on Drive, don't forget; that's how I tested) and happy to hear alternatives but this is the currently not very well documented but intended behavior for now

yardencsGitHub commented 4 years ago

Changing file names to .wav.mat allowed vak prep to continue.

(tweetyenv_alpha) D:\YardenBirds\llb3_spring>vak prep config-llb3-04232019.toml [########################################] | 100% Completed | 25.7s Traceback (most recent call last): File "C:\Anaconda3\envs\tweetyenv_alpha\Scripts\vak-script.py", line 11, in load_entry_point('vak', 'console_scripts', 'vak')() File "c:\users\yarden cohen\repos\vak\src\vak__main__.py", line 43, in main config_file=args.configfile) File "c:\users\yarden cohen\repos\vak\src\vak\cli\cli.py", line 18, in cli prep(toml_path=config_file) File "c:\users\yarden cohen\repos\vak\src\vak\cli\prep.py", line 152, in prep test_dur=cfg.prep.test_dur) File "c:\users\yarden cohen\repos\vak\src\vak\util\split.py", line 139, in train_test_dur_split for annot_file in vak_df['annot_path'].values] File "c:\users\yarden cohen\repos\vak\src\vak\util\split.py", line 139, in for annot_file in vak_df['annot_path'].values] AttributeError: 'list' object has no attribute 'seq'

yardencsGitHub commented 4 years ago

I debugged a little. The problem is that calling 'scribe.from_file(annot_file=annot_file)' returns an array. So it doesn't have the field '.seq' but scribe.from_file(annot_file=annot_file)[0] does.

@NickleDave, please check

NickleDave commented 4 years ago

I have checked, that's a bug. Thank you.

I think I will fix by adding a utility function that gets labels from a dataframe, and encapsulates control flow logic for handling different cases (a different annotation file for each audio file / row or a single annotation file for all rows)

NickleDave commented 4 years ago

I started a branch to add this function: https://github.com/NickleDave/vak/tree/add-labels-from-df

When I run it on your data I'm getting an error, but I think it's because the annotation file refers to audio files I don't have locally. The error I get is:

Traceback (most recent call last):
  File "/home/art/anaconda3/envs/vak-0.3.0a-torch/bin/vak", line 11, in <module>
    load_entry_point('vak', 'console_scripts', 'vak')()
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/__main__.py", line 43, in main
    config_file=args.configfile)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/cli.py", line 18, in cli
    prep(toml_path=config_file)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/prep.py", line 143, in prep
    spect_params=cfg.spect_params)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/dataframe.py", line 189, in from_files
    vak_df = spect.to_dataframe(**from_files_kwargs)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/spect.py", line 139, in to_dataframe
    spect_annot_map = source_annot_map(spect_files, annot_list)
  File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/util/annotation.py", line 88, in source_annot_map
    "Did not find a source file matching the following annotation: "
ValueError: Did not find a source file matching the following annotation: 
Annotation(annot_file='llb3/annotation_files/llb3_annotation_Apr_2019_emily_4TF.mat', audio_file='llb3_0002_2018_04_23_14_18_03.wav', seq=<Sequence with 269 segments>). Annotation has file set to 'llb3_0002_2018_04_23_14_18_03.wav'.

I am not sure right now if this is expected / desirable behavior. Basically source_annot_map tries to find an audio file for every annotation, and raises an error if it can't. This could be a problem with how the function searches for files (e.g. need to add the full path) or it could be because the file doesn't exist locally. Bigger question is whether we want prep to just use whatever audio/spect files it can find, or if we want it to complain when it can't find some of them. I prefer the latter because the former could fail silently, e.g. if a user points it at the wrong directory and it doesn't find any files period.

I need to stop working on this for now but if you can debug any more that would help.

You can test it yourself by running

$ git fetch origin
$ git checkout -t origin/add-labels-from-df

(To get back on master branch you execute git checkout master; you can see a list of branches with git branch and you can add the -r flag to see branches on the remote)

yardencsGitHub commented 4 years ago

Got it. I didn’t know you could run not from a local source. You should have all data and annotation files (in Google Drive)

On Mar 5, 2020, at 9:05 AM, David Nicholson notifications@github.com wrote:

I started a branch to add this function: https://github.com/NickleDave/vak/tree/add-labels-from-df https://github.com/NickleDave/vak/tree/add-labels-from-df When I run it on your data I'm getting an error, but I think it's because the annotation file refers to audio files I don't have locally. The error I get is:

Traceback (most recent call last): File "/home/art/anaconda3/envs/vak-0.3.0a-torch/bin/vak", line 11, in load_entry_point('vak', 'console_scripts', 'vak')() File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/main.py", line 43, in main config_file=args.configfile) File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/cli.py", line 18, in cli prep(toml_path=config_file) File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/prep.py", line 143, in prep spect_params=cfg.spect_params) File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/dataframe.py", line 189, in from_files vak_df = spect.to_dataframe(**from_files_kwargs) File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/spect.py", line 139, in to_dataframe spect_annot_map = source_annot_map(spect_files, annot_list) File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/util/annotation.py", line 88, in source_annot_map "Did not find a source file matching the following annotation: " ValueError: Did not find a source file matching the following annotation: Annotation(annot_file='llb3/annotation_files/llb3_annotation_Apr_2019_emily_4TF.mat', audio_file='llb3_0002_2018_04_23_14_18_03.wav', seq=<Sequence with 269 segments>). Annotation has file set to 'llb3_0002_2018_04_23_14_18_03.wav'. I am not sure right now if this is expected / desirable behavior. Basically source_annot_map tries to find an audio file for every annotation, and raises an error if it can't. This could be a problem with how the function searches for files (e.g. need to add the full path) or it could be because the file doesn't exist locally.

I need to stop working on this for now but if you can debug any more that would help.

You can test it yourself by running

$ git fetch origin $ git checkout -t origin/add-labels-from-df (To get back on master branch you execute git checkout master; you can see a list of branches with git branch and you can add the -r flag to see branches on the remote)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/NickleDave/vak/issues/99?email_source=notifications&email_token=AEEFWKKM72FV6QXMAXROFF3RF6WSDA5CNFSM4K2THJOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEN5LYKI#issuecomment-595246121, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEEFWKMREKYPF2FN2NWWKMDRF6WSDANCNFSM4K2THJOA.

NickleDave commented 4 years ago

Fixed by #103

vocalpy / vak

vak.util.split.train_test_dur_split throws error when crowsetta.Transcriber returns list #99