Closed yardencsGitHub closed 4 years ago
Ah, sorry about the confusion, those conversion scripts work for [TRAIN]
but not for [LEARNCURVE]
.ini files.
I think some of the options here might be from an earlier version, too.
I rewrote your .ini file using the script in my head 🙄 😏
Can you try and let me know if vak prep
works with it?
I'm actually not sure if labelset
will parse right with integers--it only just now dawned on me that might be an issue. Hopefully you will find the other options in the [PREP]
section are now more intuitive.
Also not sure if vak
will pick up the yarden
format for crowsetta
but it should if you pip install
ed both the TweetyNet
package and vak
into your environment.
just spotted a bug :( that I fixed in #100
I need to get a new version on pip
but if you're working with an install from source like in #91 then you should be able to git pull
the fix.
Yes. I work from a cloned version of vak. If I 'git pull' do I also need to 'pip install -e .' again? (I'm not sure if the code is used as-is)
The new .toml file worked (I also needed to add the output_dir to PREP but that was a quick fix)
vak prep did not create a directory for the output_dir (it crashed) so I created it myself.
Perhaps 'vak prep' should create directories if they are missing?
Update, prep now stops with the following error:
(tweetyenv_alpha) D:\YardenBirds\llb3_spring>vak prep config-llb3-04232019.toml
Traceback (most recent call last):
File "C:\Anaconda3\envs\tweetyenv_alpha\Scripts\vak-script.py", line 11, in
I'll try to debug later but don't have time now
You shouldn't have to pip install
again, the -e
is for editable, so when git
"edits" the files (by applying the commits), they are "installed" automatically
Agreed that it could be helpful for vak prep
to make the output_dir
if it doesn't exist -- let me think if there's any downside to that
Not sure why recursive_stem
breaks, tho -- maybe because it's a Windows Path?
Was able to reproduce this bug on Ubuntu, so it's not a path issue
Must be some mistake with my logic in vak/io/dataframe
-- digging into it now
@yardencsGitHub it's not a bug, although the current error message you got is not very useful.
The short answer is that it should work if you change all your file names to end in .wav.mat
instead of just .mat
Let me try to explain why that is, how it happens, and then we can talk about possible changes / fixes:
vak prep my_config.toml
vak.prep
calls vak.io.dataframe.from_files
prep
will then save the returned DataFrame as a csvvak.io.dataframe.from_files
figures out that you are giving it a single annotation file and a bunch of spectrograms in .mat
filesspect.to_dataframe
function, which is perhaps poorly named: this is the function that takes the final ingredients required to make a single 'vocalization dataset" for training a net--mainly, spect files and annotation--and organizes them into a DataFrame
vak.io.spect.to_dataframe
has to figure out which spectrogram files go with which annotation files
audio_file
attribute of each Annotation
instance, that Crowsetta
makes by running the yarden2annot
function on your annotation.mat filesvak.io.spect.to_dataframe
matches each annotation with an audio or spectrogram file by assuming that either the audio or spectrogram file contains the name of the original audio fileAnnotation
s with the audio / spectrogram files, then, is by calling the source_annot_map
from vak.util.annotation
, which calls a helper function, vak.util.annotation._recursive_stem
. This function removes the extension from a filename, leaving just the stem ( "stem" is the term that Python's pathlib
uses to refer to the part of the filename before the extension). If this stem ends in the name of a valid audio format, e.g., wav
, the function returns it. recursive_stem
function raises an error. It can't find an audio file name in your .mat file names that it can pair with one of the Annotation
s. More precisely, it fails before it even looks for a matching audio file, because it can't find a valid audio format extension in the file name after "stem"ming itThis was the best way I could come up with to figure out which audio / spectrogram file goes with which annotation.
Obviously, it forces the user to adopt a certain naming format for their spectrogram files.
But what would the alternative be?
Asking the user to provide their own list of spectrogram file-annotation file pairs?
You can actually do this if you call vak.io.spect.to_dataframe
as a function, by passing it a spect_annot_map
(a Python dict where the keys are spectrogram files and the values are annotation files).
But the idea is that would be for a user who really needs to do something outside the defaults and just can't live with our file naming convention.
This way we keep the command-line interface as simple as possible and don't have to maintain any extra bells and whistles.
As soon as someone needs those extra bells and whistles, we assume they have a certain level of literacy and can write a Pythons script to make their Dataframe / csv.
So again, short answer, please try it with your files renamed (obviously, please make a backup -- you have them on Drive, don't forget; that's how I tested) and happy to hear alternatives but this is the currently not very well documented but intended behavior for now
Changing file names to .wav.mat allowed vak prep to continue.
(tweetyenv_alpha) D:\YardenBirds\llb3_spring>vak prep config-llb3-04232019.toml
[########################################] | 100% Completed | 25.7s
Traceback (most recent call last):
File "C:\Anaconda3\envs\tweetyenv_alpha\Scripts\vak-script.py", line 11, in
I debugged a little. The problem is that calling 'scribe.from_file(annot_file=annot_file)' returns an array. So it doesn't have the field '.seq' but scribe.from_file(annot_file=annot_file)[0] does.
@NickleDave, please check
I have checked, that's a bug. Thank you.
I think I will fix by adding a utility function that gets labels from a dataframe, and encapsulates control flow logic for handling different cases (a different annotation file for each audio file / row or a single annotation file for all rows)
I started a branch to add this function: https://github.com/NickleDave/vak/tree/add-labels-from-df
When I run it on your data I'm getting an error, but I think it's because the annotation file refers to audio files I don't have locally. The error I get is:
Traceback (most recent call last):
File "/home/art/anaconda3/envs/vak-0.3.0a-torch/bin/vak", line 11, in <module>
load_entry_point('vak', 'console_scripts', 'vak')()
File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/__main__.py", line 43, in main
config_file=args.configfile)
File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/cli.py", line 18, in cli
prep(toml_path=config_file)
File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/cli/prep.py", line 143, in prep
spect_params=cfg.spect_params)
File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/dataframe.py", line 189, in from_files
vak_df = spect.to_dataframe(**from_files_kwargs)
File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/io/spect.py", line 139, in to_dataframe
spect_annot_map = source_annot_map(spect_files, annot_list)
File "/home/art/Documents/repos/coding/birdsong/vak/src/vak/util/annotation.py", line 88, in source_annot_map
"Did not find a source file matching the following annotation: "
ValueError: Did not find a source file matching the following annotation:
Annotation(annot_file='llb3/annotation_files/llb3_annotation_Apr_2019_emily_4TF.mat', audio_file='llb3_0002_2018_04_23_14_18_03.wav', seq=<Sequence with 269 segments>). Annotation has file set to 'llb3_0002_2018_04_23_14_18_03.wav'.
I am not sure right now if this is expected / desirable behavior.
Basically source_annot_map
tries to find an audio file for every annotation, and raises an error if it can't.
This could be a problem with how the function searches for files (e.g. need to add the full path) or it could be because the file doesn't exist locally.
Bigger question is whether we want prep
to just use whatever audio/spect files it can find, or if we want it to complain when it can't find some of them. I prefer the latter because the former could fail silently, e.g. if a user points it at the wrong directory and it doesn't find any files period.
I need to stop working on this for now but if you can debug any more that would help.
You can test it yourself by running
$ git fetch origin
$ git checkout -t origin/add-labels-from-df
(To get back on master branch you execute git checkout master
; you can see a list of branches with git branch
and you can add the -r
flag to see branches on the remote)
Got it. I didn’t know you could run not from a local source. You should have all data and annotation files (in Google Drive)
On Mar 5, 2020, at 9:05 AM, David Nicholson notifications@github.com wrote:
I started a branch to add this function: https://github.com/NickleDave/vak/tree/add-labels-from-df https://github.com/NickleDave/vak/tree/add-labels-from-df When I run it on your data I'm getting an error, but I think it's because the annotation file refers to audio files I don't have locally. The error I get is:
Traceback (most recent call last):
File "/home/art/anaconda3/envs/vak-0.3.0a-torch/bin/vak", line 11, in
I need to stop working on this for now but if you can debug any more that would help.
You can test it yourself by running
$ git fetch origin $ git checkout -t origin/add-labels-from-df (To get back on master branch you execute git checkout master; you can see a list of branches with git branch and you can add the -r flag to see branches on the remote)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/NickleDave/vak/issues/99?email_source=notifications&email_token=AEEFWKKM72FV6QXMAXROFF3RF6WSDA5CNFSM4K2THJOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEN5LYKI#issuecomment-595246121, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEEFWKMREKYPF2FN2NWWKMDRF6WSDANCNFSM4K2THJOA.
Fixed by #103
I converted an older .ini to .toml following the shell script in https://github.com/yardencsGitHub/tweetynet/blob/63e4b5373589ece456b3738738c38d860a64f3b9/src/scripts/ini_to_toml.sh I'm attaching the resulting TOML file config-llb3-04232019.zip
The error message indicates issues with multiple parameters:
(tweetyenv_alpha) D:\YardenBirds\llb3_spring>vak prep config-llb3-04232019.toml Traceback (most recent call last): File "C:\Anaconda3\envs\tweetyenv_alpha\Scripts\vak-script.py", line 11, in
load_entry_point('vak', 'console_scripts', 'vak')()
File "c:\users\yarden cohen\repos\vak\src\vak__main__.py", line 43, in main
config_file=args.configfile)
File "c:\users\yarden cohen\repos\vak\src\vak\cli\cli.py", line 18, in cli
prep(toml_path=config_file)
File "c:\users\yarden cohen\repos\vak\src\vak\cli\prep.py", line 47, in prep
cfg = config.parse.from_toml(toml_path, sections=['PREP', 'SPECTROGRAM', 'DATALOADER'])
File "c:\users\yarden cohen\repos\vak\src\vak\config\parse.py", line 106, in from_toml
are_options_valid(config_toml, section_name, toml_path)
File "c:\users\yarden cohen\repos\vak\src\vak\config\validators.py", line 96, in are_options_valid
f"the following options from {section} section in "
ValueError: the following options from PREP section in the config file 'config-llb3-04232019.toml' are not valid:
{'n_syllables', 'all_labels_are_int', 'silent_gap_label', 'mat_spect_files_path', 'freq_bins', 'mat_spect_files_annotation_file', 'skip_files_with_labels_not_in_labelset'}