Hi Darren,
In brief: The recommended way to go about this is to put a bunch of wav files with associated annotation files (filename1.wav + filename1_annotations.csv, ...) into a folder and then use the GUI via the "DAS/Make dataset for training" menu to prepare the data for training. This will split the data into train/val/test sets, concatenate files, create the one-hot-encoded training targets, and result in the data structure described here.
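As a rough sketch, preparing such a folder with a small script could look like the following (the annotation column names shown are illustrative, not the definitive schema - check the data formats page for the exact layout):

```python
from pathlib import Path

import pandas as pd

data_dir = Path("my_recordings")  # the folder you will point the GUI at
data_dir.mkdir(exist_ok=True)

# Every wav file should have a matching <name>_annotations.csv next to it.
for wav in sorted(data_dir.glob("*.wav")):
    csv = wav.with_name(wav.stem + "_annotations.csv")
    if not csv.exists():
        print(f"missing annotations for {wav.name}")

# Writing one annotation file by hand; the column names are illustrative.
annotations = pd.DataFrame({
    "name": ["pulse", "sine"],        # song type for each annotated interval
    "start_seconds": [1.25, 3.10],
    "stop_seconds": [1.26, 4.75],
})
annotations.to_csv(data_dir / "filename1_annotations.csv", index=False)
```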
Details:

The dataset used for training consists of `npy` files for the audio and the labels, plus some metadata. The top part of the docs, about the annotation files, is there to make it easier for users to prepare their own data as just a bunch of wav and csv files and then use the GUI for making the actual dataset. This is the recommended way for all users to ensure that everything is in the right format, but you absolutely can make the dataset yourself:

- The dataset is a folder whose name ends in `.npy`, for DAS to properly recognize the format.
- Subfolders `train`/`test`/`val` for the different data splits.
- In each split, an `x.npy` file with the audio samples (concatenated from multiple individual files) and a `y.npy` with the one-hot-encoded training targets for each sample. The targets have format [samples, number of classes] - you need to include a noise/no signal class at index 0.
- An `attrs.npy` file, which is simply a python dictionary saved using `np.save`, with the following keys:
  - `samplerate`: of x and y in Hz, as a float.
  - `class_names`: name for each class as a list of strings; the first class name should be `noise`.
  - `class_types`: type of each class as a list of strings, same length as `class_names`. Allowed string values are `event` and `segment` (in most cases, `segment` is what you want).
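Roughly, building this by hand could look like the sketch below (array shapes, class names, and where `attrs.npy` is placed are illustrative assumptions, not a definitive recipe):

```python
import numpy as np
from pathlib import Path

root = Path("my_dataset.npy")            # folder name ends in .npy
class_names = ["noise", "song"]          # noise class at index 0
class_types = ["segment", "segment"]
samplerate = 10_000.0                    # Hz, as a float

for split in ["train", "val", "test"]:
    split_dir = root / split
    split_dir.mkdir(parents=True, exist_ok=True)

    # x: audio samples (random noise here, standing in for real recordings;
    # shape [samples, channels] is an assumption - adjust to your data)
    x = np.random.randn(100_000, 1).astype(np.float32)
    # y: one-hot targets with shape [samples, number of classes]
    y = np.zeros((x.shape[0], len(class_names)), dtype=np.float32)
    y[:, 0] = 1.0                        # label everything as noise in this toy example

    np.save(split_dir / "x.npy", x)
    np.save(split_dir / "y.npy", y)

# attrs.npy: a plain python dict saved with np.save
# (placed at the top level of the dataset folder here; assumed location)
attrs = {
    "samplerate": samplerate,
    "class_names": class_names,
    "class_types": class_types,
}
np.save(root / "attrs.npy", attrs)
```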
Let me know if you have any more questions and if you run into issues.
Two questions:
1. The information in the data formats page seems to assume that the data will be contained within a single file. What if data is spread across multiple input wav files, each with a separate set of annotations? One possibility might be to concatenate the wav files into a single numpy array, but this doesn't seem satisfying, as it could lead to edge artifacts where the signals are appended. Are there other ways to manage inputs that are distributed across multiple recordings?
2. I'm a bit confused by the information describing audio/annotations/song definitions (at the top of the page) and the data structure used for training (at the bottom). How are the data and annotation files related to the data structures for training? It appears that the data structures for training assume the data is already stored in memory (e.g., the wav files are converted to numpy arrays and the labels are converted to a series of labels for each time point). Is that correct? It's not clear how one is supposed to go from the files described at the top to the data structures described at the bottom.
If the data structure assumes all of the data is a dictionary in memory, it seems that the top part about the annotation files can just be skipped if we are converting our labels into time series ourselves, right? Just create our own pre-processing and load the data into memory?
If so, we'd still need to figure out a way to deal with multiple input files, since I'm presuming that `data['train']['x']` is a single array with the entire series of audio values for all of the training data, right?
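For concreteness, the kind of concatenation I have in mind looks roughly like this (file names, the label scheme, and the channel handling are just placeholders on my end):

```python
import numpy as np
from scipy.io import wavfile

wav_files = ["rec1.wav", "rec2.wav"]     # placeholder recordings
n_classes = 2                            # noise + one song type, as a toy example

xs, ys = [], []
for path in wav_files:
    rate, audio = wavfile.read(path)
    audio = audio.astype(np.float32)
    if audio.ndim == 1:
        audio = audio[:, None]           # [samples, channels]
    labels = np.zeros((audio.shape[0], n_classes), dtype=np.float32)
    labels[:, 0] = 1.0                   # everything marked as noise in this sketch
    xs.append(audio)
    ys.append(labels)

# A single long array per split -- what I assume data['train']['x'] is.
x_train = np.concatenate(xs, axis=0)
y_train = np.concatenate(ys, axis=0)
```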