Thank you for writing up this detailed issue, @yongtang!
I'll try to connect the current API with what the use cases would be in an ML context:

- `AudioIODataset`: We have an audio stream we want to process start to end but which is so big that we do not want to load it completely into memory. I think this mostly makes sense with unidirectional RNN-based models, e.g. seq2seq event classification on audio.
- `IOTensor`: We have a big audio file from which we only want specific parts. A use case could be larger audio tracks for which we have partial labelling. So maybe the data is a 1h audio track from a presentation and we have annotations like 10:10-10:30 applause, 15:15-15:25 laughter, ... which we want to detect.
- eager decode: We have comparatively small files we process completely. Examples are common ASR corpora like LibriSpeech and Mozilla CommonVoice (samples are sentences encoded as MP3/FLAC, typically <10s). Also useful if we don't have the files in raw form, but e.g. in TFRecords or some database.

I think all of these are valid.
W.r.t. `desired_channels` and `desired_samples`, you have convinced me that we do not need to add them to the operators. Maybe we could keep them as a Python-based helper function in the audio module though. I'm a big fan of enforcing explicit shapes - few errors are more annoying than having your training crash after hours because for some reason one file in your dataset is stereo instead of mono.

For data type, why can we not simply make it `float32` normalized to [-1,+1]? This is usually what you would do anyways and also what downstream functions from `tf.signal` expect. If you need "raw" samples for whatever reason you can just scale them back up yourself.
@jjedele for eager decode: actually it should be graph or eager decode, as we could make the decode work in both graph and eager mode.

The `decode` op itself is a very useful scenario that could be coupled with other APIs to build a full continuous pipeline. For example, we could combine Kafka with audio decode, so that Kafka is used to read incoming audio data continuously and then decode:
```python
dataset = tfio.IODataset.from_kafka()
dataset = dataset.map(lambda message: tfio.audio.decode_mp3(message))
# further processing
...
```
For always making it `float32`: a semi-related issue https://github.com/tensorflow/tensorflow/issues/1763 probably explains why we want to leave the conversion to the user explicitly. Users may have different use cases, and the same applies to audio as well.
For `desired_channels` and `desired_samples`: if we truly want to pass the shape information, I would suggest passing a `shape` as an input tensor instead. Note that in tf.decode_wav, `desired_channels` and `desired_samples` are passed as attributes, which means you cannot use them in a graph data pipeline where each sample may have a different shape. Passing a `shape` tensor allows the decision to be made at graph run time, rather than at graph construction time.
We could also expose a `tfio.audio.info` API which parses the input and outputs the `shape`, `rate`, and `dtype` information. Then `shape` could be passed to `decode` in graph mode:
```python
info = tfio.audio.info(input)
shape = info["shape"]
decoded = tfio.audio.decode(input, shape=shape, dtype=None)
```
Also note that while `shape` could be passed at graph run time, `dtype` has to be passed at graph construction time. So the following will only work in eager mode:
```python
info = tfio.audio.info(input)
shape = info["shape"]
dtype = tf.as_dtype(info["dtype"].numpy())  # !!! <= numpy() only works in eager mode
decoded = tfio.audio.decode(input, shape=shape, dtype=dtype)
```
@yongtang Ah yes, with "eager" I meant decoding the whole data at once. It should obviously work in both modes.
For `dtype` you're right, adding it to the API explicitly is probably sensible if we want to cover the majority of use cases.
For `desired_channels` and `desired_samples` I think our understanding differs. What you describe is exactly what I would want: enforcing static dimensions for data which in reality might differ slightly. Imagine I have a CNN trained to classify 1s snippets of mono audio, i.e. it expects a spectrogram of fixed dimension n x m. Now in practice my data files might be 0.98s, 1.02s, ... and 5% of the files are stereo instead of mono due to some weird problem with the data collection process. Here it is super helpful to be able to enforce a static shape, if necessary by padding/cropping.
In cases where we would want to pass in a dynamic shape tensor, we probably can just leave the dimensions as unknown, as the model needs to handle them dynamically anyways. But I'm absolutely fine with making this a Python-level helper function instead of integrating it into the core operators.
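To make the padding/cropping idea concrete, here is a minimal sketch of what such a Python-level helper could look like; `enforce_shape` and its signature are hypothetical, not an existing tfio API:

```python
import tensorflow as tf

def enforce_shape(audio, desired_samples, desired_channels):
    """Pad/crop a [samples, channels] tensor to a fixed, static shape."""
    # Zero-pad the time and channel axes up to the desired sizes
    # (a no-op if the clip is already large enough).
    pad_t = tf.maximum(desired_samples - tf.shape(audio)[0], 0)
    pad_c = tf.maximum(desired_channels - tf.shape(audio)[1], 0)
    audio = tf.pad(audio, [[0, pad_t], [0, pad_c]])
    # Crop anything beyond the desired sizes, then pin the static shape.
    audio = audio[:desired_samples, :desired_channels]
    return tf.ensure_shape(audio, [desired_samples, desired_channels])
```

With this, a 0.98s mono file gets zero-padded while a stereo file gets cropped to the first channel, and downstream layers see a fully static shape.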
@jjedele Thanks. That makes a lot of sense. I think what we could do is add additional audio-processing APIs such that, after the audio is read into a tensor (in its original shape), we could easily run those processing APIs to make it conform to the input you expect?
For example:

```python
audio = tfio.audio.decode_wav(...)  # <= in an irregular shape of [100, 2], and a dtype of int16
audio = tfio.audio.normalize(audio, shape=[50, 1], dtype=tf.float32)
```

Will the above work?
@jjedele By the way, there is already a `tfio.experimental.audio.resample` which changes the sample rate. The `normalize` could very well be another API similar to `resample`.
@jjedele In the case of random access vs. sequential access, random access needs a precise sample offset, while sequential access only needs a `Next` API which could return any number of samples.

Not limiting the number of samples will be useful, because in many formats samples are grouped into packets which could contain any number of samples. `Next` could just return the next chunk of samples as naturally divided (by packet/etc).
So I think all we need is something like the below to cover all the cases discussed above:

```cpp
virtual Status Read(int64 start, int64 stop, allocator) = 0;  // Note: to read the whole content, pass start=0, stop=-1
virtual TensorShape shape() = 0;
virtual DataType dtype() = 0;
virtual int64 rate() = 0;
```
Update: I will take the `Next` part back. It would limit the case where a tf.data.Dataset pipeline has more than one thread.
@yongtang Yeah, treating this the same way as the resample operation does make perfect sense to me.
Hi @yongtang and @jjedele, I really appreciate your motivation to make tfio a great audio hub. Just a few short points about use cases for tfio:
I think the most important use case for training is the use of `AudioIOTensor` inside of tf.data pipelines. I think this already works nicely, and it would be great if more decoders (e.g. mp4) could be supported through the ffmpeg ops.
> eager decode: We have comparatively small files we process completely. Examples are common ASR corpora like LibriSpeech and Mozilla CommonVoice (samples are sentences encoded as MP3/FLAC, typically <10s). Also useful if we don't have the files in raw form, but e.g. in TFRecords or some database.
I think this is very useful and would probably be used a lot in practice. @jjedele I am not sure this would only be useful for short files. Given that decoders such as minimp3 can decode any given byte sequence (even with headers missing), it would make sense to include seeking support for in-memory decoding as well.
Imagine the following scenario: millions of medium-length audio files that should be streamed into the training pipeline in small (random) chunks. If a user used tf.data + `AudioIOTensor` he would probably get good performance, but i/o will probably be the bottleneck (especially on GCS) when the chunks to be read are small (a few seconds). Therefore audio data should be written batched into a container such as tfrecords or a database, as pointed out by @jjedele. I personally think, given the great performance and robustness of minimp3, that mp3 would be a great format for such a use case, and a `decode_mp3` should be added soon.
Now the question is where users would apply the chunking (a rough sketch of the first option follows):

- `decode_mp3`
- `decode_mp3(start...)`
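For the first option, a pipeline could decode fully and then slice, roughly like the sketch below; `tfio.audio.decode_mp3` is the op under discussion in #815 and `chunk_size` is an assumed parameter:

```python
import tensorflow as tf
import tensorflow_io as tfio

def random_chunk(contents, chunk_size=16000):
    # Decode the whole clip, then cut out a random fixed-size window.
    audio = tfio.audio.decode_mp3(contents)  # assumed: [samples, channels]
    max_start = tf.maximum(tf.shape(audio)[0] - chunk_size, 1)
    start = tf.random.uniform([], 0, max_start, dtype=tf.int32)
    return audio[start:start + chunk_size]
```

The second option would push `start` into the decoder itself and avoid decoding bytes that are thrown away.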
@faroit The tfrecord format may not be a good solution in many cases. tfrecord is a sequential serialization based on protobuf. It serves the purpose of generic serialization (as protobuf is a generic serialization that can take any input), but it has quite a few limitations: 1) there is no indexing support, so performance will be very bad unless you read it sequentially; 2) while serialization and deserialization of protobuf may not be bad compared with other generic serialization formats (e.g., thrift/avro/etc), the overhead is still there, as tfrecord is generic; 3) it is yet another storage format that is not widely used elsewhere. In organizations that are not solely about tf-based ml, the format is pretty isolated. As an example, mp3 or mp4 could be consumed not just by machine learning researchers but also by almost anyone else. This is not the case for tfrecord: you will need dedicated storage (plus maintenance with devops/sres) for a format that has only one consumer, and you end up converting back and forth so that someone else can use it.
Honestly, for audio files even the `WAV` format could probably be a better choice than tfrecord: `WAV` is a simple format with very minimal overhead - pretty much just a `memcpy` operation for 16-bit or 32-bit.

We emphasize random access because with random access there is a potential to distribute the workload across large clusters. For example, we could launch a cluster with 100 nodes and each node will only process a part of an audio file. If a file only allows sequential access, then in order to distribute the workload every node has to read the file from the beginning up to its split location.
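To illustrate, a sketch of per-node partial reads with the random-access `AudioIOTensor`; `worker_id` and `num_workers` are assumed to come from the cluster setup:

```python
import tensorflow_io as tfio

# Each node lazily opens the file and reads only its own slice;
# no node has to scan from the beginning of the file.
audio = tfio.IOTensor.from_audio("long_recording.wav")
samples_per_worker = int(audio.shape[0]) // num_workers
start = worker_id * samples_per_worker
part = audio[start:start + samples_per_worker]
```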
By the way, we are planning to add support for writing audio files as part of the tf graph very soon - likely with wav/ogg/mp4 - so it will be possible to have a complete data pipeline with an audio format in and an audio format out. That should ease usage in most situations.
@faroit By the way, initial mp4 audio support has been added in PR #805. The mp4a support is not ready for prime time yet, as each packet needs a codec API call, so the performance is not expected to be good at the moment. The next step is to batch-decode the packets to speed things up (will have PRs soon).
We use minimp4 for parsing the mp4 container, and use
There are several reasons to go with the above route:
@yongtang mp4a support looks nice.

> There are several reasons to go with the above route

The decisions sound plausible. For inference on Windows, users can still fall back to some of the ffmpeg CLI wrappers.
@yongtang Concerning mp4a, can we add substream selection to the feature list? See https://github.com/tensorflow/io/issues/453
@yongtang Concerning your advice on tfrecords: I am pretty much on the same page. We use tfio for loading WAV files with great success on our own clusters where we have access to fast SSDs. And like many audio researchers, we love that we can still work with audio files directly; we don't really want to move to any other format yet.
But, moving our pipeline to GCS, we found that random access with seeking is way too slow. Hence we are looking for ways to speed up loading times when high latency is an issue. One way would be to somehow batch the audio. I thought maybe it would be a good idea to put a batch of samples into a large number of channels?
Just summing up our use case (probably similar to many other speech/music models):
What tech would you propose for this use-case?
@faroit There are many factors. One potential reason seeking on GCS is slow might be that GCS (or S3/Azure) is not truly a file system; it is just a key-value store with each key mapped to a blob object. In simpler terms, if you try to append one byte to the end of a GCS file, internally GCS will just create another blob and replace the old one (no append, as it is not a 'file system'). Another potential factor could be how the GCS file system is implemented in TensorFlow. I will need to take a look to see if there is room for speed-up. One side note about GCS and S3: once tensorflow's modular file system C API is in place (likely in weeks), the plan is to move GCS and S3 out of the tensorflow core repo and into tensorflow-io. We could revisit and take a deep look once this is in place.
Some additional thinking about data types. We started with WAV, which is natively stored as integers. For that reason we were very reluctant to use float numbers: users may prefer to process the native integers on their own, or may have precision-loss concerns.

Working on ogg reminded me that lossy data formats by nature are represented as float numbers after decoding. Even if we force the decoded output to be integer (e.g. `int16`), the underlying library (e.g., Ogg Vorbis) will just convert the float into int16 anyway.
For that I think it makes sense to: for lossless formats (e.g., WAV/Flac), use the native data type stored; for lossy formats (e.g., OGG), always use float.

/cc @jjedele @faroit for comment.

> For that I think it makes sense to: for lossless formats (e.g., WAV/Flac), use the native data type stored; for lossy formats (e.g., OGG), always use float.
That makes sense, but why not provide a `dtype='float64'` option on the decoder and do the conversion inside the ops?
@faroit In ogg/vorbis the underlying library libvorbis only uses float for internal processing; float64 is not used internally. minimp3 is similar: https://github.com/lieff/minimp3/blob/682006f91a95058ad3927b46c7d99df1d674d3e1/minimp3.h#L30-L35

Notice that `mp3d_sample_t` is defined as `float`, so the maximum precision is float.
> @faroit In ogg/vorbis the underlying library libvorbis only uses float for internal processing; float64 is not used internally. minimp3 is similar: https://github.com/lieff/minimp3/blob/682006f91a95058ad3927b46c7d99df1d674d3e1/minimp3.h#L30-L35
> Notice that `mp3d_sample_t` is defined as `float`, so the maximum precision is float.
yes, sorry for being imprecise. I meant to implement a conversion op so that users will not have to deal with these issues (e.g. see here)
@yongtang regarding GCS:

> in simpler terms, if you try to append one byte to the end of a GCS file, internally GCS will just create another blob and replace the old one (no append, as it is not a 'file system').

Right, but for reading, `tf.io.GFile` seems to support seeking; is that bound by the same limitations as well?
Then, regarding a quick solution: Do you have a cost-effective workaround for training raw audio from a GCS bucket without going through tfrecords serialization?
> Then, regarding a quick solution: Do you have a cost-effective workaround for training raw audio from a GCS bucket without going through tfrecords serialization?
If each file is not large then you could just read the data into memory and decode it from a string. tf.data.Dataset supports parallel operations, so you could use `num_parallel_calls=...`:

```python
dataset = tf.data.Dataset.from_tensor_slices([filename1, filename2, ...])
dataset = dataset.map(lambda f: tf.io.read_file(f), num_parallel_calls=...)
dataset = dataset.map(lambda e: decode_wav(e, ...), num_parallel_calls=...)
```
> If each file is not large then you could just read the data into memory and decode it from a string. tf.data.Dataset supports parallel operations, so you could use `num_parallel_calls=...`
WAVs are too large to keep in memory but with MP3s it might be possible - waiting for #815 then
> I meant to implement a conversion op so that users will not have to deal with these issues (e.g. see here)
There are also some other considerations: WAV could be 24-bit or even higher, and could also be 8-bit, which is unsigned bytes. And users may simply prefer to work with native types (in case of lower precision), as integer is much faster. Another semi-related issue is tensorflow/tensorflow#1763, which is about images but could apply to audio for certain users as well.
> There are also some other considerations: WAV could be 24-bit or even higher, and could also be 8-bit, which is unsigned bytes. And users may simply prefer to work with native types (in case of lower precision), as integer is much faster. Another semi-related issue is tensorflow/tensorflow#1763, which is about images but could apply to audio for certain users as well.
Yes, totally agree here. We use integer augmentations after loading wav with tfio, and converting to float early in the pipeline slows it down a lot.

Then why not go with native formats (either float or int) for the decode/encode ops (+ add good documentation) and offer conversion for `AudioIODataset` and `AudioIOTensor`?
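A minimal sketch of what such a Python-level conversion could look like; `to_float32` is a hypothetical name, not an existing tfio API:

```python
import tensorflow as tf

def to_float32(audio):
    """Convert integer PCM to float32 in [-1, 1]; pass float audio through."""
    if audio.dtype.is_floating:
        return tf.cast(audio, tf.float32)
    if audio.dtype == tf.uint8:
        # 8-bit WAV is unsigned and centered at 128.
        return tf.cast(audio, tf.float32) / 128.0 - 1.0
    # Signed integers: scale by the type's magnitude, e.g. 32768 for int16.
    return tf.cast(audio, tf.float32) / float(audio.dtype.max + 1)
```

That keeps the ops themselves on native types, so integer augmentations stay fast and the float conversion happens only where the user asks for it.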
@faroit Right now I'm unfortunately a bit blocked from working on #815 (had some unexpected changes at work and now things are chaotic). In case you need `decode_mp3` urgently, I can point you to my prototype repo (https://github.com/jjedele/tf-decode-mp3). It doesn't have a convenient pip install though.

My motivation for creating it was similar to yours (GCS, TFRecord). We've seen some really good benefits w.r.t. throughput.
@jjedele thanks, will give it a shot (and probably make comments there instead of here)
> Then why not go with native formats (either float or int) for the decode/encode ops
@faroit Added PR #859 for that.
@jjedele Thanks for the work so far! We are looking at the next release, v0.13.0, aligning with the release of TF 2.2.0 final. I would expect TF 2.2.0 to be released at least several weeks from now (it is at RC0 now; normally there are at least an RC1 and RC2, and each RC takes one or two weeks). Let's see if #815 can be merged before v0.13.0. I can also help carry the PRs if needed.
@jjedele @faroit PR #865 adds `tfio.experimental.audio.decode_mp3` support.
I also tried to come up with an encode_mp3, though I haven't been able to find a feasible mp3 encoder with an Apache/MIT/BSD license, so I skipped that part. If there is enough interest we could revisit what could be done for an mp3 encoder.

On another note, maybe an mp3 encoder is not that relevant after all, as people have now mostly moved to mp4a (with codecs available), or ogg/flac for open source, and mp3 is mostly only suitable for files from the past?
> I also tried to come up with an encode_mp3, though I haven't been able to find a feasible mp3 encoder with an Apache/MIT/BSD license, so I skipped that part. If there is enough interest we could revisit what could be done for an mp3 encoder.
Here is a BSD-licensed mp3 encoder: https://github.com/lieff/mp3-enc-bsd - the quality is not as good as lame though.
@lieff Ah thanks! I will take a look.
@lieff It looks like mp3-enc-bsd is file based, which might not fit tensors, as tensors are in memory. We could create a temporary file and copy it back into a tensor, but that would likely have quite a performance impact.
I took a look at LAME, and its API looks very stable. Since the license is LGPL, we could dynamically load the library at runtime. This works for Linux. (The reason we are trying not to use FFmpeg is that FFmpeg's API changes too frequently.) On Ubuntu and even CentOS, libmp3lame can be installed easily, with stable APIs.
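To sketch the runtime-loading idea (the actual op would do this in C++ via dlopen), a Python/ctypes equivalent; the soname and the `lame_init`/`lame_set_num_channels` calls are from LAME's public C API, but treat the details as assumptions:

```python
import ctypes

# Load libmp3lame lazily at runtime instead of linking against it, so a
# missing library only fails when mp3 encoding is actually used.
lame = ctypes.CDLL("libmp3lame.so.0")
lame.lame_init.restype = ctypes.c_void_p  # returns lame_global_flags*
flags = lame.lame_init()
lame.lame_set_num_channels(ctypes.c_void_p(flags), 2)
```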
I have updated the PR #865 for tfio.experimental.audio.encode_mp3
support (linux only).
For macOS and Windows, while it is still possible to ask users to install the lame library, it is really not a great choice, because users' machines could come with different versions from different channels.
On Windows, I think the native Media Foundation API supports MP3 encoding; that could get around the library-dependency issue.
On macOS, unfortunately AVFoundation does not have mp3 encoder support (might be a patent issue, as the mp3 patent only expired in 2017?). Maybe we will just leave it unsupported, as I imagine people may prefer encode_mp4a over encode_mp3 on macOS anyway.
Yes, currently this BSD code has many drawbacks. It also uses global variables and is not thread safe. Around a week of work on this code would be required to bring it into the needed condition.
Thanks @lieff. I think we are fine with LAME for now, as its API is quite stable.
@faroit @jjedele With PR #865 we are almost there for mp3 encode/decode. Before we finalize the API and move it to `tfio.audio`, there might be one more consideration about naming:

At the moment we mix the usage of container name and codec name for Ogg. In theory Ogg is just a container, so it could hold Vorbis, Flac, or Speex. The same is true for the (soon to be supported) MP4, as MP4 is just a container. See this list of codecs and formats:
https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Audio_codecs

For audio and video I am inclined to use `decode_<codec>`, as the underlying implementation is always codec based. That means we would use `decode_vorbis` for Ogg (Vorbis) and `decode_aac` for MP4 (AAC).
@yongtang I think that makes sense. `decode_aac` sounds completely natural; I also think many other programs use AAC rather than MP4. `decode_vorbis` sounds a bit weird, but as you said, it's consistent.
@yongtang sorry, I was busy and couldn't actually test the issue regarding float conversion of minimp3 decoding inside tf.data...

Does `tfio.IOTensor.graph(tf.int16).from_audio(path)` yield a correctly scaled `tf.int16` for mp3s in the tfio 0.12.0 release?
@faroit In 0.12.0 it takes the int16 which is scaled inside minimp3. In the next release we will use `tf.float32` instead.
With the AAC codec in mp4a, one issue we have to deal with is the encoder delay. Normally AAC adds a delay at the beginning and padding at the end. In one case I noticed that if no delay information is present, the decoded samples from mp4a/aac end up longer than the original PCM.

@lieff @faroit @jjedele do you know if there is a way to find out the delay from an mp4a file?
In mp3, delay and padding information can be stored in the vbr tag. As far as I can see, some similar support exists in the mp4 container for aac: https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFAppenG/QTFFAppenG.html
Thanks @lieff. I didn't realize delay and padding also exist in mp3. The `decode_mp3` I implemented in tensorflow-io didn't take the delay and padding into consideration. I am wondering if it is possible to expose the vbr tag (or the delay and padding info) in minimp3?
All minimp3_ex functions already use information from the vbr tag (if found) to remove delay and padding samples (same as the lame decoder). For mp3dec_ex_t this information is stored in:

- `mp3dec_ex_t::start_delay`: delay samples (pre-multiplied by channels).
- `mp3dec_ex_t::detected_samples`: total_samples - delay_samples - padding_samples (pre-multiplied by channels).
- `mp3dec_ex_t::vbr_tag_found`: nonzero if the vbr tag was found and the above values are filled.
Thanks @lieff! It looks like mp3 is well covered.

Wondering if there is any way mp4 could handle the delay as well? I saw this link:
https://gitlab.com/mbunkus/mkvtoolnix/issues/715
which uses `iTunSMPB` to identify the delay, though it looks like not every mp4 file has this tag.
Not all mp3 files have the vbr tag either. From what I know about aac, different profiles have different algorithmic delays, so delay and padding can't be guessed. Maybe a delay of 2112 is used by most encoders and can be assumed, but padding depends on the real track length (with 1-sample precision).
Thanks @lieff. I found another link: https://forum.doom9.org/showthread.php?p=1563929#post1563929 which captures different delay scenarios. I think having an option allowing the user to pass a `delay` value would be a good choice. If the user gives a `delay` value, it will be applied. Otherwise the delay will be handled on a "best effort" basis, case by case.
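At the Python level the option could behave roughly like this sketch; the `delay` argument and the 2112-sample fallback are assumptions taken from the discussion above:

```python
def trim_encoder_delay(audio, delay=None):
    """Drop encoder-delay samples from a decoded [samples, channels] tensor."""
    if delay is None:
        delay = 2112  # best-effort fallback; most AAC encoders use this delay
    return audio[delay:]
```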
@lieff On second thought, maybe the delay and padding are not that important for machine learning. The result may not be exactly precise, but several ms of difference is not terrible for either training or inference. So we could leave this issue alone and see if there is a concrete need from users for precise decoding first.
> @lieff On second thought, maybe the delay and padding are not that important for machine learning. The result may not be exactly precise, but several ms of difference is not terrible for either training or inference. So we could leave this issue alone and see if there is a concrete need from users for precise decoding first.
I fully agree. I don't think we should solve problems here that libs such as ffmpeg have been trying to deal with for a long time. For example, ffmpeg's delay handling is not stable across versions... in one version you get a few samples less from one of the encoders/decoders than in other (older) versions.
@faroit I have been trying to implement `decode_aac` on macOS with AVFoundation. I am very close to getting it done (already ignoring the output delay). However, I noticed one issue we need to fix even if we don't care about the encoder delay.

When I use AVFoundation's AVAudioConverter to convert from mp4 (aac), the generated pcm was very noisy and obviously something was wrong. After digging through and comparing with Apple's AVAsset output, it looks like Apple believes the first two packets from minimp4 do not belong to the sound track.

Apple also stripped the last packet, but this is less of an issue for us, as we can afford to lose 1024 samples at the end for machine learning. But we cannot afford a big mismatch of the sound overall (due to two additional packets injected into the decoder).

Note this is different from the output delay we discussed, as the output delay requires dropping samples *after* decoding happens. Here we need to drop the first two packets from minimp4 *before* the decoding happens; otherwise the generated sound mismatches greatly.

I searched through the Atom list but couldn't find anything obvious indicating that the first two packets should be dropped. Do you know if there is any reason for that?
@faroit I found out that if I drop only the first packet (1024 samples) then the decoding seems to be fine as well (no noisy background). My guess is that this might be related to the decoder delay. Still not exactly clear how it works, or whether there is a definitive number we could use, though.
@yongtang Can you share such .m4a file? I can take a look.
This is about the API design of audio processing in tfio. Comments and discussion are welcome.
When we started to get into reading audio for tfio, we began with a feature request from the community for 24-bit WAV files, which was not possible in the tf core repo. Since then, we have also supported `Dataset` types of access, which are essentially sequential reads, and random access (AudioIOTensor), which allows python `__getitem__()` and `__len__()` style access in a tensorflow graph.

Later, we added support for additional audio formats such as `Flac`, `Ogg`, `MP3`, and `MP4`, in addition to the already supported `WAV` files. Note that `WAV`, `Flac`, `Ogg`, and `MP3` do not require external .so dependencies. For `MP4`, we use AVFoundation on macOS but need FFmpeg's .so files on Linux. We also added `resample` as a basic op (not a python class) recently, upon request from the community.

Recently, new requests came in to add support for decoding in-memory audio input into tensors (see PR #815). So the APIs for audio processing in tfio will expand even further.
Initially we had not truly come up with a good audio API design, as those features were added gradually, spanning more than a year. Now that the features have gradually taken shape, I think it is a good point to start laying out tfio.audio and revisiting the APIs.
The audio processing in the tf repo is pretty limited and really not a good template for us to follow. I also looked into pytorch's audio package. While pytorch's APIs are typically very clean, the audio-related APIs are not as clean as some other parts. pytorch also relies on sox, which could be an issue (though sometimes a benefit, depending on the scenario).
We can summarize the use cases that have been covered:

- `AudioIOTensor`: supports `__getitem__`, `__len__`, `shape`, `dtype`, and `rate` APIs in python and can be used in graph mode (certain restrictions apply). IOTensor is a random access that allows reading data at any index (`__getitem__`). This is very useful in situations where the user wants to access just a small clip of the audio. The `shape` API allows the user to get the number of channels and samples: `shape = [samples, channels]`. The shape and dtype are probed automatically in eager mode; the dtype has to be provided by the user in graph mode. It is file based (or callback based) and will not load the whole clip into memory (lazy-loaded).
- `decode`/`decode_mp3`/`decode_mp4`/etc (see discussion in #815), where the API could be used to decode a string (in memory) into a tensor. This is useful in situations where we have already loaded a small clip of audio into memory and just want to do the decoding. As a separate discussion, we may want to expose a `metainfo()` API in audio to return to the user the shape, dtype, and rate of the audio clip in kernel ops (see the sketch after this list).
- `encode`/`encode_mp3`/`encode_mp4`/etc, which is the counterpart of decode. We haven't added any ops yet, though I think the need is obvious.
- `resample` is an audio processing API we recently added. We may want to expand more in the future.
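A minimal sketch of the decode/`metainfo()` pairing described in the decode item; `tfio.audio.metainfo` and `tfio.audio.decode_flac` are proposed names, not finalized APIs:

```python
import tensorflow as tf
import tensorflow_io as tfio

contents = tf.io.read_file("clip.flac")   # a small clip already in memory
info = tfio.audio.metainfo(contents)      # proposed: probes shape, dtype, rate
audio = tfio.audio.decode_flac(contents)  # proposed: -> [samples, channels]
```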
Given the above I think the layout could be:
The list could be easy, though the details could be challenging. For example, in tf:

Do we really want to provide the options of `desired_channels` and `desired_samples`? My understanding is that we already provide AudioIODataset and AudioIOTensor, which allow lazy-loaded partial access (either sequential or completely random). I don't see that we want to manipulate further; any additional manipulation can be done after a tensor has been returned. For example, reducing stereo to mono is just dropping one dim of the tensor (after the tensor is returned from decode_wav), as sketched below.
Another thing is the dtype. For a custom op to work in graph mode, the shape is not needed, because the shape can be passed as unknown dimensions. But at the minimum a dtype has to be provided (unless it is known beforehand). For example, even in the `decode_wav` case, the above will not work for 24-bit wav files. We do have to provide an API with minimal input (a sketch follows below).

There might be some other details to sort out. /cc @jjedele @terrytangyuan @BryanCutler, also cc @faroit @lieff in case you are interested.
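A sketch of the minimal-input signature; the exact argument set is an assumption for discussion, not a finalized API:

```python
import tensorflow as tf
import tensorflow_io as tfio

contents = tf.io.read_file("sample_24bit.wav")
# dtype is the one piece of information the op cannot infer in graph mode;
# e.g. 24-bit WAV samples could be returned widened to int32.
audio = tfio.audio.decode_wav(contents, dtype=tf.int32)
```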