tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

Audio Processing API and tfio.audio #839

Open yongtang opened 4 years ago

yongtang commented 4 years ago

This is about the API design of audio processing in tfio. Comments and discussion are welcome.

When we started to get into reading audio for tfio, we began with a feature request from the community for 24-bit WAV files, which was not possible in the tf core repo. Since then, we have also supported Dataset-style access, which is essentially a sequential read, and random access (AudioIOTensor), which allows Python __getitem__() and __len__() style access inside a TensorFlow graph.

Later, we added support for additional audio formats such as Flac, Ogg, MP3, and MP4, in addition to the already supported WAV. Note that WAV, Flac, Ogg, and MP3 do not require external .so dependencies. For MP4, we use AVFoundation on macOS, but need FFmpeg's .so files on Linux. We also recently added resample as a basic op (not a Python class) upon request from the community.

Recently, new requests have come in to add support for decoding in-memory audio into tensors (see PR #815). So the APIs for audio processing in tfio will expand even further.

Initially we did not truly come up with a good audio API design, as those features were added gradually over more than a year. Now that the features are taking shape, I think it is a good point to start laying out tfio.audio and revisit the APIs so that we:

  1. Capture the usage from the community for different use cases.
  2. Allow expansion in the future (our focus was on decoding but encoding will be added soon).

The audio processing in the tf repo is quite limited and is really not a good template for us to follow. I also looked into PyTorch's audio package. While PyTorch's APIs are typically very clean, the audio-related APIs are not as clean as some other parts. PyTorch also relies on sox, which could be an issue (though sometimes a benefit, depending on the scenario).

We can summarize the use cases that have been covered:

  1. AudioIODataset, which is a subclass of tf.data.Dataset and can be passed to tf.keras. A Dataset is sequential access (not random access) that allows iteration from the beginning to the end. The shape and dtype are probed automatically in eager mode; the dtype has to be provided by the user in graph mode. It is normally file-based (or callback-based) and will not load the whole clip into memory (lazy-loaded).
  2. AudioIOTensor, which exposes __getitem__, __len__, shape, dtype, and rate APIs in Python and can be used in graph mode (certain restrictions apply). An IOTensor is random access that allows reading data at any index (__getitem__). This is very useful in situations where the user wants to access just a small clip of the audio. The shape API allows the user to get the number of channels and samples; shape = [samples, channels]. The shape and dtype are probed automatically in eager mode; the dtype has to be provided by the user in graph mode. It is file-based (or callback-based) and will not load the whole clip into memory (lazy-loaded). (See the sketch after this list for typical usage of 1 and 2.)
  3. decode/decode_mp3/decode_mp4/etc. (see discussion in #815), where the API can be used to decode a string (in memory) into a tensor. This is useful in situations where we have already loaded a small audio clip into memory and just want to do the decoding. As a separate discussion, we may want to expose a metainfo() API in audio to return the shape, dtype, and rate of the audio clip from kernel ops.
  4. encode/encode_mp3/encode_mp4/etc., which is the counterpart of 3. We haven't added any ops yet, though the need is obvious.
  5. resample is an audio processing API we recently added. We may want to expand more in the future.
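
For illustration, here is a minimal sketch of how 1 and 2 above are typically used (the from_audio constructors are assumed for the sketch; treat it as pseudo-usage rather than the final API):

import tensorflow as tf
import tensorflow_io as tfio

# Sequential access (item 1): iterate the clip chunk by chunk, suitable for tf.data/tf.keras.
dataset = tfio.IODataset.from_audio("sample.wav")   # assumed constructor
for chunk in dataset.batch(1024):
    pass  # process 1024 samples at a time

# Random access (item 2): probe shape/dtype/rate and read only a small window.
audio = tfio.IOTensor.from_audio("sample.wav")      # assumed constructor
print(audio.shape, audio.dtype, audio.rate)         # shape = [samples, channels]
window = audio[16000:32000]                         # e.g. the second 1s of a 16kHz clip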

Given the above I think the layout could be:

tfio.audio
  - AudioIODataset
  - AudioIOTensor
  - decode
  - decode_mp3
  - decode_mp4
  - decode_wav
  - encode
  - encode_mp3
  - encode_mp4
  - encode_wav
  - resample

The list may be easy, though the details could be challenging. For example, in tf:

tf.audio.decode_wav(
    contents, desired_channels=-1, desired_samples=-1, name=None
)

Do we really want to provide the options of desired_channels and desired_samples? My understanding is that we already provide AudioIODataset and AudioIOTensor, which allow lazy-loaded partial access (either sequential or completely random). I don't see that we want to manipulate further. Any additional manipulation could be done after a tensor has been returned.

For example, reducing stereo to mono is just dropping one dimension of the tensor (after the tensor is returned from decode_wav).
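
For instance, a tiny sketch with plain TensorFlow ops on a stand-in tensor (no decoder option involved):

import tensorflow as tf

stereo = tf.zeros([16000, 2], dtype=tf.int16)   # stand-in for a decoded [samples, channels] clip
mono_left = stereo[:, 0:1]                      # keep channel 0 only, shape [16000, 1]
mono_mix = tf.cast(                             # or average the two channels
    tf.reduce_mean(tf.cast(stereo, tf.float32), axis=-1, keepdims=True), tf.int16)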

Another thing is the dtype. In order for a custom op to work in graph mode, the shape is not needed because the shape can be passed as unknown dimensions. But, at a minimum, a dtype has to be provided (unless it is known beforehand). For example, even in the decode_wav case, the above will not work for a 24-bit WAV file. We do have to provide an API with minimal input:

tfio.audio.decode_wav(input, dtype, name=None) 
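
As a sketch, the proposed minimal signature above could then be used in a graph-mode tf.data pipeline like this (decode_wav with a dtype argument is the proposal here, not a released API; dtype is fixed at graph construction time while the shape stays unknown):

import tensorflow as tf
import tensorflow_io as tfio

filenames = tf.data.Dataset.list_files("clips/*.wav")
audio = filenames.map(
    lambda f: tfio.audio.decode_wav(tf.io.read_file(f), dtype=tf.int16))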

There might be some other details to sort out. /cc @jjedele @terrytangyuan @BryanCutler, also cc @faroit @lieff in case you are interested.

jjedele commented 4 years ago

Thank you for writing up this detailed issue, @yongtang !

I'll try to connect the current API with what would be use cases in a ML context:

I think all of these are valid.

W.r.t. desired_channels and desired_samples, you have convinced me that we do not need to add them to the operators. Maybe we could keep them as a Python-based helper function in the audio module though. I'm a big fan of enforcing explicit shapes - few errors are more annoying than having your training crash after hours because for some reason one file in your dataset is stereo instead of mono.

For data type, why can we not simply make it float32 normalized to [-1,+1]? This is usually what you would do anyway and also what downstream functions from tf.signal expect. If you need "raw" samples for whatever reason, you can just scale them back up yourself.
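
The normalization described here is a one-liner either way; a minimal sketch of scaling int16 PCM to float32 in [-1, 1] and back:

import tensorflow as tf

pcm16 = tf.constant([0, 16384, -32768], dtype=tf.int16)   # raw int16 samples
normalized = tf.cast(pcm16, tf.float32) / 32768.0          # float32 in [-1, 1]
raw_again = tf.cast(normalized * 32768.0, tf.int16)        # scale back up for "raw" samples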

yongtang commented 4 years ago

@jjedele for eager decode: actually it should be graph-or-eager decode, as we could make the decode work in both graph and eager modes.

The decode op itself covers a very useful scenario and could be coupled with other APIs to build a full continuous pipeline. For example, we could combine Kafka with audio decode, so that Kafka is used to read incoming audio data continuously and then decode it:

dataset = tfio.IODataset.from_kafka()
dataset = dataset.map(lambda message: tfio.audio.decode_mp3(message))
# further processing
...

On always making it float32: a semi-related issue https://github.com/tensorflow/tensorflow/issues/1763 probably explains why we want to defer the conversion to the user explicitly. Users may have different use cases, and the same applies to audio as well.

For desired_channels and desired_samples: if we truly want to pass the shape information, I would suggest passing a shape as an input tensor instead. Note that in tf.audio.decode_wav, desired_channels and desired_samples are passed as attributes, which means you cannot use them in a graph data pipeline where each element may have a different shape. Passing a shape tensor allows the decision to be made at graph run time rather than graph construction time.

We could also expose a tfio.audio.info API which parses the input and outputs the shape, rate, and dtype information. Then the shape could be passed to decode in graph mode:

info = tfio.audio.info(input)
shape = info["shape"]
decoded = tfio.audio.decode(input, shape=info["shape"], dtype=None)

Also note that while the shape can be passed at graph run time, the dtype has to be passed at graph construction time. So the following will only work in eager mode:

info = tfio.audio.info(input)
shape = info["shape"]
dtype = tf.as_dtype(info['dtype'].numpy()) # !!! <= numpy() only works in eager mode
decoded = tfio.audio.decode(input, shape=shape, dtype=dtype)
jjedele commented 4 years ago

@yongtang Ah yes, with "eager" I meant decoding the whole data at once. It should obviously work in both modes.

For dtype you're right, adding it to the API explicitly is probably sensible if we want to cover the majority of use cases.

For desired_channels and desired_samples I think our understanding differs. What you describe is exactly what I would want: enforcing static dimensions for data which in reality might differ slightly. Imagine I'm having a CNN trained to classify 1s snippets of mono audio, i.e. it expects a spectrogram of fixed dimension n x m. Now in practice my data files might be 0.98s, 1.02s, ... and 5% of the files are stereo instead of mono due to some weird problem with the data collection process. Here it is super helpful to be able to enforce a static shape, if necessary by padding/cropping. In cases where we would want to pass in a dynamic shape tensor, we probably can just leave the dimensions as unknown, as the model needs to handle them dynamically anyway. But I'm absolutely fine with making this a Python-level helper function instead of integrating it into the core operators.
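
A sketch of the kind of Python-level helper being discussed (enforce_shape is a hypothetical name, not an existing tfio function), which crops/pads a decoded clip to a static shape and works with dynamic input shapes in graph mode:

import tensorflow as tf

def enforce_shape(audio, samples, channels):
    # Crop anything beyond the requested size, e.g. extra channels or extra samples.
    audio = audio[:samples, :channels]
    # Zero-pad whatever is missing so the result is exactly [samples, channels].
    pad_s = samples - tf.shape(audio)[0]
    pad_c = channels - tf.shape(audio)[1]
    audio = tf.pad(audio, [[0, pad_s], [0, pad_c]])
    return tf.ensure_shape(audio, [samples, channels])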

yongtang commented 4 years ago

@jjedele Thanks. That makes a lot of sense. I think what we could do is add additional audio processing APIs such that, after the audio is read into a tensor (in its original shape), we could easily run those APIs to make it conform to the input you expect.

For example:

audio = tfio.audio.decode_wav(...)  # <= in a non-regular shape of [100, 2], and a dtype of int16
audio = tfio.audio.normalize(audio, shape=[50, 1], dtype=tf.float32) 

Will the above work?

yongtang commented 4 years ago

@jjedele By the way, there is already a tfio.experimental.audio.resample which changes the sample rate. The normalize could very well be another API similar to resample.

yongtang commented 4 years ago

@jjedele Regarding random access versus sequential access: random access needs a precise sample offset, while sequential access only needs a Next API which could return any number of samples.

Not limiting the number of samples would be useful, because in many formats samples are grouped into packets which could contain any number of samples. Next could just return the next chunk of samples as naturally divided (by packet/etc.).

So I think all we need is something like below to cover all the cases discussed above:

virtual Status Read(int64 start, int64 stop, allocator) = 0; // Note: to read the whole content, use start=0, stop=-1
virtual TensorShape shape() = 0;
virtual DataType dtype() = 0;
virtual int64 rate() = 0;
yongtang commented 4 years ago

Update: I will take the Next part back. It would limit the case where a tf.data.Dataset pipeline has more than one thread.

jjedele commented 4 years ago

@yongtang Yeah, treating this the same way as the resample operation does make perfect sense to me.

faroit commented 4 years ago

Hi @yongtang and @jjedele, I really appreciate your motivation to make tfio a great audio hub. Just a few short points about use cases for tfio:

I think the most important use case for training is the use of AudioIOTensor inside tf.data pipelines. I think this already works nicely, and it would be great if more decoders (e.g. mp4) could be supported through the ffmpeg ops.

  • eager decode: We have comparatively small files we process completely. Examples are common ASR corpora like LibriSpeech and Mozilla CommonVoice (samples are sentences encoded as MP3/FLAC, typically <10s). Also useful if we don't have the files in raw form, but e.g. in TFRecords or some database.

I think this is very useful and would probably be used in practice a lot. @jjedele I am not sure if this would only be useful for short files. Given that decoders such as minimp3 can decode any given byte sequence (even with headers missing), it would make sense to include seeking support for in-memory decoding as well.

Imagine the following scenario: millions of medium-length audio files that should be streamed into the training pipeline in small (random) chunks. If a user used tf.data + AudioIOTensor, they would probably get good performance, but I/O will likely be the bottleneck (especially on GCS) when the chunks to be read are small (a few seconds). Therefore, audio data should be written batched into a container such as tfrecords or a database, as pointed out by @jjedele. I personally think, given the great performance and robustness of minimp3, that mp3 would be a great format for such a use case and a decode_mp3 should be added soon.

Now the question is where would users apply the chunking?

  1. write out chunked and batched tfrecord files with mp3s (deterministic)
  2. use lazy load of byte data from the tfrecords and decode the full byte string with decode_mp3 (see the sketch after this list)
  3. load the full byte sequence and apply decode_mp3(start...)
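
A sketch of option 2 above, assuming the decode_mp3 op proposed in #815 lands as tfio.audio.decode_mp3 and that each TFRecord example carries the raw MP3 bytes under an "audio" feature (both names are assumptions for illustration):

import tensorflow as tf
import tensorflow_io as tfio

feature_spec = {"audio": tf.io.FixedLenFeature([], tf.string)}

def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    return tfio.audio.decode_mp3(example["audio"])   # decode the full byte string

dataset = tf.data.TFRecordDataset(["clips-0.tfrecord"])
dataset = dataset.map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)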
yongtang commented 4 years ago

@faroit The tfrecord format may not be a good solution in many cases. tfrecord is a sequential serialization based on protobuf. It serves the purpose of generic serialization (as protobuf is a generic serialization that can take any input), but has quite a few limitations: 1) No indexing support, so performance will be very bad unless you read it sequentially. 2) While serialization and deserialization of protobuf may not be bad compared with other generic serialization formats (e.g., thrift/avro/etc.), the overhead is still there as tfrecord is generic. 3) It is yet another storage format that is not widely used outside. In organizations that are not solely focused on tf-based ML, the format is pretty isolated. As an example, mp3 or mp4 could be consumed not just by machine learning researchers but also by almost anyone else. This is not the case for tfrecord: you will need dedicated storage (plus maintenance by devops/SREs) for tfrecord that only has one consumer for this format, plus converting back and forth so that someone else can use it.

Honestly, for audio files, even the WAV format could probably be a better choice than tfrecord:

  1. WAV is a simple format with very minimal overhead - pretty much just a memcpy operation for 16-bit or 32-bit samples.
  2. It allows random access (vs. sequential), which is crucial for distributing the workload.
  3. It could be consumed by anyone (not just machine learning researchers with dedicated tools) and could be played immediately for validation.

We emphasize random access because with random access there is the potential to distribute the workload across large clusters. For example, we could launch a cluster with 100 nodes and each node will only process a part of an audio file. If a file only supports sequential access, then in order to distribute the workload you have to let every node read the file from the beginning up to its split location.
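
As a rough sketch of that idea (worker_slice is a hypothetical helper; it leans on AudioIOTensor's shape probing and __getitem__ slicing described earlier, in eager mode):

import tensorflow_io as tfio

def worker_slice(path, worker_index, num_workers):
    audio = tfio.IOTensor.from_audio(path)        # lazy: nothing is decoded yet
    samples = int(audio.shape[0])                 # total number of samples in the file
    per_worker = samples // num_workers
    start = worker_index * per_worker
    stop = samples if worker_index == num_workers - 1 else start + per_worker
    return audio[start:stop]                      # each node reads only its own range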

By the way, we are planning to add support for writing audio files as part of the tf graph very soon - likely with wav/ogg/mp4 - so it is possible to have a complete data pipeline with audio format in and audio format out. That should ease the usage in most of the situations.

yongtang commented 4 years ago

@faroit By the way, the initial mp4 audio support has been added in PR #805. The mp4a support is not ready for prime time yet, as each packet needs a codec API call, so the performance is not expected to be good at the moment. The next step is to batch-decode the packets to speed things up (PRs will follow soon).

We use minimp4 for parsing the mp4 container, and use the native platform APIs (e.g., AVFoundation on macOS) for the codec.

There are several reasons to go with the above route:

  1. MP4 has potential license/patent issues. With native macOS and Windows APIs that is already covered (not the case for Linux or FFmpeg).
  2. FFmpeg's API differs from version to version, so it is really hard to make the same code link against different versions. In tensorflow-io we only pin to the libavcodec versions that shipped with Ubuntu 16.04 and 18.04, and even with that we are already facing lots of issues.
  3. FFmpeg also does not work well on macOS or Windows for the same API compatibility reasons. We pin to the system-provided version on Ubuntu 16.04 and 18.04 for Linux, but for macOS and Windows, which version should we pin? Users can install FFmpeg from so many distribution channels, with whichever version they like. It is just not practical to provide FFmpeg support for macOS and Windows.
faroit commented 4 years ago

@yongtang mp4a support looks nice.

There are several reasons to go with the above route

The decisions sound plausible. For inference on Windows, users can still fall back to some of the ffmpeg CLI wrappers.

faroit commented 4 years ago

@yongtang Concerning mp4a, can we add substream selection to the feature list? See https://github.com/tensorflow/io/issues/453

faroit commented 4 years ago

@yongtang Concerning your advice on tfrecords: I am pretty much on the same page, and we use tfio for loading WAV files with great success on our own clusters where we have access to high-performance SSDs. And like many audio researchers, we love that we can still work with audio files directly, and we don't really want to move to any other format yet.

But, moving our pipeline to GCS, we found that random access with seeking is way too slow. Hence we are looking for ways to speed up the loading times when high latency is an issue. One way would be to somehow batch the audio. I thought maybe it would be a good idea to put a batch of samples into a large number of channels?

Just summing up our use case (probably similar to many other speech/music models):

What tech would you propose for this use-case?

yongtang commented 4 years ago

@faroit There are many factors. One potential reason seeking on GCS is slow might be that GCS (or S3/Azure) is not truly a file system; instead it is just a key-value store with each key mapped to a blob object. In simpler terms, if you try to append one byte to the end of a GCS file, internally GCS will just create another blob and replace the old one (no append, as it is not a 'file system'). Another potential factor could be how the GCS file system is implemented in TensorFlow. I will need to take a look to see if there is room for a speedup. One side note about GCS and S3: once TensorFlow's modular file system C API is in place (likely in weeks), the plan is to move GCS and S3 out of the tensorflow core repo and into tensorflow-io. We could revisit and take a deep look once this is in place.

yongtang commented 4 years ago

Additional thinking about data types. We started with WAV, which is natively stored as integers. For that reason we were very reluctant to use float numbers: users may prefer to process the native integers on their own, or may have precision-loss concerns.

While working on ogg, it occurred to me that lossy data formats are by nature presented as float numbers after decoding. Even if we force the decoded output to be integer (e.g., int16), the underlying library (e.g., Ogg Vorbis) will just convert float into int16 anyway.

For that I think it makes sense to

  1. For lossless formats (e.g., WAV/Flac) use the native data type stored.
  2. For lossy formats (e.g., OGG) always use the float.

/cc @jjedele @faroit for comment.

faroit commented 4 years ago

For that I think it makes sense to

For lossless formats (e.g., WAV/Flac) use the native data type stored. For lossy formats (e.g., OGG) always use the float.

That makes sense but why not provide a dtype='float64' to the decoder and do the conversion inside the ops?

yongtang commented 4 years ago

@faroit In ogg/vorbis the underlying library libvorbis only uses float for internal processing. float64 is not used internally. minimp3 is similar: https://github.com/lieff/minimp3/blob/682006f91a95058ad3927b46c7d99df1d674d3e1/minimp3.h#L30-L35

Notice that mp3d_sample_t is defined as float so the max precision is float.

faroit commented 4 years ago

@faroit In ogg/vorbis the underlying library libvorbis only use float for internal processing. The float64 is not used internally. The minimp3 is similar: https://github.com/lieff/minimp3/blob/682006f91a95058ad3927b46c7d99df1d674d3e1/minimp3.h#L30-L35

Notice that mp3d_sample_t is defined as float so the max precision is float.

Yes, sorry for being imprecise. I meant implementing a conversion op so that users will not have to deal with these issues (e.g. see here).

faroit commented 4 years ago

@yongtang regarding GCS

in a more simple term, if you try to append one byte to the end of one GCS file, internally GCS actually will just create another blob and replace the old one (no append as it is not a 'file system').

Right, but for reading, tf.io.GFile seems to support seeking, so is that bound by the same limitations as well?

Then, regarding a quick solution: Do you have a cost-effective workaround for training raw audio from a GCS bucket without going through tfrecords serialization?

yongtang commented 4 years ago

Then, regarding a quick solution: Do you have a cost effective work around for training raw audio from a GCS bucket without going through tfrecords serialization?

If each file is not large, then you could just read the data into memory and decode the string. tf.data.Dataset supports parallel operations, so you could use num_parallel_calls=...

dataset = tf.data.Dataset.from_tensor_slices([filename1, filename2, ...])
dataset = dataset.map(lambda f: tf.io.read_file(f), num_parallel_calls=...)
dataset = dataset.map(lambda e: decode_wav(e,...), num_parallel_calls=...)
faroit commented 4 years ago

If each file is not large then you could just read the data into memory and do the decode of string. The tf.data.Dataset support parallel operations so you could use num_parallel_calls=...

WAVs are too large to keep in memory but with MP3s it might be possible - waiting for #815 then

yongtang commented 4 years ago

I meant to implement a conversion op so that users will not have to deal with these issues (e.g. see here)

There are also some other considerations, as WAV could be 24-bit or even higher, and could also be 8-bit, which is unsigned bytes. And users may simply prefer to work with native types (in case of lower precision) as integer is much faster. Another semi-related issue is tensorflow/tensorflow#1763, which is about images but could apply to audio for certain users as well.

faroit commented 4 years ago

There are also some other considerations, as WAV could have 24bit or even higher, and could also be 8bit which is unsigned bytes. And user may simply prefer to work on native types (in case of lower precision) as integer is much faster. Another semi-related issue is tensorflow/tensorflow#1763 which is about image, but could apply to audio for certain users as well.

Yes, totally agree here. We use integer augmentations after loading WAV with tfio, and converting to float early slows down the pipeline a lot.

Then why not go with native formats (either float or int) for the decode/encode ops (+ add good documentation) and offer conversion for AudioIODataset and AudioIOTensor?

jjedele commented 4 years ago

@faroit Right now I'm unfortunately a bit blocked from working on #815 (had some unexpected changes at work and now things are chaotic). In case you need decode_mp3 urgently, I can point you to my prototype repo (https://github.com/jjedele/tf-decode-mp3). Doesn't have a convenient pip install though.

My motivation for creating it was similar to yours (GCS, TFRecord). We've seen some really good benefits wrt throughput.

faroit commented 4 years ago

@jjedele thanks, will give it a shot (and probably make comments there instead of here)

yongtang commented 4 years ago

Then why not go with native formats (either float or int) for the decode/encode ops

@faroit Added PR #859 for that.

yongtang commented 4 years ago

@jjedele Thanks for the work so far! We are looking at the next release, v0.13.0, aligning with the release of TF 2.2.0 final. I would expect TF 2.2.0 to be released at least several weeks from now (it is at RC0 now; normally there are at least RC1 and RC2, and each RC takes one or two weeks). Let's see if #815 can be merged before v0.13.0. I can also help carry the PRs if needed.

yongtang commented 4 years ago

@jjedele @faroit PR #865 adds tfio.experimental.audio.decode_mp3 support.

I also tried to come up with an encode_mp3, though I haven't been able to find a feasible mp3 encoder with an Apache/MIT/BSD license, so I just skipped this part. If there is enough interest, we could revisit to see what could be done for an mp3 encoder.

On another note, maybe an mp3 encoder is not that relevant after all, as people have now mostly moved to mp4a (with codecs available), or ogg/flac for open source, and mp3 is mostly only relevant for legacy files?

lieff commented 4 years ago

I also tried to come up with a encode_mp3, though I haven't been able to find a feasible mp3 encoder with Apache/MIT/BSD License. So just skipped this part. If there is enough interests we could revisit to see what could be done for mp3 encoder.

Here is a BSD-licensed mp3 encoder: https://github.com/lieff/mp3-enc-bsd The quality is not as good as LAME though.

yongtang commented 4 years ago

@lieff Ah thanks! I will take a look.

yongtang commented 4 years ago

@lieff It looks like mp3-enc-bsd is file-based, which might not fit tensors, as tensors are in memory. We could create a temporary file and copy it back into a tensor, but that would likely have quite some performance impact.

I took a look at LAME; it looks like the API is very stable. And since the license is LGPL, we could dynamically load the library at runtime. This works for Linux. (The reason we are trying not to use FFmpeg is that FFmpeg's API changes too frequently.) On Ubuntu and even CentOS, libmp3lame can be installed easily with stable APIs.

I have updated the PR #865 for tfio.experimental.audio.encode_mp3 support (linux only).

For macOS and Windows, while it is still possible to ask users to install the LAME library, it is really not a great choice because users' machines could come with different versions from different channels.

On Windows, I think the native Media Foundation API supports MP3 encoding, which could get around the library dependency issue.

On macOS, unfortunately AVFoundation does not have mp3 encoder support (might be a patent issue, as the mp3 patent only expired in 2017?). Maybe we will just leave it unsupported, as I imagine people may prefer encode_mp4a over encode_mp3 on macOS.

lieff commented 4 years ago

Yes, currently this BSD code has many drawbacks. It also uses global variables and is not thread safe. Around a week of work with this code would be required to get it into the needed condition.

yongtang commented 4 years ago

Thanks @lieff. I think we are fine with LAME for now, as its API is quite stable.

yongtang commented 4 years ago

@faroit @jjedele With PR #865 we are almost there for mp3 encode/decode. Before we finalize the API and move to tfio.audio, there might be one more consideration about naming:

At the moment we mix the usage of container names and codec names for Ogg. In theory, Ogg is just a container, so it could contain Vorbis, Flac, or Speex. The same is true for (soon-to-be-supported) MP4, as MP4 is just a container. See the list of codecs and formats:

https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Audio_codecs

For audio and video I am inclined to use decode_<codec>, as the underlying implementation is always codec-based.

That means we will use decode_vorbis for Ogg (Vorbis) and decode_aac for MP4 (AAC).

jjedele commented 4 years ago

@yongtang I think that makes sense. decode_aac sounds completely natural; I also think many other programs use AAC rather than MP4. decode_vorbis sounds a bit weird, but as you said, it's consistent.

faroit commented 4 years ago

@yongtang sorry, I was busy and couldn't actually test the issue regarding float conversion of minimp3 decoding inside tf.data...

does tfio.IOTensor.graph(tf.int16).from_audio(path) yield a correctly scaled tf.int16 for mp3s in the tfio 0.12.0 release?

yongtang commented 4 years ago

@faroit in 0.12.0 it takes the int16 which is scaled in minimp3 internally. In the next release we will use tf.float32 instead.

yongtang commented 4 years ago

In mp4a with the AAC codec, one issue we have to deal with is the encoder delay. Normally AAC adds a delay at the beginning and padding at the end. In one case I noticed that if no delay information is present, the number of decoded samples from mp4a/aac will be larger than the original PCM.

@lieff @faroit @jjedele do you know if there is a way to find out the delay from the mp4a file?

lieff commented 4 years ago

In mp3, delay and padding information can be stored in the vbr tag. As far as I can see, some similar support exists in the mp4 container for aac: https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFAppenG/QTFFAppenG.html

yongtang commented 4 years ago

Thanks @lieff. I didn't realize delay and padding also exist in mp3. In tensorflow-io, the decode_mp3 I implemented didn't take the delay and padding into consideration. I am wondering if it is possible to expose the vbr tag (or delay and padding info) in minimp3?

lieff commented 4 years ago

All minimp3_ex functions already use information from the vbr tag (if found) to remove delay and padding samples (same as the lame decoder). For mp3dec_ex_t this information is stored as:

  • mp3dec_ex_t::start_delay: delay samples (pre-multiplied by channels).
  • mp3dec_ex_t::detected_samples: total_samples - delay_samples - padding_samples (pre-multiplied by channels).
  • mp3dec_ex_t::vbr_tag_found: nonzero if the vbr tag was found and the above values are filled.

yongtang commented 4 years ago

Thanks @lieff ! looks like mp3 has been well covered.

Wondering if there is any way mp4 could handle delay as well?

I saw this link:

https://gitlab.com/mbunkus/mkvtoolnix/issues/715

which uses iTunSMPB to identify the delay, though it looks like not every mp4 file has this tag.

lieff commented 4 years ago

Not all mp3 files have a vbr tag either. From what I know about aac, different profiles have different algorithmic delays, so delay and padding can't be guessed. Maybe a 2112-sample delay is used in most encoders and can be assumed, but padding depends on the real track length (with 1-sample precision).

yongtang commented 4 years ago

Thanks @lieff . I found another link: https://forum.doom9.org/showthread.php?p=1563929#post1563929

This link captures different delay scenarios. I think having an option to allow the user to pass a delay value would be a good choice. If the user gives a delay value, it will be applied. Otherwise the delay will be handled on a "best effort" basis, case by case.

yongtang commented 4 years ago

@lieff On second thought, maybe the delay and padding are not as important in machine learning. It may not be exactly precise, but in machine learning a difference of several ms is not terrible for either training or inference. So we could leave this issue alone and see if there is a concrete need from users for precise decoding first.

faroit commented 4 years ago

@lieff On a second thought, maybe the delay and padding is not as important in machine learning. It may not be exactly precise, but in machine learning, several ms different is not terrible for either training and inference. So we could leave this issue alone, and see if there is a concrete need from users to ask for precise decoding first.

I fully agree. I don't think we should solve problems here that libs such as ffmpeg have been trying to deal with for a long time. For example: ffmpeg's delay handling is not stable across versions... in one version you can get a few samples less from the encoder/decoder than in other (older) versions.

yongtang commented 4 years ago

@faroit I have been trying to implement decode_aac on macOS with AVFoundation. I am very close to getting it done (already ignoring the output delay). However, I noticed one issue we need to fix even if we don't care about the encoder delay.

When I use AVFoundation's AVAudioConverter to convert from mp4 (aac), the generated PCM is very noisy and obviously something is wrong. After digging through and comparing with Apple's AVAsset output, it looks like Apple believes the first two packets from minimp4 do not belong to the sound track.

Apple also strips the last packet, but this is less of an issue for us, as we can afford to lose 1024 samples at the end for machine learning. But we cannot afford a great mismatch of the sound overall (due to two additional packets injected into the decoder).

Note this is different from the output delay we discussed, as the output delay requires dropping samples after decoding happens. This means we need to drop the first two packets from minimp4 before the decoding happens; otherwise, the generated sound mismatches greatly.

I searched through the atom list but couldn't find anything obvious that indicates the first two packets should be dropped. Do you know if there is any reason for that?

yongtang commented 4 years ago

@faroit I found out that if I drop the first packet (1024 samples), then the decoding seems to be fine as well (no noisy background). My guess is that this might be related to decoder delay. Still not exactly clear how it works or if there is a definitive number we could use, though.

lieff commented 4 years ago

@yongtang Can you share such an .m4a file? I can take a look.