Isolated source audio files occasionally are longer in duration than soundscape duration

auroracramer commented 4 years ago

Occasionally, when generating mixtures using only foreground events (no background) and saving the isolated events to disk, one of audio files for the isolated sources is slightly longer than the target soundscape duration. In this specific case, the target soundscape was to be 4 seconds at 16kHz (i.e. 64,000 samples). The anomalous isolated source file had 64,768 samples. In the original context I found this issue, I generated 20,000 soundscapes and 89 of them exhibited this behavior. In each case, only exactly one isolated source exhibited this issue. These audio files always had exactly 64,768 samples.

I've included a reproducible example in an attached zip.

scaper-debug.zip

Within the (unzipped) directory, the following can be run to replicate the issue:

import scaper
scaper.generate_from_jams('./soundscape.jams', './soundscape.wav', save_isolated_events=True)

./soundscape_events/foreground12_jackhammer.wav should be the anomalous file here.

I'm using Python 3.6.7 (on Ubuntu 18.04.4 LTS), and using the scaper version at commit d0431ec7b091d49709dfce25d149f6dfd0e982c8.

pseeth commented 4 years ago

Can repro. I added some stuff to the script to make it a bit more obvious if anyone else is having a go at this issue:

import scaper
import soundfile as sf
import jams
import numpy as np

jams_file = './soundscape.jams'
scaper.generate_from_jams(jams_file, './soundscape.wav', save_isolated_events=True)

jam = jams.load(jams_file)
ann = jam.annotations.search(namespace='scaper')[0]

soundscape_audio, sr = sf.read(ann.sandbox.scaper['soundscape_audio_path'])
isolated_event_audio_paths = ann.sandbox.scaper['isolated_events_audio_path']
isolated_audio = []

for event_spec, event_audio in zip(ann, isolated_event_audio_paths):
    # event_spec contains the event description, label, etc
    # event_audio contains the path to the actual audio
    # make sure the path matches the event description
    assert event_spec.value['role'] in event_audio
    assert event_spec.value['label'] in event_audio

    isolated_audio.append(sf.read(event_audio)[0])

# the sum of the isolated audio should sum to the soundscape
assert np.allclose(sum(isolated_audio), soundscape_audio)

This results in:

Traceback (most recent call last):
  File "mwe.py", line 25, in <module>
    assert sum(isolated_audio) == soundscape_audio
ValueError: operands could not be broadcast together with shapes (64000,) (64768,)

Looking into this now.

pseeth commented 4 years ago

It seems that type of audio file for the Jackhammer is different from all of the others in an adversarial fashion.

before ./soundscape_events/foreground12_jackhammer.wav
samplerate: 16000 Hz
channels: 1
duration: 4.031 s
format: WAV (Microsoft) [WAV]
subtype: Microsoft ADPCM [MS_ADPCM] ./soundscape_events/foreground12_jackhammer.wav (64500,)
after ./soundscape_events/foreground12_jackhammer.wav
samplerate: 16000 Hz
channels: 1
duration: 4.031 s
format: WAV (Microsoft) [WAV]
subtype: Microsoft ADPCM [MS_ADPCM] ./soundscape_events/foreground12_jackhammer.wav (64000,)
readback ./soundscape_events/foreground12_jackhammer.wav
samplerate: 16000 Hz
channels: 1
duration: 4.031 s
format: WAV (Microsoft) [WAV]
subtype: Microsoft ADPCM [MS_ADPCM] ./soundscape_events/foreground12_jackhammer.wav (64768,)

Here I'm just printing out the audio_info, and shapes inside match_sample_length. Here's what this same printout looks like for a file that has no anomalies:

before ./soundscape_events/foreground13_gun_shot.wav
samplerate: 16000 Hz
channels: 1
duration: 4.000 s
format: WAV (Microsoft) [WAV]
subtype: Signed 16 bit PCM [PCM_16] ./soundscape_events/foreground13_gun_shot.wav (64000,)
after ./soundscape_events/foreground13_gun_shot.wav
samplerate: 16000 Hz
channels: 1
duration: 4.000 s
format: WAV (Microsoft) [WAV]
subtype: Signed 16 bit PCM [PCM_16] ./soundscape_events/foreground13_gun_shot.wav (64000,)
readback ./soundscape_events/foreground13_gun_shot.wav
samplerate: 16000 Hz
channels: 1
duration: 4.000 s
format: WAV (Microsoft) [WAV]
subtype: Signed 16 bit PCM [PCM_16] ./soundscape_events/foreground13_gun_shot.wav (64000,)

pseeth commented 4 years ago

The bug seems to literally be here:

https://github.com/justinsalamon/scaper/blob/d0431ec7b091d49709dfce25d149f6dfd0e982c8/scaper/audio.py#L150

The shape of that audio data array is 64000 when writing to the audio file. But then when we write it to disk using soundfile it suddenly becomes 64768. Bizarre!

So I literally just tried running this on that specific audio file (I moved it first two directories up and then ffmpeg it back down):

ffmpeg -i ../../88466-7-0-0.wav 88466-7-0-0.wav

I tried just fixing the source audio file prior to feeding it into Scaper using the statement above.. The output for my print statements now looks like this:

before ./soundscape_events/foreground12_jackhammer.wav
samplerate: 16000 Hz
channels: 1
duration: 4.000 s
format: WAV (Microsoft) [WAV]
subtype: Signed 16 bit PCM [PCM_16] ./soundscape_events/foreground12_jackhammer.wav (64000,)
after ./soundscape_events/foreground12_jackhammer.wav
samplerate: 16000 Hz
channels: 1
duration: 4.000 s
format: WAV (Microsoft) [WAV]
subtype: Signed 16 bit PCM [PCM_16] ./soundscape_events/foreground12_jackhammer.wav (64000,)
./soundscape_events/foreground12_jackhammer.wav: adjusted from 64000 to 64000

and the script I pasted above passes for the exact same JAMS file (but with a fixed 88466-7-0-0.wav file).

Also the file size for the WAV file changed from 37kb to 147kb on my machine. I think how Scaper, Sox, and Soundfile interact with all the different types of sound files needs some more investigation.

So my recommendation right now: normalize all of your data via ffmpeg first so that every single audio file you're trying to mix has the same sample rate, is a .wav file, and has the subtype Signed 16 bit PCM. We should perhaps find subtypes that don't work and throw a warning.

pseeth commented 4 years ago

Tiny demo showing that it's in Soundfile:

import soundfile as sf
import numpy as np
import tempfile

_format = 'WAV'
_subtype = 'MS_ADPCM'

sr = 16000

for l in [1600, 16000, 32000, 64000, 100000, 128000]:
    original_shape = (l,)
    audio = np.random.randn(*original_shape)

    with tempfile.NamedTemporaryFile(suffix='.wav', delete=True) as tmpfile:
        sf.write(tmpfile.name, audio, sr, subtype=_subtype, format=_format)
        audio, sr = sf.read(tmpfile.name)
        print('What I got back from sf.read\t', audio.shape)
        print('What I meant to write to disk\t', original_shape)
        print()

Output here is:

What I got back from sf.read     (2024,)
What I meant to write to disk    (1600,)

What I got back from sf.read     (16192,)
What I meant to write to disk    (16000,)

What I got back from sf.read     (32384,)
What I meant to write to disk    (32000,)

What I got back from sf.read     (64768,)
What I meant to write to disk    (64000,)

What I got back from sf.read     (100188,)
What I meant to write to disk    (100000,)

What I got back from sf.read     (128524,)
What I meant to write to disk    (128000,)

pseeth commented 4 years ago

If you look at the current test for match_sample_length, you can see the formats and subtypes that we actually test on (it's not all of them):

https://github.com/justinsalamon/scaper/blob/d0431ec7b091d49709dfce25d149f6dfd0e982c8/tests/test_audio.py#L55-L71

So while WAV is in there, MS_ADPCM is not one of the subtypes we test the function on. So I guess make sure your source audio going into Scaper is in those tested lists for now.

pseeth commented 4 years ago

After some offline discussion with @jtcramer, I wanted to note for the record that the anomalous files described here are in UrbanSound8k. So if you are using UrbanSound8k and Scaper for source separation purposes, you should make everything uniform before passing it into Scaper, otherwise you'll have some (small) number of bad files at the moment.

justinsalamon commented 4 years ago

thanks @jtcramer for raising and thanks @pseeth for the great analysis!

If the problem is in soundfile then indeed there's little we can do to actually address, but I agree we should issue a warning. We should also add a paragraph to the source material section of the tutorial (right at the beginning) to recommend that people pre-process their material before feeding it into scaper.

@pseeth u PR, i CR?

pseeth commented 4 years ago

Yeah, I'll put in a PR. Can we discuss where the warning should go? Should it go into match_sample_length? It could be predictive, where we check the subtype and format and see if it's in the accepted_subtypes and accepted_formats list (those could move out of the tests and into core as constants). So that check could go at the top of this function after we get audio_info:

https://github.com/justinsalamon/scaper/blob/d0431ec7b091d49709dfce25d149f6dfd0e982c8/scaper/audio.py#L110-L151

Or, we could add stuff at the end of the function to read back the "matched" audio file after writing it. So one more line after this would be

readback_audio, sr = sf.read(audio_path)
if readback_audio.shape[0] != duration_in_samples:
  warnings.warn(ScaperWarning, "blahblahblah"

This would have the cost of reading back the file every time, though. But it would be format/subtype agnostic.

justinsalamon commented 4 years ago

I wonder if something more complete would be to scan the FG/BG library upon initialization of the scaper object, then issue a warning if there are files that aren't .wav or 16bit etc.

Warning: Scaper has found source audio files whose format is different from Signed 16 bit PCM WAV. In rare occasions this could lead to saved isolated events containing an incorrect number of samples due to a known issue with pysoundfile. The recommended course of action is to pre-process your source audio so that all files are signed 16 bit PCM wav using scaper.preprocess_source_material()

We can implement the preprocess function to make life easy for people, assuming scanning the source library can be done efficiently.

On the other hand if this issue only impacts separated sources, we would end up issuing this warning a lot also for people who won't be affected by it. Not sure what's the best option, minimalism vs completeness.

pseeth commented 4 years ago

Should we do this at init or in a separate function? scaper.check_source_material(fg_directory, bg_directory).

One issue I see with doing it at init is that when I'm generating soundscapes in parallel, I have a new Scaper object for every scene that I use to generate. Each Scaper object is given a different random seed. So I think that checking at init would cause every Scaper object (often 20k of them) to check the same foreground/background library over and over.

justinsalamon commented 4 years ago

Two options:

Do outside init (e.g. via a specific function user must call) and encourage user to run manually as part of their scaper scripts
Do inside init but allow user to disable the scan

Thoughts?

pseeth commented 4 years ago

Inside init with the scan disabled sounds like best of both worlds. I think the actual scan should also be a top-level function - not within the Scaper object, but rather at scaper.scan_source_material and scaper.preprocess_source_material. The Scaper object can then call the scan function and recommend the user do the preprocess function.

justinsalamon commented 4 years ago

I think the actual scan should also be a top-level function - not within the Scaper object, but rather at scaper.scan_source_material and scaper.preprocess_source_material. The Scaper object can then call the scan function and recommend the user do the preprocess function.

Agreed.

Inside init with the scan disabled sounds like best of both worlds.

Not sure about this thought. If it's hidden inside the Scaper init but disabled by default, it'll most likely get ignored by the majority of users. It might be preferable to have it enabled by default and issue a message when it runs along the lines of:

Scanning source material for audio format problems. You can disable this scan by setting scaper.Scaper(..., scan_source_material=False). You can execute the scan manually via scaper.scan_source_material(fg_path, bg_path).

?

(this way the user always learns about it and can disable in their own code if they don't want it), or, and perhaps this is the better option, we don't run it by default but we explain and encourage users to do so in the tutorial? We'd have to update all examples (including on the repo README).

justinsalamon / scaper

Isolated source audio files occasionally are longer in duration than soundscape duration #94