ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Google Audioset #237

Open veqtor opened 7 years ago

veqtor commented 7 years ago

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. https://research.google.com/audioset/

Would be interesting to use for global conditioning (GC). It also contains audio features at 1 Hz that could be used for local conditioning (LC).

lemonzi commented 7 years ago

It would be great to use the global conditioning, which is already implemented, to train on the dataset and see if it can generate sounds from a given category. Those with a big machine, let us know if you try it out! Since there are a lot of categories, an embedding of the one-hot encoding may be needed, one that exploits the fact that the categories are hierarchical and therefore some of them are related.
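Here is a rough, untested sketch (plain TensorFlow, independent of how GC is actually wired up in this repo) of what such an embedding could look like: sum the embedding of a class with the embeddings of its ancestors, so categories that share ancestors end up with related conditioning vectors. The class count, embedding size, and ontology links below are placeholders.

import tensorflow as tf

NUM_CLASSES = 632   # size of the AudioSet ontology
GC_CHANNELS = 32    # placeholder embedding size for the GC input

# Hypothetical parent links: class id -> list of ancestor class ids.
ancestors = {17: [3, 0]}

embedding_table = tf.get_variable(
    'gc_embedding', [NUM_CLASSES, GC_CHANNELS],
    initializer=tf.truncated_normal_initializer(stddev=0.1))

def gc_embedding_for(class_id):
    # Look up the class itself plus all of its ancestors and sum them,
    # so classes that share ancestors get related embeddings.
    ids = [class_id] + ancestors.get(class_id, [])
    vectors = tf.nn.embedding_lookup(embedding_table, ids)
    return tf.reduce_sum(vectors, axis=0)

gc_vector = gc_embedding_for(17)  # shape: [GC_CHANNELS]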


jerpint commented 7 years ago

Hi, I'm currently trying to understand the contents of the Google AudioSet. In particular, I want to understand/visualize the features being used, since I would eventually like to train a classifier that works on data I record myself. The paper only vaguely describes the mel spectrograms they use, and I can't seem to extract the features from the .tfrecord files to begin with. I am starting from the YouTube-8M starter code and have tried to modify it. Here's what I have so far:

import tensorflow as tf
import numpy as np
from IPython.display import YouTubeVideo

audio_record = '/audioset_v1_embeddings/bal_train/_0.tfrecord'

vid_ids = []
labels = []
audio_embedding = []
start_time_seconds = [] # in seconds
end_time_seconds = []
for example in tf.python_io.tf_record_iterator(audio_record):
    tf_example = tf.train.Example.FromString(example)

    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)
    audio_embedding.append(tf_example.features.feature['audio_embedding'].bytes_list.value)  # stays empty: audio_embedding is not in the context features

idx = 7 # test a random video

print('video ID',vid_ids[idx])
print('start_time:',np.array(start_time_seconds[idx]))
print('end_time:',np.array(end_time_seconds[idx]))

print('labels : ')
print(np.array(labels[idx]))

def play_one_vid(record_name, video_index):
    return vid_ids[video_index]
import matplotlib.pyplot as plt
# this worked on my local jupyter notebook:
YouTubeVideo(vid_ids[idx])

I am able to extract all the useful info I need (video ID, start time, end time); however, I cannot figure out how to visualize the features themselves. According to the AudioSet website (https://research.google.com/audioset/download.html), it's the 'audio_embedding' features that are of interest, but I haven't figured out the proper syntax for extracting them.

Also, if anyone has more information on the algorithm used for the mel-spectrogram representation of the set, that would be much appreciated.

Thanks
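For the mel-spectrogram question: the AudioSet download page describes the features as 128-dimensional embeddings produced by a VGG-like acoustic model from log-mel spectrograms; the commonly cited front-end settings are 25 ms windows, 10 ms hop, and 64 mel bands, grouped into non-overlapping 0.96 s patches. A rough, untested sketch of a comparable log-mel front end using librosa (not the exact Google pipeline; the file name and sample rate are placeholders):

import numpy as np
import librosa

# Placeholder file: AudioSet distributes embeddings, not the raw audio.
waveform, sr = librosa.load('example.wav', sr=16000)

mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr,
    n_fft=int(0.025 * sr),       # 25 ms window
    hop_length=int(0.010 * sr),  # 10 ms hop
    n_mels=64)
log_mel = np.log(mel + 0.01)     # small offset to avoid log(0)

# Group into non-overlapping 0.96 s patches (64 bands x 96 frames each).
patches = [log_mel[:, i:i + 96] for i in range(0, log_mel.shape[1] - 95, 96)]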

jerpint commented 7 years ago

Solution to my own question:

import tensorflow as tf

audio_record = '/home/jerpint/features_audioset/audioset_v1_embeddings/eval/_1.tfrecord'
vid_ids = []
labels = []
audio_embedding = []
start_time_seconds = [] # in seconds
end_time_seconds = []
feat_audio = []
count = 0
for example in tf.python_io.tf_record_iterator(audio_record):
    tf_example = tf.train.Example.FromString(example)
    #print(tf_example)
    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)

    # The audio embeddings live in the feature_lists of a SequenceExample
    # (not in the context features), one feature per ~1 second of audio.
    tf_seq_example = tf.train.SequenceExample.FromString(example)
    n_frames = len(tf_seq_example.feature_lists.feature_list['audio_embedding'].feature)

    sess = tf.InteractiveSession()
    audio_frame = []
    # iterate through frames: each one is a 128-dimensional embedding
    # stored as raw bytes, decoded to uint8 and cast to float32
    for i in range(n_frames):
        audio_frame.append(tf.cast(tf.decode_raw(
                tf_seq_example.feature_lists.feature_list['audio_embedding'].feature[i].bytes_list.value[0],tf.uint8)
                       ,tf.float32).eval())

    sess.close()
    feat_audio.append([])

    feat_audio[count].append(audio_frame)
    count+=1
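To actually visualize the decoded embeddings (the original goal), a small untested follow-up, reusing the feat_audio list filled in above:

import numpy as np
import matplotlib.pyplot as plt

# Stack the per-second embeddings of the first clip into an (n_frames, 128) array.
emb = np.array(feat_audio[0][0])

plt.imshow(emb.T, aspect='auto', origin='lower')
plt.xlabel('frame (~1 s each)')
plt.ylabel('embedding dimension')
plt.colorbar()
plt.show()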