veqtor opened this issue 7 years ago (status: Open)
It would be great to use the global conditioning, which is already implemented, to train on this dataset and see whether it can generate sounds from a given category. Those of you with a big machine, let us know if you try it out! Since there are a lot of categories, an embedding of the one-hot encoding may be needed, one that exploits the fact that the categories are hierarchical and therefore some of them are related.
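Something like this minimal sketch is what I have in mind; the names and the 32-channel embedding size are made up, and averaging the embeddings of a clip's labels is just one simple way to handle multi-label clips:

import tensorflow as tf

# Hypothetical sketch: map sparse AudioSet class IDs to a dense
# global-conditioning vector instead of a 632-dimensional one-hot vector.
NUM_CLASSES = 632   # size of the AudioSet ontology
GC_CHANNELS = 32    # embedding size, purely illustrative

# learnable embedding table, one row per class; hierarchically related
# classes can end up with similar rows after training
gc_embedding_table = tf.get_variable(
    'gc_embedding', [NUM_CLASSES, GC_CHANNELS],
    initializer=tf.random_normal_initializer())

# class id(s) attached to the current clip
class_ids = tf.placeholder(tf.int32, [None])

# average the label embeddings into a single conditioning vector
gc_vector = tf.reduce_mean(
    tf.nn.embedding_lookup(gc_embedding_table, class_ids), axis=0)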
On Wed., 8 March 2017 at 7:05, Göran Sandström (<notifications@github.com>) wrote:
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. https://research.google.com/audioset/
-- Quim Llimona http://lemonzi.me
Hi, I'm currently trying to understand the contents of the Google AudioSet. In particular, I want to understand/visualize the features that are being used, as I would eventually like to train a classifier that works on data I record myself. The paper only vaguely explains the mel spectrograms they use, and I can't seem to extract the features from the .tfrecord files to begin with. I am using the YouTube-8M starter code and have tried to modify it. Here's what I have so far:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import YouTubeVideo

audio_record = '/audioset_v1_embeddings/bal_train/_0.tfrecord'

vid_ids = []
labels = []
audio_embedding = []
start_time_seconds = []  # in seconds
end_time_seconds = []

# read every tf.train.Example in the record
for example in tf.python_io.tf_record_iterator(audio_record):
    tf_example = tf.train.Example.FromString(example)
    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)
    audio_embedding.append(tf_example.features.feature['audio_embedding'].bytes_list.value)

idx = 7  # test a random video
print('video ID', vid_ids[idx])
print('start_time:', np.array(start_time_seconds[idx]))
print('end_time:', np.array(end_time_seconds[idx]))
print('labels:')
print(np.array(labels[idx]))

def play_one_vid(record_name, video_index):
    return vid_ids[video_index]

# this worked on my local jupyter notebook:
YouTubeVideo(vid_ids[idx])
I am able to extract all the useful info I need (video ID, start time, end time, labels), but I cannot figure out how to visualize the features themselves. According to the AudioSet website (https://research.google.com/audioset/download.html), it is the 'audio_embedding' features that are of interest, but I haven't figured out the proper syntax for extracting them.
Also, if anyone has more information on the algorithm used for the mel-spectrogram representation in the set, that would be much appreciated.
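For what it's worth, my current rough approximation of that front end uses librosa; the parameters (16 kHz audio, 25 ms windows, 10 ms hops, 64 mel bands, stabilized log) are guesses based on the VGGish input description, not something I extracted from the .tfrecord files:

import numpy as np
import librosa

# rough log-mel front end; all parameter values are assumptions
y, sr = librosa.load('some_clip.wav', sr=16000)   # hypothetical local clip
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = np.log(mel + 0.01)   # shape: (64, n_frames), stabilized log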
Thanks
Solution to my own question:
audio_record = '/home/jerpint/features_audioset/audioset_v1_embeddings/eval/_1.tfrecord'

vid_ids = []
labels = []
start_time_seconds = []  # in seconds
end_time_seconds = []
feat_audio = []
count = 0

for example in tf.python_io.tf_record_iterator(audio_record):
    # the clip-level metadata is stored as a plain tf.train.Example
    tf_example = tf.train.Example.FromString(example)
    # print(tf_example)
    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)

    # the audio embeddings live in the feature_lists of a tf.train.SequenceExample,
    # which is why they are invisible when parsing the record as a plain Example
    tf_seq_example = tf.train.SequenceExample.FromString(example)
    n_frames = len(tf_seq_example.feature_lists.feature_list['audio_embedding'].feature)

    sess = tf.InteractiveSession()
    audio_frame = []
    # iterate through the frames: one uint8-encoded 128-dimensional vector per second
    for i in range(n_frames):
        audio_frame.append(tf.cast(tf.decode_raw(
            tf_seq_example.feature_lists.feature_list['audio_embedding'].feature[i].bytes_list.value[0],
            tf.uint8), tf.float32).eval())
    sess.close()

    feat_audio.append([])
    feat_audio[count].append(audio_frame)
    count += 1
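To actually look at the features, one quick option (just a sketch, assuming the loop above has been run) is to plot the decoded frames of a clip as an image; as far as I understand, each frame is one of the 128-dimensional embeddings described on the download page, quantized to 8 bits, and a 10-second clip gives roughly 10 of them:

import numpy as np
import matplotlib.pyplot as plt

emb = np.array(feat_audio[0][0])   # first clip, shape: (n_frames, 128)
plt.imshow(emb.T, aspect='auto', origin='lower')
plt.xlabel('frame (~1 per second)')
plt.ylabel('embedding dimension')
plt.colorbar()
plt.show()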
Would be interesting to use for GC (global conditioning). It also contains 1 Hz audio features that could be used for LC (local conditioning).
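A rough sketch of what using them for LC could look like; the sample rate and the nearest-neighbour upsampling are my own assumptions, not anything this repository implements:

import numpy as np

SAMPLE_RATE = 16000   # assumed audio sample rate
# pretend embeddings for one 10-second clip: 10 frames of 128 values at 1 Hz
emb = np.zeros((10, 128), dtype=np.float32)

# repeat each 1 Hz frame so that every audio sample has a conditioning vector
lc = np.repeat(emb, SAMPLE_RATE, axis=0)   # shape: (10 * 16000, 128)
# a learned transposed convolution would give a smoother upsampling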