sure thing
Anna Kruspe and Masataka Goto. RETRIEVAL OF SONG LYRICS FROM SUNG QUERIES.
Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous. UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS
Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schatz, Gabriel Synnaeve, Emmanuel Dupoux. LEARNING FILTERBANKS FROM RAW SPEECH FOR PHONE RECOGNITION
For SONYC and BirdVox: Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins WEIGHTED AND MULTI-TASK LOSS FOR RARE AUDIO EVENT DETECTION
For BirdVox only (so most likely not for the MIR meeting itself): Zhao Zhao, Sai-Hua Zhang, Zhi-Yong Xu, Kristen Bellisario, Bryan C. Pijanowski. AUTOMATIC BIRD VOCALIZATION IDENTIFICATION BASED ON FUSION OF SPECTRAL PATTERN AND TEXTURE FEATURES
Hey, I'll send some tomorrow.
Robin Scheibler, Tokyo Metropolitan University, Japan; Eric Bezzam, EPFL-IC-LCAV, Switzerland; Ivan Dokmanic, University of Illinois at Urbana-Champaign, United States PYROOMACOUSTICS: A PYTHON PACKAGE FOR AUDIO ROOM SIMULATION AND ARRAY PROCESSING ALGORITHMS
I bring this to your attention not so much for the techniques involved (room acoustics) as for the possibility of making this a muda dependency. This Python library synthesizes impulse responses given the dimensions of a room, the absorption properties of its surfaces, the location of the source(s) in the room, and the location of the microphone. Because this is a theoretical model, the group delay can be estimated accurately, which is important for tasks such as beat tracking.
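For concreteness, here is a minimal sketch of the simulation pipeline described above. The room dimensions, absorption, and positions are arbitrary, and the keyword names follow the package's documented `ShoeBox` example (newer releases replace `absorption` with a `materials` argument):

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
dry = np.random.randn(fs)  # stand-in for a one-second audio clip

# A 5 m x 4 m x 3 m shoebox room with uniform wall absorption.
room = pra.ShoeBox([5.0, 4.0, 3.0], fs=fs, absorption=0.35, max_order=17)
room.add_source([2.0, 3.0, 1.5], signal=dry)
room.add_microphone_array(
    pra.MicrophoneArray(np.array([[1.0], [1.2], [1.4]]), room.fs)
)

room.compute_rir()
ir = room.rir[0][0]    # synthesized impulse response: mic 0, source 0
room.simulate()        # convolves the source signal with the RIRs
wet = room.mic_array.signals[0]
```

Something along these lines could presumably be wrapped as a muda deformation that convolves the input audio with a sampled RIR.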
John Thickstun, Zaid Harchaoui, Dean Foster, Sham M. Kakade INVARIANCES AND DATA AUGMENTATION FOR SUPERVISED MUSIC TRANSCRIPTION
Here, 'invariance' doesn't mean anything more than this: Conv2D exploits translation invariance in images, which becomes pitch-shift invariance in log-frequency spectrograms. The first layer is very similar to a CQT, but not exactly; in particular, the window doesn't get narrower at higher frequencies. Someone asked 'why not just use a CQT?', but I couldn't fully follow the answer, which had to do with the parameters of librosa's cqt.
I think the log-frequency representation itself has no reason to be better than a CQT; the good performance was possible thanks to the MusicNet dataset and some novelty in the filters between layers 2-3 and 3-4.
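As a quick sanity check of that translation/pitch-shift correspondence (my own sketch, not from the paper): with 12 CQT bins per octave, a shift of k semitones should show up as a roll of k bins along the frequency axis.

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(0, 2.0, 1.0 / sr)
y = np.sin(2 * np.pi * 440.0 * t)  # a pure A4 tone as a toy signal

# CQT with 12 bins per octave: one bin per semitone.
C = np.abs(librosa.cqt(y, sr=sr, bins_per_octave=12, n_bins=84))

# Shift up by 2 semitones and recompute.
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
C_up = np.abs(librosa.cqt(y_up, sr=sr, bins_per_octave=12, n_bins=84))

# Energy at bin b should now sit at bin b + 2: pitch shift ~ translation.
err = np.linalg.norm(C_up[2:] - C[:-2]) / np.linalg.norm(C[:-2])
print(f"relative error after aligning bins: {err:.2f}")
```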
Ok, here is my list of papers that looked interesting or at least relevant enough that we should know about them. If I had to pick just one paper for us to read next week though, I would vote for UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS, since it's very relevant to some work that the SONYC team might embark on very soon.
KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES Anurag Kumar, Maksim Khadkevich, Christian Fugen
WEIGHTED AND MULTI-TASK LOSS FOR RARE AUDIO EVENT DETECTION Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins
A JOINT SEPARATION-CLASSIFICATION MODEL FOR SOUND EVENT DETECTION OF WEAKLY LABELLED DATA Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
FACEBOOK ACOUSTIC EVENTS DATASET Haoqi Fan, Jiatong Zhou, Christian Fuegen
UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous
LARGE-SCALE WEAKLY SUPERVISED AUDIO CLASSIFICATION USING GATED CONVOLUTIONAL NEURAL NETWORK Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
GENERATIVE ADVERSARIAL SOURCE SEPARATION Y. Cem Subakan, Paris Smaragdis
ADVERSARIAL SEMI-SUPERVISED AUDIO SOURCE SEPARATION APPLIED TO SINGING VOICE EXTRACTION Daniel Stoller, Sebastian Ewert, Simon Dixon
SVSGAN: SINGING VOICE SEPARATION VIA GENERATIVE ADVERSARIAL NETWORK Zhe-Cheng Fan, Yen-Lin Lai, Jyh-Shing R. Jang
LIP2AUDSPEC: SPEECH RECONSTRUCTION FROM SILENT LIP MOVEMENTS VIDEO Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani
FFTNET: A REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER Zeyu Jin, Adam Finkelstein, Gautham J. Mysore, Jingwan Lu
NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
SINGING STYLE INVESTIGATION BY RESIDUAL SIAMESE CONVOLUTIONAL NEURAL NETWORKS Cheng-i Wang and George Tzanetakis
TOWARDS LANGUAGE-UNIVERSAL END-TO-END SPEECH RECOGNITION Suyoun Kim, Michael L. Seltzer
SEMI-RECURRENT CNN-BASED VAE-GAN FOR SEQUENTIAL DATA GENERATION Mohammad Akbari and Jie Liang
TASNET: TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME, SINGLE-CHANNEL SPEECH SEPARATION Yi Luo and Nima Mesgarani
Yeah the unsupervised paper from Google was 🔥
NB: the Facebook acoustic events paper was withdrawn. I think they had a problem with the dataset and had to postpone its release (I chaired that session).
Adding a few more ...
LISTENING TO EACH SPEAKER ONE BY ONE WITH RECURRENT SELECTIVE HEARING NETWORKS Speaker separation by iteratively extracting one voice at a time, using a NN that predicts both the mask and a stop flag (see the sketch after this list).
END-TO-END SOUND SOURCE ENHANCEMENT USING DEEP NEURAL NETWORK IN THE MODIFIED DISCRETE COSINE TRANSFORM DOMAIN This one claimed better results than SEGAN by working in the MDCT domain.
MULTI-CHANNEL DEEP CLUSTERING: DISCRIMINATIVE SPECTRAL AND SPATIAL EMBEDDINGS FOR SPEAKER-INDEPENDENT SPEECH SEPARATION
ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING
I wasn't familiar with deep clustering for source separation, so I'm including these two as a starting point for study (see the loss sketch after this list).
ADVANCING CONNECTIONIST TEMPORAL CLASSIFICATION WITH ATTENTION MODELING Similarly, a starting point for CTC (connectionist temporal classification), which seemed to be quite popular among the speech recognition papers.
EXPLORING SPEECH ENHANCEMENT WITH GENERATIVE ADVERSARIAL NETWORKS FOR ROBUST SPEECH RECOGNITION Speech enhancement via a GAN as a pre-processing stage is effective, but only in the frequency domain (Google).
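Since the selective-hearing idea above is easy to state in code, here's a toy sketch of the iterative mask-and-stop loop. The `model` below is a hypothetical stand-in for the paper's recurrent network; this only illustrates the control flow, not the architecture:

```python
import numpy as np

def separate_one_by_one(mixture_spec, model, max_speakers=4, stop_threshold=0.5):
    """Iteratively peel one speaker at a time off a magnitude spectrogram.

    `model` is a hypothetical stand-in for the paper's recurrent network:
    given the current residual, it returns (mask, stop_probability).
    """
    residual = mixture_spec.copy()
    sources = []
    for _ in range(max_speakers):
        mask, stop_prob = model(residual)
        sources.append(mask * residual)     # extract the current speaker
        residual = (1.0 - mask) * residual  # remove it from the mixture
        if stop_prob > stop_threshold:      # network says no speakers remain
            break
    return sources

# Toy run: a dummy "model" that always claims half the energy, then stops.
dummy_model = lambda spec: (np.full_like(spec, 0.5), 0.9)
print(len(separate_one_by_one(np.abs(np.random.randn(257, 100)), dummy_model)))
```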
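And for the deep clustering entries: the core objective (from Hershey et al.'s original deep clustering paper, which the alternative-objectives paper builds on) learns a unit-norm embedding per time-frequency bin so that bins dominated by the same source have similar embeddings. A minimal numpy sketch of the loss, using the expansion that avoids forming the TF x TF affinity matrix:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2, computed efficiently.

    V: (TF, D) unit-norm embeddings, one per time-frequency bin.
    Y: (TF, C) one-hot ideal assignments of each bin to a source.
    """
    return (
        np.linalg.norm(V.T @ V, "fro") ** 2
        - 2 * np.linalg.norm(V.T @ Y, "fro") ** 2
        + np.linalg.norm(Y.T @ Y, "fro") ** 2
    )

# Toy check: 1000 bins, 20-d embeddings, 2 sources.
rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 20))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(2)[rng.integers(0, 2, size=1000)]
print(deep_clustering_loss(V, Y))
```

At test time the embeddings are clustered (e.g., with k-means) to produce binary masks.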
Here's another highlight-reel list: http://www.jordipons.me/my-icassp-2018-highlights/
ICASSP reca(ss)p?
On the hook: @lostanlen @jongwook @mcartwright