marl / group_meetings

Notes and ideas for MARL group meetings

2018-04-24 #11

Closed: bmcfee closed this issue 6 years ago

bmcfee commented 6 years ago

ICASSP reca(ss)p?

On the hook: @lostanlen @jongwook @mcartwright

lostanlen commented 6 years ago

sure thing

lostanlen commented 6 years ago

Anna Kruspe and Masataka Goto. RETRIEVAL OF SONG LYRICS FROM SUNG QUERIES.

Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous. UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS

Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schatz, Gabriel Synnaeve, Emmanuel Dupoux. LEARNING FILTERBANKS FROM RAW SPEECH FOR PHONE RECOGNITION

lostanlen commented 6 years ago

For SONYC and BirdVox: Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins WEIGHTED AND MULTI-TASK LOSS FOR RARE AUDIO EVENT DETECTION

For BirdVox only (so most likely not in the MIR meeting itself): Zhao Zhao, Sai-Hua Zhang, Zhi-Yong Xu, Kristen Bellisario, Bryan C. Pijanowski. AUTOMATIC BIRD VOCALIZATION IDENTIFICATION BASED ON FUSION OF SPECTRAL PATTERN AND TEXTURE FEATURES

mcartwright commented 6 years ago

Hey, I'll send some tomorrow.

m

lostanlen commented 6 years ago

Robin Scheibler, Tokyo Metropolitan University, Japan; Eric Bezzam, EPFL-IC-LCAV, Switzerland; Ivan Dokmanic, University of Illinois at Urbana-Champaign, United States. PYROOMACOUSTICS: A PYTHON PACKAGE FOR AUDIO ROOM SIMULATION AND ARRAY PROCESSING ALGORITHMS

I bring this to your attention not so much for the techniques involved (room acoustics) as for the possibility of making it a muda dependency. This Python library synthesizes impulse responses given the dimensions of a room, the absorption properties of its surfaces, the location of the source(s), and the location of the microphone. Because this is a theoretical model, group delay can be estimated accurately, which is important for tasks such as beat tracking.
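
If we do explore the muda integration, here is a minimal sketch of the shoebox workflow (untested; the kwargs follow the current pyroomacoustics docs and may differ from the release presented at ICASSP):

```python
import pyroomacoustics as pra

fs = 22050

# A 6 m x 4 m x 3 m "shoebox" room with moderately absorbent walls.
# max_order bounds the image-source reflection depth.
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],
    fs=fs,
    materials=pra.Material(0.35),  # energy absorption coefficient
    max_order=17,
)

# One source and one microphone, positions in meters.
room.add_source([2.0, 3.0, 1.5])
room.add_microphone([4.0, 1.0, 1.2])

# Image-source simulation: one impulse response per (mic, source) pair.
room.compute_rir()
rir = room.rir[0][0]  # mic 0, source 0

# A muda-style deformation would then convolve a dry clip with this RIR,
# e.g. scipy.signal.fftconvolve(y, rir).
```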

jongwook commented 6 years ago

John Thickstun, Zaid Harchaoui, Dean Foster, Sham M. Kakade. INVARIANCES AND DATA AUGMENTATION FOR SUPERVISED MUSIC TRANSCRIPTION

The word 'invariance' here doesn't mean much more than that Conv2D exploits translation invariance in images, which becomes pitch-shift invariance in log-frequency spectrograms. The first layer is very similar to a CQT, but not exactly: the window doesn't get narrower at higher frequencies. There was a question of 'why not just use a CQT?', but I couldn't fully follow the answer, which concerned the parameters of librosa's cqt.

I think the log-frequency representation itself has no reason to be better than a CQT; the good performance was possible thanks to the MusicNet dataset and some novelty in the filters between layers 2-3 and 3-4.
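
For concreteness, a minimal sketch of the invariance in question, using librosa (the bundled example clip assumes librosa >= 0.8; the clip and the 2-semitone shift are illustrative): a pitch shift of k semitones is approximately a k-bin translation along the log-frequency axis, which is exactly the structure a Conv2D can exploit.

```python
import numpy as np
import librosa

# Any mono clip will do; librosa >= 0.8 ships example recordings.
y, sr = librosa.load(librosa.ex('trumpet'))

# Log-frequency representation: CQT magnitudes, 12 bins per octave.
C = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))

# Shift the pitch up by 2 semitones ...
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
C_up = np.abs(librosa.cqt(y_up, sr=sr, n_bins=84, bins_per_octave=12))

# ... which is approximately a 2-bin upward translation of the rows.
err = np.mean(np.abs(C_up[2:] - C[:-2])) / np.mean(C)
print(f"relative translation error: {err:.3f}")
```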

jongwook commented 6 years ago

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (paper, demo, slides)

The Tacotron 2 paper, for completeness.

mcartwright commented 6 years ago

Ok, here is my list of papers that looked interesting, or at least relevant enough that we should know about them. If I had to pick just one paper for us to read next week, though, I would vote for UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS, since it's very relevant to some work that the SONYC team might embark on very soon.

Sound Event Detection

KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES Anurag Kumar, Maksim Khadkevich, Christian Fugen

WEIGHTED AND MULTI-TASK LOSS FOR RARE AUDIO EVENT DETECTION Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins

A JOINT SEPARATION-CLASSIFICATION MODEL FOR SOUND EVENT DETECTION OF WEAKLY LABELLED DATA Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley

AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley

FACEBOOK ACOUSTIC EVENTS DATASET Haoqi Fan, Jiatong Zhou, Christian Fuegen

UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

LARGE-SCALE WEAKLY SUPERVISED AUDIO CLASSIFICATION USING GATED CONVOLUTIONAL NEURAL NETWORK Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

GAN-based source separation

GENERATIVE ADVERSARIAL SOURCE SEPARATION Y. Cem Subakan, Paris Smaragdis

ADVERSARIAL SEMI-SUPERVISED AUDIO SOURCE SEPARATION APPLIED TO SINGING VOICE EXTRACTION Daniel Stoller, Sebastian Ewert, Simon Dixon

SVSGAN: SINGING VOICE SEPARATION VIA GENERATIVE ADVERSARIAL NETWORK Zhe-Cheng Fan, Yen-Lin Lai, Jyh-Shing R. Jang

Speech Synthesis

LIP2AUDSPEC: SPEECH RECONSTRUCTION FROM SILENT LIP MOVEMENTS VIDEO Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

FFTNET: A REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER Zeyu Jin, Adam Finkelstein, Gautham J. Mysore, Jingwan Lu

NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu

Other

SINGING STYLE INVESTIGATION BY RESIDUAL SIAMESE CONVOLUTIONAL NEURAL NETWORKS Cheng-i Wang and George Tzanetakis

TOWARDS LANGUAGE-UNIVERSAL END-TO-END SPEECH RECOGNITION Suyoun Kim, Michael L. Seltzer

SEMI-RECURRENT CNN-BASED VAE-GAN FOR SEQUENTIAL DATA GENERATION Mohammad Akbari and Jie Liang

TASNET: TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME, SINGLE-CHANNEL SPEECH SEPARATION Yi Luo and Nima Mesgarani

lostanlen commented 6 years ago

Yeah the unsupervised paper from Google was 🔥

lostanlen commented 6 years ago

NB: the Facebook acoustic events paper was withdrawn. I think they had a problem with the dataset and had to postpone its release (I chaired that session).

jongwook commented 6 years ago

Adding a few more ...

LISTENING TO EACH SPEAKER ONE BY ONE WITH RECURRENT SELECTIVE HEARING NETWORKS Speaker voice separation by iteratively extracting one voice at a time, using an NN that predicts the mask and a stop flag.

END-TO-END SOUND SOURCE ENHANCEMENT USING DEEP NEURAL NETWORK IN THE MODIFIED DISCRETE COSINE TRANSFORM DOMAIN This one claimed to achieve better results than SEGAN by working in the MDCT domain.

MULTI-CHANNEL DEEP CLUSTERING: DISCRIMINATIVE SPECTRAL AND SPATIAL EMBEDDINGS FOR SPEAKER-INDEPENDENT SPEECH SEPARATION and ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING I wasn't familiar with deep clustering for source separation, so I'm including these two as a starting point for study (see the sketch of the objective after this list).

ADVANCING CONNECTIONIST TEMPORAL CLASSIFICATION WITH ATTENTION MODELING Similarly for CTC (connectionist temporal classification), which seemed quite popular among the speech recognition papers.

EXPLORING SPEECH ENHANCEMENT WITH GENERATIVE ADVERSARIAL NETWORKS FOR ROBUST SPEECH RECOGNITION Speech enhancement via GAN as a pre-processing step is effective, but only in the frequency domain (Google).
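
On the deep clustering papers above, for anyone else starting from scratch: the original objective (Hershey et al., ICASSP 2016) learns an embedding for each time-frequency bin such that bins dominated by the same source have similar embeddings. A minimal numpy sketch of the loss, with illustrative variable names:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective: ||V V^T - Y Y^T||_F^2.

    V : (T*F, D) embeddings, one row per time-frequency bin.
    Y : (T*F, C) one-hot source assignments (ideal binary mask).
    """
    # Expanding the Frobenius norm avoids forming the (T*F)^2 affinity
    # matrices: ||VV^T - YY^T||^2 = ||V^T V||^2 - 2||V^T Y||^2 + ||Y^T Y||^2.
    return (np.linalg.norm(V.T @ V) ** 2
            - 2 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```

The expanded form keeps every product at D x D, D x C, or C x C, which is what makes the loss tractable on full spectrograms.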

bmcfee commented 6 years ago

Here's another highlight-reel list: http://www.jordipons.me/my-icassp-2018-highlights/