sure thing
Anna Kruspe and Masataka Goto. RETRIEVAL OF SONG LYRICS FROM SUNG QUERIES.
Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous. UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS
Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schatz, Gabriel Synnaeve, Emmanuel Dupoux. LEARNING FILTERBANKS FROM RAW SPEECH FOR PHONE RECOGNITION
For SONYC and BirdVox: Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins WEIGHTED AND MULTI-TASK LOSS FOR RARE AUDIO EVENT DETECTION
For BirdVox only (so most likely not for the MIR meeting itself): Zhao Zhao, Sai-Hua Zhang, Zhi-Yong Xu, Kristen Bellisario, Bryan C. Pijanowski. AUTOMATIC BIRD VOCALIZATION IDENTIFICATION BASED ON FUSION OF SPECTRAL PATTERN AND TEXTURE FEATURES
Hey, I'll send some tomorrow.
Robin Scheibler, Tokyo Metropolitan University, Japan; Eric Bezzam, EPFL-IC-LCAV, Switzerland; Ivan Dokmanic, University of Illinois at Urbana-Champaign, United States PYROOMACOUSTICS: A PYTHON PACKAGE FOR AUDIO ROOM SIMULATION AND ARRAY PROCESSING ALGORITHMS
I bring this to your attention not so much for the techniques involved (room acoustics) as for the possibility of making this a muda dependency. This Python library synthesizes impulse responses given the dimensions of a room, the absorption properties of its surfaces, the location of the source(s) in the room, and the location of the microphone. Because this is a theoretical model, the group delay can be estimated accurately, which is important for tasks such as beat tracking.
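For concreteness, here is a minimal sketch of the simulation pipeline described above. The room dimensions, absorption, and positions are arbitrary, and the keyword names follow the package's documented `ShoeBox` example (newer releases replace `absorption` with a `materials` argument):

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
dry = np.random.randn(fs)  # stand-in for a one-second audio clip

# A 5 m x 4 m x 3 m shoebox room with uniform wall absorption.
room = pra.ShoeBox([5.0, 4.0, 3.0], fs=fs, absorption=0.35, max_order=17)
room.add_source([2.0, 3.0, 1.5], signal=dry)
room.add_microphone_array(
    pra.MicrophoneArray(np.array([[1.0], [1.2], [1.4]]), room.fs)
)

room.compute_rir()
ir = room.rir[0][0]    # synthesized impulse response: mic 0, source 0
room.simulate()        # convolves the source signal with the RIRs
wet = room.mic_array.signals[0]
```

Something along these lines could presumably be wrapped as a muda deformation that convolves the input audio with a sampled RIR.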
John Thickstun, Zaid Harchaoui, Dean Foster, Sham M. Kakade INVARIANCES AND DATA AUGMENTATION FOR SUPERVISED MUSIC TRANSCRIPTION
Here, 'invariance' doesn't mean anything more than this: Conv2D exploits translation invariance in images, which becomes pitch-shift invariance in log-frequency spectrograms. The first layer is very similar to a CQT, but not exactly; in particular, the window doesn't get narrower at higher frequencies. Someone asked 'why not just use a CQT?', but I couldn't fully follow the answer, which had to do with the parameters of librosa's cqt.
I think the log-frequency representation itself has no reason to be better than a CQT; the good performance was possible thanks to the MusicNet dataset and some novelty in the filters between layers 2-3 and 3-4.
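As a quick sanity check of that translation/pitch-shift correspondence (my own sketch, not from the paper): with 12 CQT bins per octave, a shift of k semitones should show up as a roll of k bins along the frequency axis.

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(0, 2.0, 1.0 / sr)
y = np.sin(2 * np.pi * 440.0 * t)  # a pure A4 tone as a toy signal

# CQT with 12 bins per octave: one bin per semitone.
C = np.abs(librosa.cqt(y, sr=sr, bins_per_octave=12, n_bins=84))

# Shift up by 2 semitones and recompute.
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
C_up = np.abs(librosa.cqt(y_up, sr=sr, bins_per_octave=12, n_bins=84))

# Energy at bin b should now sit at bin b + 2: pitch shift ~ translation.
err = np.linalg.norm(C_up[2:] - C[:-2]) / np.linalg.norm(C[:-2])
print(f"relative error after aligning bins: {err:.2f}")
```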
Ok, here is my list of papers that looked interesting or at least relevant enough that we should know about them. If I had to pick just one paper for us to read next week though, I would vote for UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS, since it's very relevant to some work that the SONYC team might embark on very soon.
KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES Anurag Kumar, Maksim Khadkevich, Christian Fugen
WEIGHTED AND MULTI-TASK LOSS FOR RARE AUDIO EVENT DETECTION Huy Phan, Martin Krawczyk-Becker, Timo Gerkmann, and Alfred Mertins
A JOINT SEPARATION-CLASSIFICATION MODEL FOR SOUND EVENT DETECTION OF WEAKLY LABELLED DATA Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang Kong, Yong Xu, Wenwu Wang, Mark D. Plumbley
FACEBOOK ACOUSTIC EVENTS DATASET Haoqi Fan, Jiatong Zhou, Christian Fuegen
UNSUPERVISED LEARNING OF SEMANTIC AUDIO REPRESENTATIONS Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous
LARGE-SCALE WEAKLY SUPERVISED AUDIO CLASSIFICATION USING GATED CONVOLUTIONAL NEURAL NETWORK Yong Xu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley
GENERATIVE ADVERSARIAL SOURCE SEPARATION Y. Cem Subakan, Paris Smaragdis
ADVERSARIAL SEMI-SUPERVISED AUDIO SOURCE SEPARATION APPLIED TO SINGING VOICE EXTRACTION Daniel Stoller, Sebastian Ewert, Simon Dixon
SVSGAN: SINGING VOICE SEPARATION VIA GENERATIVE ADVERSARIAL NETWORK Zhe-Cheng Fan, Yen-Lin Lai, Jyh-Shing R. Jang
LIP2AUDSPEC: SPEECH RECONSTRUCTION FROM SILENT LIP MOVEMENTS VIDEO Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani
FFTNET: A REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER Zeyu Jin, Adam Finkelstein, Gautham J. Mysore, Jingwan Lu
NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
SINGING STYLE INVESTIGATION BY RESIDUAL SIAMESE CONVOLUTIONAL NEURAL NETWORKS Cheng-i Wang and George Tzanetakis
TOWARDS LANGUAGE-UNIVERSAL END-TO-END SPEECH RECOGNITION Suyoun Kim, Michael L. Seltzer
SEMI-RECURRENT CNN-BASED VAE-GAN FOR SEQUENTIAL DATA GENERATION Mohammad Akbari and Jie Liang
TASNET: TIME-DOMAIN AUDIO SEPARATION NETWORK FOR REAL-TIME, SINGLE-CHANNEL SPEECH SEPARATION Yi Luo and Nima Mesgarani
Yeah the unsupervised paper from Google was 🔥
NB: the Facebook acoustic events paper was withdrawn. I think they had a problem with the dataset and had to postpone its release (I chaired that session).
Adding a few more ...
LISTENING TO EACH SPEAKER ONE BY ONE WITH RECURRENT SELECTIVE HEARING NETWORKS Speaker separation by iteratively extracting one voice at a time, using a NN that predicts both the mask and a stop flag (see the sketch after this list).
END-TO-END SOUND SOURCE ENHANCEMENT USING DEEP NEURAL NETWORK IN THE MODIFIED DISCRETE COSINE TRANSFORM DOMAIN This one claimed better results than SEGAN by working in the MDCT domain.
MULTI-CHANNEL DEEP CLUSTERING: DISCRIMINATIVE SPECTRAL AND SPATIAL EMBEDDINGS FOR SPEAKER-INDEPENDENT SPEECH SEPARATION
ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING
I wasn't familiar with deep clustering for source separation, so I'm including these two as a starting point for study (see the loss sketch after this list).
ADVANCING CONNECTIONIST TEMPORAL CLASSIFICATION WITH ATTENTION MODELING Similarly, a starting point for CTC (connectionist temporal classification), which seemed to be quite popular among the speech recognition papers.
EXPLORING SPEECH ENHANCEMENT WITH GENERATIVE ADVERSARIAL NETWORKS FOR ROBUST SPEECH RECOGNITION Speech enhancement via a GAN as a pre-processing stage is effective, but only in the frequency domain (Google).
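Since the selective-hearing idea above is easy to state in code, here's a toy sketch of the iterative mask-and-stop loop. The `model` below is a hypothetical stand-in for the paper's recurrent network; this only illustrates the control flow, not the architecture:

```python
import numpy as np

def separate_one_by_one(mixture_spec, model, max_speakers=4, stop_threshold=0.5):
    """Iteratively peel one speaker at a time off a magnitude spectrogram.

    `model` is a hypothetical stand-in for the paper's recurrent network:
    given the current residual, it returns (mask, stop_probability).
    """
    residual = mixture_spec.copy()
    sources = []
    for _ in range(max_speakers):
        mask, stop_prob = model(residual)
        sources.append(mask * residual)     # extract the current speaker
        residual = (1.0 - mask) * residual  # remove it from the mixture
        if stop_prob > stop_threshold:      # network says no speakers remain
            break
    return sources

# Toy run: a dummy "model" that always claims half the energy, then stops.
dummy_model = lambda spec: (np.full_like(spec, 0.5), 0.9)
print(len(separate_one_by_one(np.abs(np.random.randn(257, 100)), dummy_model)))
```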
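And for the deep clustering entries: the core objective (from Hershey et al.'s original deep clustering paper, which the alternative-objectives paper builds on) learns a unit-norm embedding per time-frequency bin so that bins dominated by the same source have similar embeddings. A minimal numpy sketch of the loss, using the expansion that avoids forming the TF x TF affinity matrix:

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2, computed efficiently.

    V: (TF, D) unit-norm embeddings, one per time-frequency bin.
    Y: (TF, C) one-hot ideal assignments of each bin to a source.
    """
    return (
        np.linalg.norm(V.T @ V, "fro") ** 2
        - 2 * np.linalg.norm(V.T @ Y, "fro") ** 2
        + np.linalg.norm(Y.T @ Y, "fro") ** 2
    )

# Toy check: 1000 bins, 20-d embeddings, 2 sources.
rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 20))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(2)[rng.integers(0, 2, size=1000)]
print(deep_clustering_loss(V, Y))
```

At test time the embeddings are clustered (e.g., with k-means) to produce binary masks.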
Here's another highlight-reel list: http://www.jordipons.me/my-icassp-2018-highlights/
ICASSP reca(ss)p?
On the hook: @lostanlen @jongwook @mcartwright