Abstract
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance locality of interactions at the deeper layers of the encoder, which is the relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.
Keyword: self-supervised
Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech
Authors: Cheol Jun Cho, Peter Wu, Abdelrahman Mohamed, Gopala K. Anumanchipalli
Abstract
Numerous self-supervised learning (SSL) models for speech have been proposed for pre-training models of speech representations, and recent SSL models are very successful in diverse downstream tasks. To understand such utilities, previous works probe representations of speech models to reveal which speech-related information is encoded in the learned representations, and how. While encoding properties have been extensively explored from the perspective of acoustics, phonetics, and semantics, the physical grounding by speech production has not yet received full attention. To bridge this gap, we conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA). Our analysis is based on a linear probing approach where we measure articulatory score as an average correlation of linear mapping to EMA. We analyze a set of SSL models selected from the leaderboard of the SUPERB benchmark and perform further detailed analyses on two major models, Wav2Vec 2.0 and HuBERT. Surprisingly, representations from the recent speech SSL models are highly correlated with EMA traces (best: r = 0.81), and only 5 minutes were sufficient to train a linear model with high performance (r = 0.77). Our findings suggest that SSL models learn to closely align with continuous articulations and provide a novel insight into speech SSL.
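The linear probing score described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the least-squares fit, the held-out split, the use of Pearson correlation per channel, the function name articulatory_score, and the synthetic data (16 feature dimensions, 12 EMA channels) are all assumptions made for illustration.

```python
import numpy as np

def articulatory_score(features, ema, train_fraction=0.8):
    """Average Pearson correlation between linearly predicted and measured EMA channels.

    Hypothetical helper: fits a least-squares linear map on the first part of the
    data and scores it on the remaining held-out frames.
    """
    n_train = int(len(features) * train_fraction)
    # Append a constant column so the linear map includes an intercept.
    X = np.hstack([features, np.ones((len(features), 1))])
    W, *_ = np.linalg.lstsq(X[:n_train], ema[:n_train], rcond=None)
    pred = X[n_train:] @ W
    target = ema[n_train:]
    # Pearson r for each EMA channel, then the mean over channels.
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    r = (pc * tc).sum(axis=0) / (np.linalg.norm(pc, axis=0) * np.linalg.norm(tc, axis=0))
    return float(r.mean())

# Synthetic stand-ins for SSL features and EMA traces (assumed shapes).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))
true_map = rng.normal(size=(16, 12))
ema = features @ true_map + 0.1 * rng.normal(size=(500, 12))
score = articulatory_score(features, ema)
```

On real data, features would be frame-level representations from one layer of an SSL model, time-aligned with the EMA sensor trajectories.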
Self-Supervised Pretraining on Satellite Imagery: a Case Study on Label-Efficient Vehicle Detection
Abstract
In defense-related remote sensing applications, such as vehicle detection on satellite imagery, supervised learning requires a huge number of labeled examples to reach operational performance. Such data are challenging to obtain because labeling requires military experts, and some observables are intrinsically rare. This limited labeling capability, together with the large number of unlabeled images available due to the growing number of sensors, makes object detection on remote sensing imagery highly relevant for self-supervised learning. We study in-domain self-supervised representation learning for object detection on very high resolution optical satellite imagery, which remains poorly explored. For the first time to our knowledge, we study the problem of label efficiency on this task. We use the large land use classification dataset Functional Map of the World to pretrain representations with an extension of the Momentum Contrast framework. We then investigate this model's transferability on a real-world task of fine-grained vehicle detection and classification on Preligens proprietary data, which is designed to be representative of an operational use case of strategic site surveillance. We show that our in-domain self-supervised learning model is competitive with ImageNet pretraining, and outperforms it in the low-label regime.
Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
Authors: Jan Švec, Jan Lehečka, Luboš Šmídl
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
In recent years, the standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One of the very promising approaches is the grapheme Wav2Vec 2.0 model, which uses the self-supervised pretraining approach combined with transfer learning of the fine-tuned speech recognizer. Since it lacks the pronunciation vocabulary and language model, the approach is suitable for tasks where obtaining such models is not easy or almost impossible. In this paper, we use the Wav2Vec speech recognizer in the task of spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space in which the term occurrences and the assigned scores are easily computed. The paper describes a bootstrapping approach that allows the transfer of the knowledge contained in the traditional pronunciation vocabulary of the DNN-HMM hybrid ASR into the context of grapheme-based Wav2Vec. The proposed method outperforms the previously published system based on the combination of the DNN-HMM hybrid ASR and phoneme recognizer by a large margin on the MALACH data in both English and Czech.
Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings
Authors: Jan Švec, Luboš Šmídl, Josef V. Psutka, Aleš Pražák
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.
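The scoring step described in this abstract (dot-product in the embedding space, sigmoid calibration, localization from the probability sequence) can be sketched as follows. This is an illustrative sketch only: the embeddings are hand-written toy vectors rather than LSTM outputs, the scale and bias calibration parameters are hypothetical, and the thresholding rule in locate_term is an assumption, since the abstract does not specify how locations are derived from the probabilities.

```python
import numpy as np

def occurrence_probabilities(segment_embeddings, term_embedding, scale=1.0, bias=0.0):
    """Dot-product relevance scores calibrated to probabilities with a sigmoid.

    scale and bias are hypothetical calibration parameters.
    """
    scores = segment_embeddings @ term_embedding
    return 1.0 / (1.0 + np.exp(-(scale * scores + bias)))

def locate_term(probabilities, threshold=0.5):
    """Return (start, end) index pairs of consecutive segments at or above threshold.

    Assumed localization rule, chosen only for illustration.
    """
    spans = []
    start = None
    for i, p in enumerate(probabilities):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(probabilities) - 1))
    return spans

# Toy embeddings: four confusion-network segments and one searched term.
segments = np.array([[-1.0, 0.0], [2.0, 1.0], [1.5, 1.0], [-2.0, 0.5]])
term = np.array([1.0, 1.0])
probs = occurrence_probabilities(segments, term)
spans = locate_term(probs)  # segments 1 and 2 score above the threshold
```

In the method described above, the segment and term embeddings would come from two separate deep LSTM networks that project into the same space.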
Keyword: vision transformer
Boosting vision transformers for image retrieval
Abstract
Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance locality of interactions at the deeper layers of the encoder, which is the relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.
Face Pyramid Vision Transformer
Authors: Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing the computations. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields) to model lower-level edges to higher-level semantic primitives. Within the FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low-level facial information. The proposed FPVT is evaluated on seven benchmark datasets and compared with ten existing state-of-the-art methods, including CNNs, pure ViTs, and Convolutional ViTs. Despite fewer parameters, FPVT has demonstrated excellent performance over the compared methods. Project page is available at https://khawar-islam.github.io/fpvt/
Keyword: multimodal
Composing Ensembles of Pre-trained Models via Iterative Consensus
Authors: Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Igor Mordatch
Abstract
Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic photos but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. Project page: https://energy-based-model.github.io/composing-pretrained-models.
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Abstract
Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much of previous work has focused on studying their ability to learn meaning at the word-level, their ability to track syntactic dependencies between words has received less attention. We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer quantity) of pretraining data is essential. Additionally, the best performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
Keyword: CLIP
3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
Authors: Vivian Liu, Jo Vermeulen, George Fitzmaurice, Justin Matejka
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract
Text-to-image AI systems are capable of generating novel images for inspiration, but their applications for 3D design workflows and how designers can build 3D models using AI-provided inspiration are less understood. To investigate this, we integrated DALL-E, GPT-3, and CLIP within CAD software in 3DALL-E, a plugin that allows users to construct text and image prompts based on what they are modelling. In a study with 13 designers, we found that designers saw great potential to incorporate 3DALL-E into their workflows and to use text-to-image AI for reference images, renders, materials, and design considerations. Additionally, we elaborate on prompting patterns and provide measures of prompt complexity observed across participants. We conclude with a discussion of how 3DALL-E can merge with existing generative design workflows and propose prompt bibliographies as a form of human-AI design history.
GaitMAST: Motion-Aware Spatio-Temporal Feature Learning Network for Cross-View Gait Recognition
Abstract
As a unique biometric that can be perceived at a distance, gait has broad applications in person authentication, social security and so on. Existing gait recognition methods pay attention to extracting either spatial or spatiotemporal representations. However, they barely consider extracting diverse motion features, a fundamental characteristic in gaits, from gait sequences. In this paper, we propose a novel motion-aware spatiotemporal feature learning network for gait recognition, termed GaitMAST, which can unleash the potential of motion-aware features. In the shallow layer, specifically, we propose a dual-path frame-level feature extractor, in which one path extracts overall spatiotemporal features and the other extracts motion salient features by focusing on dynamic regions. In the deeper layers, we design a two-branch clip-level feature extractor, in which one focuses on fine-grained spatial information and the other on motion detail preservation. Consequently, our GaitMAST preserves the individual's unique walking patterns well, further enhancing the robustness of spatiotemporal features. Extensive experimental results on two commonly-used cross-view gait datasets demonstrate the superior performance of GaitMAST over existing state-of-the-art methods. On CASIA-B, our model achieves an average rank-1 accuracy of 94.1%. In particular, GaitMAST achieves rank-1 accuracies of 96.1% and 88.1% under the bag-carry and coat wearing conditions, respectively, outperforming the second best by a large margin and demonstrating its robustness against spatial variations.
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Authors: Yuechen Wang, Wengang Zhou, Houqiang Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotations for temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video as well as the complicated semantics in the sentence are lost during the learning. In this work, we propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the sentence and candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment by an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely-used benchmarks, ActivityNet-Captions and DiDeMo, where our FSAN achieves state-of-the-art performance.
Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards
Abstract
Derivative-free prompt learning has emerged as a lightweight alternative to prompt tuning, which only requires model inference to optimize the prompts. However, existing work did not take full advantage of the over-parameterized characteristics of large pre-trained language models (PLMs). In this paper, we propose Clip-Tuning, a simple yet effective method that adopts diverse frozen "thinned" networks of PLMs to obtain a mixture of rewards and thus advance the derivative-free prompt learning. The thinned networks consist of all the hidden units that survive a stationary dropout strategy, whose inference predictions reflect an ensemble of partial views over prompted training samples. Our method outperforms previous gradient-free prompt learning methods and achieves parity with gradient-based counterparts on seven language understanding benchmarks under few-shot settings.
Keyword: metric learning
There is no result
Keyword: image retrieval
Boosting vision transformers for image retrieval
Keyword: self-supervised
Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech
Self-Supervised Pretraining on Satellite Imagery: a Case Study on Label-Efficient Vehicle Detection
Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings
Keyword: vision transformer
Boosting vision transformers for image retrieval
Face Pyramid Vision Transformer
Keyword: multimodal
Composing Ensembles of Pre-trained Models via Iterative Consensus
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Keyword: CLIP
3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
GaitMAST: Motion-Aware Spatio-Temporal Feature Learning Network for Cross-View Gait Recognition
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards
Keyword: DALLE
There is no result