Abstract
Recent years, graph contrastive learning (GCL), which aims to learn representations from unlabeled graphs, has made great progress. However, the existing GCL methods mostly adopt human-designed graph augmentations, which are sensitive to various graph datasets. In addition, the contrastive losses originally developed in computer vision have been directly applied to graph data, where the neighboring nodes are regarded as negatives and consequently pushed far apart from the anchor. However, this is contradictory with the homophily assumption of networks that connected nodes often belong to the same class and should be close to each other. In this work, we propose an end-to-end automatic GCL method, named NCLA to apply neighbor contrastive learning on learnable graph augmentation. Several graph augmented views with adaptive topology are automatically learned by the multi-head graph attention mechanism, which can be compatible with various graph datasets without prior domain knowledge. In addition, a neighbor contrastive loss is devised to allow multiple positives per anchor by taking network topology as the supervised signals. Both augmentations and embeddings are learned end-to-end in the proposed NCLA. Extensive experiments on the benchmark datasets demonstrate that NCLA yields the state-of-the-art node classification performance on self-supervised GCL and even exceeds the supervised ones, when the labels are extremely limited. Our code is released at https://github.com/shenxiaocam/NCLA.
Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers
Authors: Haojie Yu, Kang Zhao, Xiaoming Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Vision Transformer (ViT) suffers from data scarcity in semi-supervised learning (SSL). To alleviate this issue, inspired by masked autoencoder (MAE), which is a data-efficient self-supervised learner, we propose Semi-MAE, a pure ViT-based SSL framework consisting of a parallel MAE branch to assist the visual representation learning and make the pseudo labels more accurate. The MAE branch is designed as an asymmetric architecture consisting of a lightweight decoder and a shared-weights encoder. We feed the weakly-augmented unlabeled data with a high masking ratio to the MAE branch and reconstruct the missing pixels. Semi-MAE achieves 75.9% top-1 accuracy on ImageNet with 10% labels, surpassing prior state-of-the-art in semi-supervised image classification. In addition, extensive experiments demonstrate that Semi-MAE can be readily used for other ViT models and masked image modeling methods.
MoBYv2AL: Self-supervised Active Learning for Image Classification
Authors: Razvan Caramalau, Binod Bhattarai, Danail Stoyanov, Tae-Kyun Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Active learning(AL) has recently gained popularity for deep learning(DL) models. This is due to efficient and informative sampling, especially when the learner requires large-scale labelled datasets. Commonly, the sampling and training happen in stages while more batches are added. One main bottleneck in this strategy is the narrow representation learned by the model that affects the overall AL selection. We present MoBYv2AL, a novel self-supervised active learning framework for image classification. Our contribution lies in lifting MoBY, one of the most successful self-supervised learning algorithms, to the AL pipeline. Thus, we add the downstream task-aware objective function and optimize it jointly with contrastive loss. Further, we derive a data-distribution selection function from labelling the new examples. Finally, we test and study our pipeline robustness and performance for image classification tasks. We successfully achieved state-of-the-art results when compared to recent AL methods. Code available: https://github.com/razvancaramalau/MoBYv2AL
Learning Decorrelated Representations Efficiently Using Fast Fourier Transform
Abstract
Barlow Twins and VICReg are self-supervised representation learning models that use regularizers to decorrelate features. Although they work as well as conventional representation learning models, their training can be computationally demanding if the dimension of projected representations is high; as these regularizers are defined in terms of individual elements of a cross-correlation or covariance matrix, computing the loss for $d$-dimensional projected representations of $n$ samples takes $O(n d^2)$ time. In this paper, we propose a relaxed version of decorrelating regularizers that can be computed in $O(n d\log d)$ time by the fast Fourier transform. We also propose an inexpensive trick to mitigate the undesirable local minima that develop with the relaxation. Models learning representations using the proposed regularizers show comparable accuracy to existing models in downstream tasks, whereas the training requires less memory and is faster when $d$ is large.
Keyword: vision transformer
Explainability and Robustness of Deep Visual Classification Models
Authors: Jindong Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In the computer vision community, Convolutional Neural Networks (CNNs), first proposed in the 1980's, have become the standard visual classification model. Recently, as alternatives to CNNs, Capsule Networks (CapsNets) and Vision Transformers (ViTs) have been proposed. CapsNets, which were inspired by the information processing of the human brain, are considered to have more inductive bias than CNNs, whereas ViTs are considered to have less inductive bias than CNNs. All three classification models have received great attention since they can serve as backbones for various downstream tasks. However, these models are far from being perfect. As pointed out by the community, there are two weaknesses in standard Deep Neural Networks (DNNs). One of the limitations of DNNs is the lack of explainability. Even though they can achieve or surpass human expert performance in the image classification task, the DNN-based decisions are difficult to understand. In many real-world applications, however, individual decisions need to be explained. The other limitation of DNNs is adversarial vulnerability. Concretely, the small and imperceptible perturbations of inputs can mislead DNNs. The vulnerability of deep neural networks poses challenges to current visual classification models. The potential threats thereof can lead to unacceptable consequences. Besides, studying model adversarial vulnerability can lead to a better understanding of the underlying models. Our research aims to address the two limitations of DNNs. Specifically, we focus on deep visual classification models, especially the core building parts of each classification model, e.g. dynamic routing in CapsNets and self-attention module in ViTs.
Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers
Authors: Haojie Yu, Kang Zhao, Xiaoming Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Vision Transformer (ViT) suffers from data scarcity in semi-supervised learning (SSL). To alleviate this issue, inspired by masked autoencoder (MAE), which is a data-efficient self-supervised learner, we propose Semi-MAE, a pure ViT-based SSL framework consisting of a parallel MAE branch to assist the visual representation learning and make the pseudo labels more accurate. The MAE branch is designed as an asymmetric architecture consisting of a lightweight decoder and a shared-weights encoder. We feed the weakly-augmented unlabeled data with a high masking ratio to the MAE branch and reconstruct the missing pixels. Semi-MAE achieves 75.9% top-1 accuracy on ImageNet with 10% labels, surpassing prior state-of-the-art in semi-supervised image classification. In addition, extensive experiments demonstrate that Semi-MAE can be readily used for other ViT models and masked image modeling methods.
Keyword: metric learning
There is no result
Keyword: image retrieval
There is no result
Keyword: self-supervised
Neighbor Contrastive Learning on Learnable Graph Augmentation
Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers
MoBYv2AL: Self-supervised Active Learning for Image Classification
Learning Decorrelated Representations Efficiently Using Fast Fourier Transform
Keyword: vision transformer
Explainability and Robustness of Deep Visual Classification Models
Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers
Keyword: multimodal
There is no result
Keyword: CLIP
There is no result
Keyword: DALLE
There is no result