New submissions for Thursday, 13 June 2024 (showing 438 of 438 entries )

Keyword: differential privacy

Label Smoothing Improves Machine Unlearning

Authors: Zonglin Di, Zhaowei Zhu, Jinghan Jia, Jiancheng Liu, Zafar Takhirov, Bo Jiang, Yuanshun Yao, Sijia Liu, Yang Liu
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07698
Pdf link: https://arxiv.org/pdf/2406.07698
Abstract The objective of machine unlearning (MU) is to eliminate previously learned data from a model. However, it is challenging to strike a balance between computation cost and performance when using existing MU techniques. Taking inspiration from the influence of label smoothing on model confidence and differential privacy, we propose a simple gradient-based MU approach that uses an inverse process of label smoothing. This work introduces UGradSL, a simple, plug-and-play MU approach that uses smoothed labels. We provide theoretical analyses demonstrating why properly introducing label smoothing improves MU performance. We conducted extensive experiments on six datasets of various sizes and different modalities, demonstrating the effectiveness and robustness of our proposed method. The consistent improvement in MU performance is only at a marginal cost of additional computations. For instance, UGradSL improves over the gradient ascent MU baseline by 66% unlearning accuracy without sacrificing unlearning efficiency.
DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report)
Authors: Yiping Wang, Yanhao Wang, Cen Chen
Subjects: Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07953
Pdf link: https://arxiv.org/pdf/2406.07953
Abstract The sliding window model of computation captures scenarios in which data are continually arriving in the form of a stream, and only the most recent $w$ items are used for analysis. In this setting, an algorithm needs to accurately track some desired statistics over the sliding window using a small space. When data streams contain sensitive information about individuals, the algorithm is also urgently needed to provide a provable guarantee of privacy. In this paper, we focus on the two fundamental problems of privately (1) estimating the frequency of an arbitrary item and (2) identifying the most frequent items (i.e., \emph{heavy hitters}), in the sliding window model. We propose \textsc{DPSW-Sketch}, a sliding window framework based on the count-min sketch that not only satisfies differential privacy over the stream but also approximates the results for frequency and heavy-hitter queries within bounded errors in sublinear time and space w.r.t.~$w$. Extensive experiments on five real-world and synthetic datasets show that \textsc{DPSW-Sketch} provides significantly better utility-privacy trade-offs than state-of-the-art methods.
Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning
Authors: Dariush Wahdany, Matthew Jagielski, Adam Dziedzic, Franziska Boenisch
Subjects: Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.08039
Pdf link: https://arxiv.org/pdf/2406.08039
Abstract Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differential private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high privacy ($\varepsilon\le1)$ and low data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of pure DP. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalancedness regimes demonstrate DPPL's high performance under strong privacy guarantees in challenging private learning setups.
Keyword: privacy

An Effective Approach to Scramble Multiple Diagnostic Imageries Using Chaos-Based Cryptography
Authors: Dr Chandra Sekhar Sanaboina, Tejaswini Yadla
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.07560
Pdf link: https://arxiv.org/pdf/2406.07560
Abstract Medical image encryption could aid in preserving patient privacy. In this article, we provide a chaotic system-based medical picture encryption method. The diffusion and permutation architecture was used. The permutation based on plain image and chaotic keys is offered to shuffle the plain picture's pixels to other rows and columns, weakening the strong connections between neighboring pixels. Diffusion is suggested to spread small changes of plain images to all of the pixels in cipher images to enhance the encryption effect. We analyze the chaotic behavior of the proposed system using various techniques and tests such as bifurcation plots, Lyapunov exponents, MSE, PSNR tests, and histogram analysis.
Guardians of Anonymity: Exploring Tactics to Combat Cyber Threats in Onion Routing Environments
Authors: Karwan Mustafa Kareem
Subjects: Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2406.07563
Pdf link: https://arxiv.org/pdf/2406.07563
Abstract Onion routing networks, also known as darknets, are private networks that enable anonymous communication over the Internet. They are used by individuals and organizations to protect their privacy, but they also attract cybercriminals who exploit the anonymity provided by these networks for illegal activities. This paper comprehensively analyzes cybercrime threats and countermeasures in onion routing networks. We review the various types of cybercrime that occur in these networks, including drug trafficking, fraud, hacking, and other illicit activities. We then discuss the challenges associated with detecting and mitigating cybercrime in onion routing networks, such as the difficulty of tracing illegal activities back to their source due to the strong anonymity guarantees provided by these networks. We also explore the countermeasures that have been proposed and implemented to combat cybercrime in onion routing networks, including law enforcement efforts, technological solutions, and policy interventions. Finally, we highlight the limitations of existing countermeasures and identify potential directions for future research in this area, including the need for interdisciplinary approaches that combine technical, legal, and social perspectives to effectively combat cybercrime in onion routing networks.
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Authors: Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.07594
Pdf link: https://arxiv.org/pdf/2406.07594
Abstract Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.
Adversarial Machine Unlearning
Authors: Zonglin Di, Sixie Yu, Yevgeniy Vorobeychik, Yang Liu
Subjects: Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.07687
Pdf link: https://arxiv.org/pdf/2406.07687
Abstract This paper focuses on the challenge of machine unlearning, aiming to remove the influence of specific training data on machine learning models. Traditionally, the development of unlearning algorithms runs parallel with that of membership inference attacks (MIA), a type of privacy threat to determine whether a data instance was used for training. However, the two strands are intimately connected: one can view machine unlearning through the lens of MIA success with respect to removed data. Recognizing this connection, we propose a game-theoretic framework that integrates MIAs into the design of unlearning algorithms. Specifically, we model the unlearning problem as a Stackelberg game in which an unlearner strives to unlearn specific training data from a model, while an auditor employs MIAs to detect the traces of the ostensibly removed data. Adopting this adversarial perspective allows the utilization of new attack advancements, facilitating the design of unlearning algorithms. Our framework stands out in two ways. First, it takes an adversarial approach and proactively incorporates the attacks into the design of unlearning algorithms. Secondly, it uses implicit differentiation to obtain the gradients that limit the attacker's success, thus benefiting the process of unlearning. We present empirical results to demonstrate the effectiveness of the proposed approach for machine unlearning.
Label Smoothing Improves Machine Unlearning
Authors: Zonglin Di, Zhaowei Zhu, Jinghan Jia, Jiancheng Liu, Zafar Takhirov, Bo Jiang, Yuanshun Yao, Sijia Liu, Yang Liu
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07698
Pdf link: https://arxiv.org/pdf/2406.07698
Abstract The objective of machine unlearning (MU) is to eliminate previously learned data from a model. However, it is challenging to strike a balance between computation cost and performance when using existing MU techniques. Taking inspiration from the influence of label smoothing on model confidence and differential privacy, we propose a simple gradient-based MU approach that uses an inverse process of label smoothing. This work introduces UGradSL, a simple, plug-and-play MU approach that uses smoothed labels. We provide theoretical analyses demonstrating why properly introducing label smoothing improves MU performance. We conducted extensive experiments on six datasets of various sizes and different modalities, demonstrating the effectiveness and robustness of our proposed method. The consistent improvement in MU performance is only at a marginal cost of additional computations. For instance, UGradSL improves over the gradient ascent MU baseline by 66% unlearning accuracy without sacrificing unlearning efficiency.
Regularizing and Aggregating Clients with Class Distribution for Personalized Federated Learning
Authors: Gyuejeong Lee, Daeyoung Choi
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2406.07800
Pdf link: https://arxiv.org/pdf/2406.07800
Abstract Personalized federated learning (PFL) enables customized models for clients with varying data distributions. However, existing PFL methods often incur high computational and communication costs, limiting their practical application. This paper proposes a novel PFL method, Class-wise Federated Averaging (cwFedAVG), that performs Federated Averaging (FedAVG) class-wise, creating multiple global models per class on the server. Each local model integrates these global models weighted by its estimated local class distribution, derived from the L2-norms of deep network weights, avoiding privacy violations. Afterward, each global model does the same with local models using the same method. We also newly designed Weight Distribution Regularizer (WDR) to further enhance the accuracy of estimating a local class distribution by minimizing the Euclidean distance between the class distribution and the weight norms' distribution. Experimental results demonstrate that cwFedAVG matches or outperforms several existing PFL methods. Notably, cwFedAVG is conceptually simple yet computationally efficient as it mitigates the need for extensive calculation to collaborate between clients by leveraging shared global models. Visualizations provide insights into how cwFedAVG enables local model specialization on respective class distributions while global models capture class-relevant information across clients.
Small Scale Data-Free Knowledge Distillation
Authors: He Liu, Yikai Wang, Huaping Liu, Fuchun Sun, Anbang Yao
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07876
Pdf link: https://arxiv.org/pdf/2406.07876
Abstract Data-free knowledge distillation is able to utilize the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, avoiding privacy, security, and proprietary risks in real applications. In this line of research, existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network on-the-fly trained with the guidance of the pre-trained teacher network is used to synthesize a large-scale sample set for knowledge distillation. In this paper, we reexamine this common data-free knowledge distillation paradigm, showing that there is considerable room to improve the overall training efficiency through a lens of ``small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of how to balance class distributions in terms of synthetic sample diversity and difficulty during both data inversion and distillation processes, we propose Small Scale Data-free Knowledge Distillation SSD-KD. In formulation, SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples, facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result, SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g., 10X less than the original training data scale), making the overall training efficiency one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance, as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at this https URL.
GENIU: A Restricted Data Access Unlearning for Imbalanced Data
Authors: Chenhao Zhang, Shaofei Shen, Yawen Zhao, Weitong Tony Chen, Miao Xu
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07885
Pdf link: https://arxiv.org/pdf/2406.07885
Abstract With the increasing emphasis on data privacy, the significance of machine unlearning has grown substantially. Class unlearning, which involves enabling a trained model to forget data belonging to a specific class learned before, is important as classification tasks account for the majority of today's machine learning as a service (MLaaS). Retraining the model on the original data, excluding the data to be forgotten (a.k.a forgetting data), is a common approach to class unlearning. However, the availability of original data during the unlearning phase is not always guaranteed, leading to the exploration of class unlearning with restricted data access. While current unlearning methods with restricted data access usually generate proxy sample via the trained neural network classifier, they typically focus on training and forgetting balanced data. However, the imbalanced original data can cause trouble for these proxies and unlearning, particularly when the forgetting data consists predominantly of the majority class. To address this issue, we propose the GENerative Imbalanced Unlearning (GENIU) framework. GENIU utilizes a Variational Autoencoder (VAE) to concurrently train a proxy generator alongside the original model. These generated proxies accurately represent each class and are leveraged in the unlearning phase, eliminating the reliance on the original training data. To further mitigate the performance degradation resulting from forgetting the majority class, we introduce an in-batch tuning strategy that works with the generated proxies. GENIU is the first practical framework for class unlearning in imbalanced data settings and restricted data access, ensuring the preservation of essential information for future unlearning. Experimental results confirm the superiority of GENIU over existing methods, establishing its effectiveness in empirical scenarios.
Graph Transductive Defense: a Two-Stage Defense for Graph Membership Inference Attacks
Authors: Peizhi Niu, Chao Pan, Siheng Chen, Olgica Milenkovic
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.07917
Pdf link: https://arxiv.org/pdf/2406.07917
Abstract Graph neural networks (GNNs) have become instrumental in diverse real-world applications, offering powerful graph learning capabilities for tasks such as social networks and medical data analysis. Despite their successes, GNNs are vulnerable to adversarial attacks, including membership inference attacks (MIA), which threaten privacy by identifying whether a record was part of the model's training data. While existing research has explored MIA in GNNs under graph inductive learning settings, the more common and challenging graph transductive learning setting remains understudied in this context. This paper addresses this gap and proposes an effective two-stage defense, Graph Transductive Defense (GTD), tailored to graph transductive learning characteristics. The gist of our approach is a combination of a train-test alternate training schedule and flattening strategy, which successfully reduces the difference between the training and testing loss distributions. Extensive empirical results demonstrate the superior performance of our method (a decrease in attack AUROC by $9.42\%$ and an increase in utility performance by $18.08\%$ on average compared to LBP), highlighting its potential for seamless integration into various classification models with minimal overhead.
Ents: An Efficient Three-party Training Framework for Decision Trees by Communication Optimization
Authors: Guopeng Lin, Weili Han, Wenqiang Ruan, Ruisheng Zhou, Lushan Song, Bingshuai Li, Yunfeng Shao
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.07948
Pdf link: https://arxiv.org/pdf/2406.07948
Abstract Multi-party training frameworks for decision trees based on secure multi-party computation enable multiple parties to train high-performance models on distributed private data with privacy preservation. The training process essentially involves frequent dataset splitting according to the splitting criterion (e.g. Gini impurity). However, existing multi-party training frameworks for decision trees demonstrate communication inefficiency due to the following issues: (1) They suffer from huge communication overhead in securely splitting a dataset with continuous attributes. (2) They suffer from huge communication overhead due to performing almost all the computations on a large ring to accommodate the secure computations for the splitting criterion. In this paper, we are motivated to present an efficient three-party training framework, namely Ents, for decision trees by communication optimization. For the first issue, we present a series of training protocols based on the secure radix sort protocols to efficiently and securely split a dataset with continuous attributes. For the second issue, we propose an efficient share conversion protocol to convert shares between a small ring and a large ring to reduce the communication overhead incurred by performing almost all the computations on a large ring. Experimental results from eight widely used datasets show that Ents outperforms state-of-the-art frameworks by $5.5\times \sim 9.3\times$ in communication sizes and $3.9\times \sim 5.3\times$ in communication rounds. In terms of training time, Ents yields an improvement of $3.5\times \sim 6.7\times$. To demonstrate its practicality, Ents requires less than three hours to securely train a decision tree on a widely used real-world dataset (Skin Segmentation) with more than 245,000 samples in the WAN setting.
DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report)
Authors: Yiping Wang, Yanhao Wang, Cen Chen
Subjects: Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07953
Pdf link: https://arxiv.org/pdf/2406.07953
Abstract The sliding window model of computation captures scenarios in which data are continually arriving in the form of a stream, and only the most recent $w$ items are used for analysis. In this setting, an algorithm needs to accurately track some desired statistics over the sliding window using a small space. When data streams contain sensitive information about individuals, the algorithm is also urgently needed to provide a provable guarantee of privacy. In this paper, we focus on the two fundamental problems of privately (1) estimating the frequency of an arbitrary item and (2) identifying the most frequent items (i.e., \emph{heavy hitters}), in the sliding window model. We propose \textsc{DPSW-Sketch}, a sliding window framework based on the count-min sketch that not only satisfies differential privacy over the stream but also approximates the results for frequency and heavy-hitter queries within bounded errors in sublinear time and space w.r.t.~$w$. Extensive experiments on five real-world and synthetic datasets show that \textsc{DPSW-Sketch} provides significantly better utility-privacy trade-offs than state-of-the-art methods.
Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey
Authors: Shang Wang, Tianqing Zhu, Bo Liu, Ding Ming, Xu Guo, Dayong Ye, Wanlei Zhou
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.07973
Pdf link: https://arxiv.org/pdf/2406.07973
Abstract With the rapid development of artificial intelligence, large language models (LLMs) have made remarkable progress in natural language processing. These models are trained on large amounts of data to demonstrate powerful language understanding and generation capabilities for various applications, from machine translation and chatbots to agents. However, LLMs have exposed a variety of privacy and security issues during their life cycle, which have become the focus of academic and industrial attention. Moreover, these risks LLMs face are pretty different from previous traditional language models. Since current surveys lack a clear taxonomy of unique threat models based on diverse scenarios, we highlight unique privacy and security issues based on five scenarios: pre-training, fine-tuning, RAG system, deploying, and LLM-based agent. Concerning the characteristics of each risk, this survey provides potential threats and countermeasures. The research on attack and defense situations LLMs face can provide feasible research directions, making more areas reap LLMs' benefits.
A Federated Online Restless Bandit Framework for Cooperative Resource Allocation
Authors: Jingwen Tong, Xinran Li, Liqun Fu, Jun Zhang, Khaled B. Letaief
Subjects: Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2406.07992
Pdf link: https://arxiv.org/pdf/2406.07992
Abstract Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.
Metaverse Identity: Core Principles and Critical Challenges
Authors: Liang Yang, Yan Xu, Pan Hui
Subjects: Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2406.08029
Pdf link: https://arxiv.org/pdf/2406.08029
Abstract This paper explores the core principles that should guide the construction and governance of identity in the metaverse and identifies the critical challenges that need to be addressed. Drawing on multidisciplinary theories and perspectives, we propose two core principles for metaverse identity: \emph{Equivalence and Alignment}, and \emph{Fusion and Expansiveness}. The first principle contends that metaverse identities should be consistent with real-world identities in terms of norms and standards, which is crucial for establishing guidelines and safeguarding rights. The second principle emphasizes the necessity for seamless integration and boundless expansion of metaverse identities, transcending real-world limitations to accommodate diverse needs and foster inclusive participation. We argue that these two principles are vital for ensuring the accountability, inclusiveness, and consistency of identity in the metaverse. We also identify five critical challenges: Identity Interoperability, Legal Implications, Privacy and Identity Management, Deepfakes and Synthetic Identities, and Identity Fragmentation and Psychological Well-being. We discuss potential strategies to navigate these challenges. The paper concludes by underscoring the importance of a proactive and collaborative approach to shaping the future of metaverse identity. As the metaverse continues to evolve, it is imperative that we cultivate a thorough understanding of the principles and challenges surrounding identity in this uncharted territory and work collectively to build a metaverse that fosters responsible identity construction and expression.
Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning
Authors: Dariush Wahdany, Matthew Jagielski, Adam Dziedzic, Franziska Boenisch
Subjects: Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.08039
Pdf link: https://arxiv.org/pdf/2406.08039
Abstract Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differential private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high privacy ($\varepsilon\le1)$ and low data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of pure DP. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalancedness regimes demonstrate DPPL's high performance under strong privacy guarantees in challenging private learning setups.
Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding
Authors: Rui Wang, Liping Chen, Kong AiK Lee, Zhen-Hua Ling
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2406.08200
Pdf link: https://arxiv.org/pdf/2406.08200
Abstract Voice anonymization has been developed as a technique for preserving privacy by replacing the speaker's voice in a speech signal with that of a pseudo-speaker, thereby obscuring the original voice attributes from machine recognition and human perception. In this paper, we focus on altering the voice attributes against machine recognition while retaining human perception. We referred to this as the asynchronous voice anonymization. To this end, a speech generation framework incorporating a speaker disentanglement mechanism is employed to generate the anonymized speech. The speaker attributes are altered through adversarial perturbation applied on the speaker embedding, while human perception is preserved by controlling the intensity of perturbation. Experiments conducted on the LibriSpeech dataset showed that the speaker attributes were obscured with their human perception preserved for 60.71% of the processed utterances.
GPT4Rec: Graph Prompt Tuning for Streaming Recommendation
Authors: Peiyan Zhang, Yuchen Yan, Xi Zhang, Liying Kang, Chaozhuo Li, Feiran Huang, Senzhang Wang, Sunghun Kim
Subjects: Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.08229
Pdf link: https://arxiv.org/pdf/2406.08229
Abstract In the realm of personalized recommender systems, the challenge of adapting to evolving user preferences and the continuous influx of new users and items is paramount. Conventional models, typically reliant on a static training-test approach, struggle to keep pace with these dynamic demands. Streaming recommendation, particularly through continual graph learning, has emerged as a novel solution. However, existing methods in this area either rely on historical data replay, which is increasingly impractical due to stringent data privacy regulations; or are inability to effectively address the over-stability issue; or depend on model-isolation and expansion strategies. To tackle these difficulties, we present GPT4Rec, a Graph Prompt Tuning method for streaming Recommendation. Given the evolving user-item interaction graph, GPT4Rec first disentangles the graph patterns into multiple views. After isolating specific interaction patterns and relationships in different views, GPT4Rec utilizes lightweight graph prompts to efficiently guide the model across varying interaction patterns within the user-item graph. Firstly, node-level prompts are employed to instruct the model to adapt to changes in the attributes or properties of individual nodes within the graph. Secondly, structure-level prompts guide the model in adapting to broader patterns of connectivity and relationships within the graph. Finally, view-level prompts are innovatively designed to facilitate the aggregation of information from multiple disentangled views. These prompt designs allow GPT4Rec to synthesize a comprehensive understanding of the graph, ensuring that all vital aspects of the user-item interactions are considered and effectively integrated. Experiments on four diverse real-world datasets demonstrate the effectiveness and efficiency of our proposal.
Dataset Enhancement with Instance-Level Augmentations
Authors: Orest Kupyn, Christian Rupprecht
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.08249
Pdf link: https://arxiv.org/pdf/2406.08249
Abstract We present a method for expanding a dataset by incorporating knowledge from the wide distribution of pre-trained latent diffusion models. Data augmentations typically incorporate inductive biases about the image formation process into the training (e.g. translation, scaling, colour changes, etc.). Here, we go beyond simple pixel transformations and introduce the concept of instance-level data augmentation by repainting parts of the image at the level of object instances. The method combines a conditional diffusion model with depth and edge maps control conditioning to seamlessly repaint individual objects inside the scene, being applicable to any segmentation or detection dataset. Used as a data augmentation method, it improves the performance and generalization of the state-of-the-art salient object detection, semantic segmentation and object detection models. By redrawing all privacy-sensitive instances (people, license plates, etc.), the method is also applicable for data anonymization. We also release fully synthetic and anonymized expansions for popular datasets: COCO, Pascal VOC and DUTS.
A deep cut into Split Federated Self-supervised Learning
Authors: Marcin Przewięźlikowski, Marcin Osial, Bartosz Zieliński, Marek Śmieja
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2406.08267
Pdf link: https://arxiv.org/pdf/2406.08267
Abstract Collaborative self-supervised learning has recently become feasible in highly distributed environments by dividing the network layers between client devices and a central server. However, state-of-the-art methods, such as MocoSFL, are optimized for network division at the initial layers, which decreases the protection of the client data and increases communication overhead. In this paper, we demonstrate that splitting depth is crucial for maintaining privacy and communication efficiency in distributed training. We also show that MocoSFL suffers from a catastrophic quality deterioration for the minimal communication overhead. As a remedy, we introduce Momentum-Aligned contrastive Split Federated Learning (MonAcoSFL), which aligns online and momentum client models during training procedure. Consequently, we achieve state-of-the-art accuracy while significantly reducing the communication overhead, making MonAcoSFL more practical in real-world scenarios.
Designing Child-Centered Content Exposure and Moderation
Authors: Belén Saldías
Subjects: Subjects: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2406.08420
Pdf link: https://arxiv.org/pdf/2406.08420
Abstract Research on children's online experience and computer interaction often overlooks the relationship children have with hidden algorithms that control the content they encounter. Furthermore, it is not only about how children interact with targeted content but also how their development and agency are largely affected by these. By engaging with the body of literature at the intersection of i) human-centered design approaches, ii) exclusion and discrimination in A.I., iii) privacy, transparency, and accountability, and iv) children's online citizenship, this article dives into the question of "How can we approach the design of a child-centered moderation process to (1) include aspects that families value for their children and (2) provide explanations for content appropriateness and removal so that we can scale (according to systems and human needs) the moderation process assisted by A.I.?". This article contributes a sociotechnical highlight of core challenges and opportunities of designing child-centered content control tools. The article concludes by grounding and characterizing design considerations for a child-centered, family-guided moderation system. We hope this work serves as a stepping stone for designers and researchers pursuing children's safety online with an eye on hidden agents controlling children's online experiences and, by extension, the values and opportunities children are exposed to.
Keyword: machine learning

Individual Packet Features are a Risk to Model Generalisation in ML-Based Intrusion Detection
Authors: Kahraman Kostas, Mike Just, Michael A. Lones
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2406.07578
Pdf link: https://arxiv.org/pdf/2406.07578
Abstract Machine learning is increasingly used for intrusion detection in IoT networks. This paper explores the effectiveness of using individual packet features (IPF), which are attributes extracted from a single network packet, such as timing, size, and source-destination information. Through literature review and experiments, we identify the limitations of IPF, showing they can produce misleadingly high detection rates. Our findings emphasize the need for approaches that consider packet interactions for robust intrusion detection. Additionally, we demonstrate that models based on IPF often fail to generalize across datasets, compromising their reliability in diverse IoT environments.
A novel method for identifying rice seed purity based on hybrid machine learning algorithms
Authors: Phan Thi-Thu-Hong, Vo Quoc-Trinh, Nguyen Huu-Du
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2406.07581
Pdf link: https://arxiv.org/pdf/2406.07581
Abstract In the grain industry, the identification of seed purity is a crucial task as it is an important factor in evaluating the quality of seeds. For rice seeds, this property allows for the reduction of unexpected influences of other varieties on rice yield, nutrient composition, and price. However, in practice, they are often mixed with seeds from others. This study proposes a novel method for automatically identifying the rice seed purity of a certain rice variety based on hybrid machine learning algorithms. The main idea is to use deep learning architectures for extracting important features from the raw data and then use machine learning algorithms for classification. Several experiments are conducted following a practical implementation to evaluate the performance of the proposed model. The obtained results show that the novel method improves significantly the performance of existing methods. Thus, it can be applied to design effective identification systems for rice seed purity.
Equivariance via Minimal Frame Averaging for More Symmetries and Efficiency
Authors: Yuchao Lin, Jacob Helwig, Shurui Gui, Shuiwang Ji
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07598
Pdf link: https://arxiv.org/pdf/2406.07598
Abstract We consider achieving equivariance in machine learning systems via frame averaging. Current frame averaging methods involve a costly sum over large frames or rely on sampling-based approaches that only yield approximate equivariance. Here, we propose Minimal Frame Averaging (MFA), a mathematical framework for constructing provably minimal frames that are exactly equivariant. The general foundations of MFA also allow us to extend frame averaging to more groups than previously considered, including the Lorentz group for describing symmetries in space-time, and the unitary group for complex-valued domains. Results demonstrate the efficiency and effectiveness of encoding symmetries via MFA across a diverse range of tasks, including $n$-body simulation, top tagging in collider physics, and relaxed energy prediction. Our code is available at this https URL.
When is an Embedding Model More Promising than Another?
Authors: Maxime Darrin, Philippe Formont, Ismail Ben Ayed, Jackie CK Cheung, Pablo Piantanida
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.07640
Pdf link: https://arxiv.org/pdf/2406.07640
Abstract Embedders play a central role in machine learning, projecting any object into numerical representations that can, in turn, be leveraged to perform various downstream tasks. The evaluation of embedding models typically depends on domain-specific empirical approaches utilizing downstream tasks, primarily because of the lack of a standardized framework for comparison. However, acquiring adequately large and representative datasets for conducting these assessments is not always viable and can prove to be prohibitively expensive and time-consuming. In this paper, we present a unified approach to evaluate embedders. First, we establish theoretical foundations for comparing embedding models, drawing upon the concepts of sufficiency and informativeness. We then leverage these concepts to devise a tractable comparison criterion (information sufficiency), leading to a task-agnostic and self-supervised ranking procedure. We demonstrate experimentally that our approach aligns closely with the capability of embedding models to facilitate various downstream tasks in both natural language processing and molecular biology. This effectively offers practitioners a valuable tool for prioritizing model trials.
Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos
Authors: Duc Pham, Matthew Hansen, Félicie Dhellemmens, Jens Krause, Pia Bideau
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.07680
Pdf link: https://arxiv.org/pdf/2406.07680
Abstract Easily accessible sensors, like drones with diverse onboard sensors, have greatly expanded studying animal behavior in natural environments. Yet, analyzing vast, unlabeled video data, often spanning hours, remains a challenge for machine learning, especially in computer vision. Existing approaches often analyze only a few frames. Our focus is on long-term animal behavior analysis. To address this challenge, we utilize classical probabilistic methods for state estimation, such as particle filtering. By incorporating recent advancements in semantic object segmentation, we enable continuous tracking of rapidly evolving object formations, even in scenarios with limited data availability. Particle filters offer a provably optimal algorithmic structure for recursively adding new incoming information. We propose a novel approach for tracking schools of fish in the open ocean from drone videos. Our framework not only performs classical object tracking in 2D, instead it tracks the position and spatial expansion of the fish school in world coordinates by fusing video data and the drone's on board sensor information (GPS and IMU). The presented framework for the first time allows researchers to study collective behavior of fish schools in its natural social and environmental context in a non-invasive and scalable way.
Adversarial Machine Unlearning
Authors: Zonglin Di, Sixie Yu, Yevgeniy Vorobeychik, Yang Liu
Subjects: Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.07687
Pdf link: https://arxiv.org/pdf/2406.07687
Abstract This paper focuses on the challenge of machine unlearning, aiming to remove the influence of specific training data on machine learning models. Traditionally, the development of unlearning algorithms runs parallel with that of membership inference attacks (MIA), a type of privacy threat to determine whether a data instance was used for training. However, the two strands are intimately connected: one can view machine unlearning through the lens of MIA success with respect to removed data. Recognizing this connection, we propose a game-theoretic framework that integrates MIAs into the design of unlearning algorithms. Specifically, we model the unlearning problem as a Stackelberg game in which an unlearner strives to unlearn specific training data from a model, while an auditor employs MIAs to detect the traces of the ostensibly removed data. Adopting this adversarial perspective allows the utilization of new attack advancements, facilitating the design of unlearning algorithms. Our framework stands out in two ways. First, it takes an adversarial approach and proactively incorporates the attacks into the design of unlearning algorithms. Secondly, it uses implicit differentiation to obtain the gradients that limit the attacker's success, thus benefiting the process of unlearning. We present empirical results to demonstrate the effectiveness of the proposed approach for machine unlearning.
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles
Authors: Nirmalya Thakur, Vanessa Su, Mingchen Shao, Kesha A. Patel, Hongseok Jeong, Victoria Knieling, Andrew Brian
Subjects: Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2406.07693
Pdf link: https://arxiv.org/pdf/2406.07693
Abstract The work of this paper presents a dataset that contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. The dataset is available at this https URL. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.
Sustainable self-supervised learning for speech representations
Authors: Luis Lugo, Valentin Vielzeuf
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2406.07696
Pdf link: https://arxiv.org/pdf/2406.07696
Abstract Sustainable artificial intelligence focuses on data, hardware, and algorithms to make machine learning models more environmentally responsible. In particular, machine learning models for speech representations are computationally expensive, generating environmental concerns because of their high energy consumption. Thus, we propose a sustainable self-supervised model to learn speech representation, combining optimizations in neural layers and training to reduce computing costs. The proposed model improves over a resource-efficient baseline, reducing both memory usage and computing cost estimations. It pretrains using a single GPU in less than a day. On top of that, it improves the error rate performance of the baseline in downstream task evaluations. When comparing it to large speech representation approaches, there is an order of magnitude reduction in memory usage, while computing cost reductions represent almost three orders of magnitude improvement.
Diagnosing and fixing common problems in Bayesian optimization for molecule design
Authors: Austin Tripp, José Miguel Hernández-Lobato
Subjects: Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2406.07709
Pdf link: https://arxiv.org/pdf/2406.07709
Abstract Bayesian optimization (BO) is a principled approach to molecular design tasks. In this paper we explain three pitfalls of BO which can cause poor empirical performance: an incorrect prior width, over-smoothing, and inadequate acquisition function maximization. We show that with these issues addressed, even a basic BO setup is able to achieve the highest overall performance on the PMO benchmark for molecule design (Gao et al, 2022). These results suggest that BO may benefit from more attention in the machine learning for molecules community.
Loss Gradient Gaussian Width based Generalization and Optimization Guarantees
Authors: Arindam Banerjee, Qiaobo Li, Yingxue Zhou
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07712
Pdf link: https://arxiv.org/pdf/2406.07712
Abstract Generalization and optimization guarantees on the population loss in machine learning often rely on uniform convergence based analysis, typically based on the Rademacher complexity of the predictors. The rich representation power of modern models has led to concerns about this approach. In this paper, we present generalization and optimization guarantees in terms of the complexity of the gradients, as measured by the Loss Gradient Gaussian Width (LGGW). First, we introduce generalization guarantees directly in terms of the LGGW under a flexible gradient domination condition, which we demonstrate to hold empirically for deep models. Second, we show that sample reuse in finite sum (stochastic) optimization does not make the empirical gradient deviate from the population gradient as long as the LGGW is small. Third, focusing on deep networks, we present results showing how to bound their LGGW under mild assumptions. In particular, we show that their LGGW can be bounded (a) by the $L_2$-norm of the loss Hessian eigenvalues, which has been empirically shown to be $\tilde{O}(1)$ for commonly used deep models; and (b) in terms of the Gaussian width of the featurizer, i.e., the output of the last-but-one layer. To our knowledge, our generalization and optimization guarantees in terms of LGGW are the first results of its kind, avoid the pitfalls of predictor Rademacher complexity based analysis, and hold considerable promise towards quantitatively tight bounds for deep models.
Unleashing the Power of Transfer Learning Model for Sophisticated Insect Detection: Revolutionizing Insect Classification
Authors: Md. Mahmudul Hasan, SM Shaqib, Ms. Sharmin Akter, Rabiul Alam, Afraz Ul Haque, Shahrun akter khushbu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.07716
Pdf link: https://arxiv.org/pdf/2406.07716
Abstract The purpose of the Insect Detection System for Crop and Plant Health is to keep an eye out for and identify insect infestations in farming areas. By utilizing cutting-edge technology like computer vision and machine learning, the system seeks to identify hazardous insects early and accurately. This would enable prompt response to save crops and maintain optimal plant health. The Method of this study includes Data Acquisition, Preprocessing, Data splitting, Model Implementation and Model evaluation. Different models like MobileNetV2, ResNet152V2, Xecption, Custom CNN was used in this study. In order to categorize insect photos, a Convolutional Neural Network (CNN) based on the ResNet152V2 architecture is constructed and evaluated in this work. Achieving 99% training accuracy and 97% testing accuracy, ResNet152V2 demonstrates superior performance among four implemented models. The results highlight its potential for real-world applications in insect classification and entomology studies, emphasizing efficiency and accuracy. To ensure food security and sustain agricultural output globally, finding insects is crucial. Cutting-edge technology, such as ResNet152V2 models, greatly influence automating and improving the accuracy of insect identification. Efficient insect detection not only minimizes crop losses but also enhances agricultural productivity, contributing to sustainable food production. This underscores the pivotal role of technology in addressing challenges related to global food security.
Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph Analysis
Authors: Jesmin Jahan Tithi, Fabio Checconi, Fabrizio Petrini
Subjects: Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2406.07727
Pdf link: https://arxiv.org/pdf/2406.07727
Abstract Multi-hop reasoning (MHR) is a process in artificial intelligence and natural language processing where a system needs to make multiple inferential steps to arrive at a conclusion or answer. In the context of knowledge graphs or databases, it involves traversing multiple linked entities and relationships to understand complex queries or perform tasks requiring a deeper understanding. Multi-hop reasoning is a critical function in various applications, including question answering, knowledge base completion, and link prediction. It has garnered significant interest in artificial intelligence, machine learning, and graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale graphs, diverging from the traditional emphasis on accuracy which is an orthogonal goal. We introduce a novel parallel algorithm that harnesses domain-specific learned embeddings to efficiently identify the top K paths between vertices in a knowledge graph to find the best answers to a three-hop query. Our contributions are: (1) We present a new parallel algorithm to enhance MHR performance, scalability and efficiency. (2) We demonstrate the algorithm's superior performance on leading-edge Intel and AMD architectures through empirical results. We showcase the algorithm's practicality through a case study on identifying academic affiliations of potential Turing Award laureates in Deep Learning, highlighting its capability to handle intricate entity relationships. This demonstrates the potential of our approach to enabling high-performance MHR, useful to navigate the growing complexity of modern knowledge graphs.
DualBind: A Dual-Loss Framework for Protein-Ligand Binding Affinity Prediction
Authors: Meng Liu, Saee Gopal Paliwal
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2406.07770
Pdf link: https://arxiv.org/pdf/2406.07770
Abstract Accurate prediction of protein-ligand binding affinities is crucial for drug development. Recent advances in machine learning show promising results on this task. However, these methods typically rely heavily on labeled data, which can be scarce or unreliable, or they rely on assumptions like Boltzmann-distributed data that may not hold true in practice. Here, we present DualBind, a novel framework that integrates supervised mean squared error (MSE) with unsupervised denoising score matching (DSM) to accurately learn the binding energy function. DualBind not only addresses the limitations of DSM-only models by providing more accurate absolute affinity predictions but also improves generalizability and reduces reliance on labeled data compared to MSE-only models. Our experimental results demonstrate that DualBind excels in predicting binding affinities and can effectively utilize both labeled and unlabeled data to enhance performance.
Evolutionary Computation and Explainable AI: A Roadmap to Transparent Intelligent Systems
Authors: Ryan Zhou, Jaume Bacardit, Alexander Brownlee, Stefano Cagnoni, Martin Fyvie, Giovanni Iacca, John McCall, Niki van Stein, David Walker, Ting Hu
Subjects: Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07811
Pdf link: https://arxiv.org/pdf/2406.07811
Abstract AI methods are finding an increasing number of applications, but their often black-box nature has raised concerns about accountability and trust. The field of explainable artificial intelligence (XAI) has emerged in response to the need for human understanding of AI models. Evolutionary computation (EC), as a family of powerful optimization and learning tools, has significant potential to contribute to XAI. In this paper, we provide an introduction to XAI and review various techniques in current use for explaining machine learning (ML) models. We then focus on how EC can be used in XAI, and review some XAI approaches which incorporate EC techniques. Additionally, we discuss the application of XAI principles within EC itself, examining how these principles can shed some light on the behavior and outcomes of EC algorithms in general, on the (automatic) configuration of these algorithms, and on the underlying problem landscapes that these algorithms optimize. Finally, we discuss some open challenges in XAI and opportunities for future research in this field using EC. Our aim is to demonstrate that EC is well-suited for addressing current problems in explainability and to encourage further exploration of these methods to contribute to the development of more transparent and trustworthy ML models and EC algorithms.
Scaling Manipulation Learning with Visual Kinematic Chain Prediction
Authors: Xinyu Zhang, Yuhan Liu, Haonan Chang, Abdeslam Boularias
Subjects: Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.07837
Pdf link: https://arxiv.org/pdf/2406.07837
Abstract Learning general-purpose models from diverse datasets has achieved great success in machine learning. In robotics, however, existing methods in multi-task learning are typically constrained to a single robot and workspace, while recent work such as RT-X requires a non-trivial action normalization procedure to manually bridge the gap between different action spaces in diverse environments. In this paper, we propose the visual kinematics chain as a precise and universal representation of quasi-static actions for robot learning over diverse environments, which requires no manual adjustment since the visual kinematic chains can be automatically obtained from the robot's model and camera parameters. We propose the Visual Kinematics Transformer (VKT), a convolution-free architecture that supports an arbitrary number of camera viewpoints, and that is trained with a single objective of forecasting kinematic structures through optimal point-set matching. We demonstrate the superior performance of VKT over BC transformers as a general agent on Calvin, RLBench, Open-X, and real robot manipulation tasks. Video demonstrations can be found at this https URL.
Asymptotically Optimal Regret for Black-Box Predict-then-Optimize
Authors: Samuel Tan, Peter I. Frazier
Subjects: Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2406.07866
Pdf link: https://arxiv.org/pdf/2406.07866
Abstract We consider the predict-then-optimize paradigm for decision-making in which a practitioner (1) trains a supervised learning model on historical data of decisions, contexts, and rewards, and then (2) uses the resulting model to make future binary decisions for new contexts by finding the decision that maximizes the model's predicted reward. This approach is common in industry. Past analysis assumes that rewards are observed for all actions for all historical contexts, which is possible only in problems with special structure. Motivated by problems from ads targeting and recommender systems, we study new black-box predict-then-optimize problems that lack this special structure and where we only observe the reward from the action taken. We present a novel loss function, which we call Empirical Soft Regret (ESR), designed to significantly improve reward when used in training compared to classical accuracy-based metrics like mean-squared error. This loss function targets the regret achieved when taking a suboptimal decision; because the regret is generally not differentiable, we propose a differentiable "soft" regret term that allows the use of neural networks and other flexible machine learning models dependent on gradient-based training. In the particular case of paired data, we show theoretically that optimizing our loss function yields asymptotically optimal regret within the class of supervised learning models. We also show our approach significantly outperforms state-of-the-art algorithms on real-world decision-making problems in news recommendation and personalized healthcare compared to benchmark methods from contextual bandits and conditional average treatment effect estimation.
A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects
Authors: Jun Bai, Di Wu, Tristan Shelley, Peter Schubel, David Twine, John Russell, Xuesen Zeng, Ji Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2406.07880
Pdf link: https://arxiv.org/pdf/2406.07880
Abstract Material defects (MD) represent a primary challenge affecting product performance and giving rise to safety issues in related products. The rapid and accurate identification and localization of MD constitute crucial research endeavours in addressing contemporary challenges associated with MD. Although conventional non-destructive testing methods such as ultrasonic and X-ray approaches have mitigated issues related to low efficiency in manual inspections, they struggle to meet the diverse requirements of high precision, real-time speed, automation, and intelligence. In recent years, propelled by the swift advancement of machine learning (ML) technologies, particularly exemplified by deep learning, ML has swiftly emerged as the core technology and a prominent research direction for material defect detection (MDD). Through a comprehensive review of the latest literature, we systematically survey the ML techniques applied in MDD into five categories: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning, and generative learning. We provide a detailed analysis of the main principles and techniques used, together with the advantages and potential challenges associated with these techniques. Furthermore, the survey focuses on the techniques for defect detection in composite materials, which are important types of materials enjoying increasingly wide application in various industries such as aerospace, automotive, construction, and renewable energy. Finally, the survey explores potential future directions in MDD utilizing ML technologies. This comprehensive survey not only consolidates existing literature on ML-based MDD technologies but also serves as a foundational reference for future researchers and industrial practitioners, providing valuable insights and guidance in developing advanced and efficient MDD systems.
Designing a Dashboard for Transparency and Control of Conversational AI
Authors: Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow, Martin Wattenberg, Fernanda Viégas
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2406.07882
Pdf link: https://arxiv.org/pdf/2406.07882
Abstract Conversational LLMs function as black box systems, leaving users guessing about why they see the output they do. This lack of transparency is potentially problematic, especially given concerns around bias and truthfulness. To address this issue, we present an end-to-end prototype-connecting interpretability techniques with user experience design-that seeks to make chatbots more transparent. We begin by showing evidence that a prominent open-source LLM has a "user model": examining the internal state of the system, we can extract data related to a user's age, gender, educational level, and socioeconomic status. Next, we describe the design of a dashboard that accompanies the chatbot interface, displaying this user model in real time. The dashboard can also be used to control the user model and the system's behavior. Finally, we discuss a study in which users conversed with the instrumented system. Our results suggest that users appreciate seeing internal states, which helped them expose biased behavior and increased their sense of control. Participants also made valuable suggestions that point to future directions for both design and machine learning research. The project page and video demo of our TalkTuner system are available at this https URL
GENIU: A Restricted Data Access Unlearning for Imbalanced Data
Authors: Chenhao Zhang, Shaofei Shen, Yawen Zhao, Weitong Tony Chen, Miao Xu
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.07885
Pdf link: https://arxiv.org/pdf/2406.07885
Abstract With the increasing emphasis on data privacy, the significance of machine unlearning has grown substantially. Class unlearning, which involves enabling a trained model to forget data belonging to a specific class learned before, is important as classification tasks account for the majority of today's machine learning as a service (MLaaS). Retraining the model on the original data, excluding the data to be forgotten (a.k.a forgetting data), is a common approach to class unlearning. However, the availability of original data during the unlearning phase is not always guaranteed, leading to the exploration of class unlearning with restricted data access. While current unlearning methods with restricted data access usually generate proxy sample via the trained neural network classifier, they typically focus on training and forgetting balanced data. However, the imbalanced original data can cause trouble for these proxies and unlearning, particularly when the forgetting data consists predominantly of the majority class. To address this issue, we propose the GENerative Imbalanced Unlearning (GENIU) framework. GENIU utilizes a Variational Autoencoder (VAE) to concurrently train a proxy generator alongside the original model. These generated proxies accurately represent each class and are leveraged in the unlearning phase, eliminating the reliance on the original training data. To further mitigate the performance degradation resulting from forgetting the majority class, we introduce an in-batch tuning strategy that works with the generated proxies. GENIU is the first practical framework for class unlearning in imbalanced data settings and restricted data access, ensuring the preservation of essential information for future unlearning. Experimental results confirm the superiority of GENIU over existing methods, establishing its effectiveness in empirical scenarios.
Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis
Authors: Paolo Bonetti, Alberto Maria Metelli, Marcello Restelli
Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2406.07991
Pdf link: https://arxiv.org/pdf/2406.07991
Abstract Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance. Previous works have proposed approaches to MTL that can be divided into feature learning, focused on the identification of a common feature representation, and task clustering, where similar tasks are grouped together. In this paper, we propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features. First, we propose a bias-variance analysis for regression models with additive Gaussian noise, where we provide a general expression of the asymptotic bias and variance of a task, considering a linear regression trained on aggregated input features and an aggregated target. Then, we exploit this analysis to provide a two-phase MTL algorithm (NonLinCTFA). Firstly, this method partitions the tasks into clusters and aggregates each obtained group of targets with their mean. Then, for each aggregated task, it aggregates subsets of features with their mean in a dimensionality reduction fashion. In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is further motivated by applications to Earth science. Finally, we validate the algorithms on synthetic data, showing the effect of different parameters and real-world datasets, exploring the validity of the proposed methodology on classical datasets, recent baselines, and Earth science applications.
Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning
Authors: Dariush Wahdany, Matthew Jagielski, Adam Dziedzic, Franziska Boenisch
Subjects: Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.08039
Pdf link: https://arxiv.org/pdf/2406.08039
Abstract Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differential private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high privacy ($\varepsilon\le1)$ and low data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of pure DP. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalancedness regimes demonstrate DPPL's high performance under strong privacy guarantees in challenging private learning setups.
Efficient Network Traffic Feature Sets for IoT Intrusion Detection
Authors: Miguel Silva, João Vitorino, Eva Maia, Isabel Praça
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2406.08042
Pdf link: https://arxiv.org/pdf/2406.08042
Abstract The use of Machine Learning (ML) models in cybersecurity solutions requires high-quality data that is stripped of redundant, missing, and noisy information. By selecting the most relevant features, data integrity and model efficiency can be significantly improved. This work evaluates the feature sets provided by a combination of different feature selection methods, namely Information Gain, Chi-Squared Test, Recursive Feature Elimination, Mean Absolute Deviation, and Dispersion Ratio, in multiple IoT network datasets. The influence of the smaller feature sets on both the classification performance and the training time of ML models is compared, with the aim of increasing the computational efficiency of IoT intrusion detection. Overall, the most impactful features of each dataset were identified, and the ML models obtained higher computational efficiency while preserving a good generalization, showing little to no difference between the sets.
A novel approach to graph distinction through GENEOs and permutants
Authors: Giovanni Bocchi, Massimo Ferri, Patrizio Frosini
Subjects: Subjects: Machine Learning (cs.LG); Group Theory (math.GR)
Arxiv link: https://arxiv.org/abs/2406.08045
Pdf link: https://arxiv.org/pdf/2406.08045
Abstract The theory of Group Equivariant Non-Expansive Operators (GENEOs) was initially developed in Topological Data Analysis for the geometric approximation of data observers, including their invariances and symmetries. This paper departs from that line of research and explores the use of GENEOs for distinguishing $r$-regular graphs up to isomorphisms. In doing so, we aim to test the capabilities and flexibility of these operators. Our experiments show that GENEOs offer a good compromise between efficiency and computational cost in comparing $r$-regular graphs, while their actions on data are easily interpretable. This supports the idea that GENEOs could be a general-purpose approach to discriminative problems in Machine Learning when some structural information about data and observers is explicitly given.
US College Net Price Prediction Comparing ML Regression Models
Authors: Zalak Patel, Ayushi Porwal, Kajal Bhandare, Jongwook Woo
Subjects: Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2406.08071
Pdf link: https://arxiv.org/pdf/2406.08071
Abstract This paper will illustrate the usage of Machine Learning algorithms on US College Scorecard datasets. For this paper, we will use our knowledge, research, and development of a predictive model to compare the results of all the models and predict the public and private net prices. This paper focuses on analyzing US College Scorecard data from data published on government websites. Our goal is to use four machine learning regression models to develop a predictive model to forecast the equitable net cost for every college, encompassing both public institutions and private, whether for-profit or nonprofit.
Balancing Molecular Information and Empirical Data in the Prediction of Physico-Chemical Properties
Authors: Johannes Zenn, Dominik Gond, Fabian Jirasek, Robert Bamler
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.08075
Pdf link: https://arxiv.org/pdf/2406.08075
Abstract Predicting the physico-chemical properties of pure substances and mixtures is a central task in thermodynamics. Established prediction methods range from fully physics-based ab-initio calculations, which are only feasible for very simple systems, over descriptor-based methods that use some information on the molecules to be modeled together with fitted model parameters (e.g., quantitative-structure-property relationship methods or classical group contribution methods), to representation-learning methods, which may, in extreme cases, completely ignore molecular descriptors and extrapolate only from existing data on the property to be modeled (e.g., matrix completion methods). In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine learning literature, which uses uncertainty estimates to trade off between the two approaches. The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example. The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.
Learnable & Interpretable Model Combination in Dynamic Systems Modeling
Authors: Tobias Thummerer, Lars Mikelsons
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.08093
Pdf link: https://arxiv.org/pdf/2406.08093
Abstract One of the core concepts in science, and something that happens intuitively in every-day dynamic systems modeling, is the combination of models or methods. Especially in dynamical systems modeling, often two or more structures are combined to obtain a more powerful or efficient architecture regarding a specific application (area). Further, even physical simulations are combined with machine learning architectures, to increase prediction accuracy or optimize the computational performance. In this work, we shortly discuss, which types of models are usually combined and propose a model interface that is capable of expressing a width variety of mixed algebraic, discrete and differential equation based models. Further, we examine different established, as well as new ways of combining these models from a system theoretical point of view and highlight two challenges - algebraic loops and local event affect functions in discontinuous models - that require a special approach. Finally, we propose a new wildcard topology, that is capable of describing the generic connection between two combined models in an easy to interpret fashion that can be learned as part of a gradient based optimization procedure. The contributions of this paper are highlighted at a proof of concept: Different connection topologies between two models are learned, interpreted and compared applying the proposed methodology and software implementation.
Scalable Defect Detection via Traversal on Code Graph
Authors: Zhengyao Liu, Xitong Zhong, Xingjing Deng, Shuo Hong, Xiang Gao, Hailong Sun
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2406.08098
Pdf link: https://arxiv.org/pdf/2406.08098
Abstract Detecting defects and vulnerabilities in the early stage has long been a challenge in software engineering. Static analysis, a technique that inspects code without execution, has emerged as a key strategy to address this challenge. Among recent advancements, the use of graph-based representations, particularly Code Property Graph (CPG), has gained traction due to its comprehensive depiction of code structure and semantics. Despite the progress, existing graph-based analysis tools still face performance and scalability issues. The main bottleneck lies in the size and complexity of CPG, which makes analyzing large codebases inefficient and memory-consuming. Also, query rules used by the current tools can be over-specific. Hence, we introduce QVoG, a graph-based static analysis platform for detecting defects and vulnerabilities. It employs a compressed CPG representation to maintain a reasonable graph size, thereby enhancing the overall query efficiency. Based on the CPG, it also offers a declarative query language to simplify the queries. Furthermore, it takes a step forward to integrate machine learning to enhance the generality of vulnerability detection. For projects consisting of 1,000,000+ lines of code, QVoG can complete analysis in approximately 15 minutes, as opposed to 19 minutes with CodeQL.
Confidence Interval Estimation of Predictive Performance in the Context of AutoML
Authors: Konstantinos Paraschakis, Andrea Castellani, Giorgos Borboudakis, Ioannis Tsamardinos
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2406.08099
Pdf link: https://arxiv.org/pdf/2406.08099
Abstract Any supervised machine learning analysis is required to provide an estimate of the out-of-sample predictive performance. However, it is imperative to also provide a quantification of the uncertainty of this performance in the form of a confidence or credible interval (CI) and not just a point estimate. In an AutoML setting, estimating the CI is challenging due to the ``winner's curse", i.e., the bias of estimation due to cross-validating several machine learning pipelines and selecting the winning one. In this work, we perform a comparative evaluation of 9 state-of-the-art methods and variants in CI estimation in an AutoML setting on a corpus of real and simulated datasets. The methods are compared in terms of inclusion percentage (does a 95\% CI include the true performance at least 95\% of the time), CI tightness (tighter CIs are preferable as being more informative), and execution time. The evaluation is the first one that covers most, if not all, such methods and extends previous work to imbalanced and small-sample tasks. In addition, we present a variant, called BBC-F, of an existing method (the Bootstrap Bias Correction, or BBC) that maintains the statistical properties of the BBC but is more computationally efficient. The results support that BBC-F and BBC dominate the other methods in all metrics measured.
Underneath the Numbers: Quantitative and Qualitative Gender Fairness in LLMs for Depression Prediction
Authors: Micol Spitale, Jiaee Cheong, Hatice Gunes
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2406.08183
Pdf link: https://arxiv.org/pdf/2406.08183
Abstract Recent studies show bias in many machine learning models for depression detection, but bias in LLMs for this task remains unexplored. This work presents the first attempt to investigate the degree of gender bias present in existing LLMs (ChatGPT, LLaMA 2, and Bard) using both quantitative and qualitative approaches. From our quantitative evaluation, we found that ChatGPT performs the best across various performance metrics and LLaMA 2 outperforms other LLMs in terms of group fairness metrics. As qualitative fairness evaluation remains an open research question we propose several strategies (e.g., word count, thematic analysis) to investigate whether and how a qualitative evaluation can provide valuable insights for bias analysis beyond what is possible with quantitative evaluation. We found that ChatGPT consistently provides a more comprehensive, well-reasoned explanation for its prediction compared to LLaMA 2. We have also identified several themes adopted by LLMs to qualitatively evaluate gender fairness. We hope our results can be used as a stepping stone towards future attempts at improving qualitative evaluation of fairness for LLMs especially for high-stakes tasks such as depression detection.
Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation
Authors: Christopher Bockel-Rickermann, Toon Vanderschueren, Tim Verdonck, Wouter Verbeke
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.08206
Pdf link: https://arxiv.org/pdf/2406.08206
Abstract Estimating conditional average dose responses (CADR) is an important but challenging problem. Estimators must correctly model the potentially complex relationships between covariates, interventions, doses, and outcomes. In recent years, the machine learning community has shown great interest in developing tailored CADR estimators that target specific challenges. Their performance is typically evaluated against other methods on (semi-) synthetic benchmark datasets. Our paper analyses this practice and shows that using popular benchmark datasets without further analysis is insufficient to judge model performance. Established benchmarks entail multiple challenges, whose impacts must be disentangled. Therefore, we propose a novel decomposition scheme that allows the evaluation of the impact of five distinct components contributing to CADR estimator performance. We apply this scheme to eight popular CADR estimators on four widely-used benchmark datasets, running nearly 1,500 individual experiments. Our results reveal that most established benchmarks are challenging for reasons different from their creators' claims. Notably, confounding, the key challenge tackled by most estimators, is not an issue in any of the considered datasets. We discuss the major implications of our findings and present directions for future research.
A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks
Authors: Sinclair Hudson, Sophia Jit, Boyue Caroline Hu, Marsha Chechik
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2406.08216
Pdf link: https://arxiv.org/pdf/2406.08216
Abstract Large Language Models (LLMs) are rapidly becoming ubiquitous both as stand-alone tools and as components of current and future software systems. To enable usage of LLMs in the high-stake or safety-critical systems of 2030, they need to undergo rigorous testing. Software Engineering (SE) research on testing Machine Learning (ML) components and ML-based systems has systematically explored many topics such as test input generation and robustness. We believe knowledge about tools, benchmarks, research and practitioner views related to LLM testing needs to be similarly organized. To this end, we present a taxonomy of LLM testing topics and conduct preliminary studies of state of the art and practice approaches to research, open-source tools and benchmarks for LLM testing, mapping results onto this taxonomy. Our goal is to identify gaps requiring more research and engineering effort and inspire a clearer communication between LLM practitioners and the SE research community.
The Importance of Positional Encoding Initialization in Transformers for Relational Reasoning
Authors: Takuya Ito, Luca Cocchi, Tim Klinger, Parikshit Ram, Murray Campbell, Luke Hearne
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.08272
Pdf link: https://arxiv.org/pdf/2406.08272
Abstract Relational reasoning refers to the ability to infer and understand the relations between multiple entities. In humans, this ability underpins many higher cognitive functions, such as problem solving and decision-making, and has been reliably linked to fluid intelligence. Despite machine learning models making impressive advances across various domains, such as natural language processing and vision, the extent to which such models can perform relational reasoning tasks remains unclear. Here we study the importance of positional encoding (PE) for relational reasoning in the Transformer, and find that a learnable PE outperforms all other commonly-used PEs (e.g., absolute, relative, rotary, etc.). Moreover, we find that when using a PE with a learnable parameter, the choice of initialization greatly influences the learned representations and its downstream generalization performance. Specifically, we find that a learned PE initialized from a small-norm distribution can 1) uncover ground-truth position information, 2) generalize in the presence of noisy inputs, and 3) produce behavioral patterns that are consistent with human performance. Our results shed light on the importance of learning high-performing and robust PEs during relational reasoning tasks, which will prove useful for tasks in which ground truth positions are not provided or not known.
Large Language Model(LLM) assisted End-to-End Network Health Management based on Multi-Scale Semanticization
Authors: Fengxiao Tang, Xiaonan Wang, Xun Yuan, Linfeng Luo, Ming Zhao, Nei Kato
Subjects: Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2406.08305
Pdf link: https://arxiv.org/pdf/2406.08305
Abstract Network device and system health management is the foundation of modern network operations and maintenance. Traditional health management methods, relying on expert identification or simple rule-based algorithms, struggle to cope with the dynamic heterogeneous networks (DHNs) environment. Moreover, current state-of-the-art distributed anomaly detection methods, which utilize specific machine learning techniques, lack multi-scale adaptivity for heterogeneous device information, resulting in unsatisfactory diagnostic accuracy for DHNs. In this paper, we develop an LLM-assisted end-to-end intelligent network health management framework. The framework first proposes a Multi-Scale Semanticized Anomaly Detection Model (MSADM), incorporating semantic rule trees with an attention mechanism to address the multi-scale anomaly detection problem in DHNs. Secondly, a chain-of-thought-based large language model is embedded in downstream to adaptively analyze the fault detection results and produce an analysis report with detailed fault information and optimization strategies. Experimental results show that the accuracy of our proposed MSADM for heterogeneous network entity anomaly detection is as high as 91.31\%.
A Survey of Pipeline Tools for Data Engineering
Authors: Anthony Mbata, Yaji Sripada, Mingjun Zhong
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
Arxiv link: https://arxiv.org/abs/2406.08335
Pdf link: https://arxiv.org/pdf/2406.08335
Abstract Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues associated with data and accomplish some data engineering tasks from data ingestion through data preparation to utilization as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT), pipelines for Data Integration, Ingestion, and Transformation, Data Pipeline Orchestration and Workflow Management, and Machine Learning Pipelines. The survey also provides a broad outline of the utilization with examples within these broad groups and finally, a discussion is presented with case studies indicating the usage of pipeline tools for data engineering. The studies present some first-user application experiences with sample data, some complexities of the applied pipeline, and a summary note of approaches to using these tools to prepare data for machine learning.
Continuous-Time Digital Twin with Analogue Memristive Neural Ordinary Differential Equation Solver
Authors: Hegan Chen, Jichang Yang, Jia Chen, Songqi Wang, Shaocong Wang, Dingchen Wang, Xinyu Tian, Yifei Yu, Xi Chen, Yinan Lin, Yangu He, Xiaoshan Wu, Yi Li, Xinyuan Zhang, Ning Lin, Meng Xu, Yi Li, Xumeng Zhang, Zhongrui Wang, Han Wang, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu
Subjects: Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2406.08343
Pdf link: https://arxiv.org/pdf/2406.08343
Abstract Digital twins, the cornerstone of Industry 4.0, replicate real-world entities through computer models, revolutionising fields such as manufacturing management and industrial automation. Recent advances in machine learning provide data-driven methods for developing digital twins using discrete-time data and finite-depth models on digital computers. However, this approach fails to capture the underlying continuous dynamics and struggles with modelling complex system behaviour. Additionally, the architecture of digital computers, with separate storage and processing units, necessitates frequent data transfers and Analogue-Digital (A/D) conversion, thereby significantly increasing both time and energy costs. Here, we introduce a memristive neural ordinary differential equation (ODE) solver for digital twins, which is capable of capturing continuous-time dynamics and facilitates the modelling of complex systems using an infinite-depth model. By integrating storage and computation within analogue memristor arrays, we circumvent the von Neumann bottleneck, thus enhancing both speed and energy efficiency. We experimentally validate our approach by developing a digital twin of the HP memristor, which accurately extrapolates its nonlinear dynamics, achieving a 4.2-fold projected speedup and a 41.4-fold projected decrease in energy consumption compared to state-of-the-art digital hardware, while maintaining an acceptable error margin. Additionally, we demonstrate scalability through experimentally grounded simulations of Lorenz96 dynamics, exhibiting projected performance improvements of 12.6-fold in speed and 189.7-fold in energy efficiency relative to traditional digital approaches. By harnessing the capabilities of fully analogue computing, our breakthrough accelerates the development of digital twins, offering an efficient and rapid solution to meet the demands of Industry 4.0.
Improving Noise Robustness through Abstractions and its Impact on Machine Learning
Authors: Alfredo Ibias (1), Karol Capala (1), Varun Ravi Varma (1), Anna Drozdz (1), Jose Sousa (1) ((1) Personal Health Data Science, Sano - Centre for Computational Personalised Medicine)
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.08428
Pdf link: https://arxiv.org/pdf/2406.08428
Abstract Noise is a fundamental problem in learning theory with huge effects in the application of Machine Learning (ML) methods, due to real world data tendency to be noisy. Additionally, introduction of malicious noise can make ML methods fail critically, as is the case with adversarial attacks. Thus, finding and developing alternatives to improve robustness to noise is a fundamental problem in ML. In this paper, we propose a method to deal with noise: mitigating its effect through the use of data abstractions. The goal is to reduce the effect of noise over the model's performance through the loss of information produced by the abstraction. However, this information loss comes with a cost: it can result in an accuracy reduction due to the missing information. First, we explored multiple methodologies to create abstractions, using the training dataset, for the specific case of numerical data and binary classification tasks. We also tested how these abstractions can affect robustness to noise with several experiments that explore the robustness of an Artificial Neural Network to noise when trained using raw data \emph{vs} when trained using abstracted data. The results clearly show that using abstractions is a viable approach for developing noise robust ML methods.
ORES-Inspect: A technology probe for machine learning audits on enwiki
Authors: Zachary Levonian, Lauren Hagen, Lu Li, Jada Lilleboe, Solvejg Wastvedt, Aaron Halfaker, Loren Terveen
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2406.08453
Pdf link: https://arxiv.org/pdf/2406.08453
Abstract Auditing the machine learning (ML) models used on Wikipedia is important for ensuring that vandalism-detection processes remain fair and effective. However, conducting audits is challenging because stakeholders have diverse priorities and assembling evidence for a model's [in]efficacy is technically complex. We designed an interface to enable editors to learn about and audit the performance of the ORES edit quality model. ORES-Inspect is an open-source web tool and a provocative technology probe for researching how editors think about auditing the many ML models used on Wikipedia. We describe the design of ORES-Inspect and our plans for further research with this system.
Nonconvex Federated Learning on Compact Smooth Submanifolds With Heterogeneous Data
Authors: Jiaojiao Zhang, Jiang Hu, Anthony Man-Cho So, Mikael Johansson
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2406.08465
Pdf link: https://arxiv.org/pdf/2406.08465
Abstract Many machine learning tasks, such as principal component analysis and low-rank matrix completion, give rise to manifold optimization problems. Although there is a large body of work studying the design and analysis of algorithms for manifold optimization in the centralized setting, there are currently very few works addressing the federated setting. In this paper, we consider nonconvex federated learning over a compact smooth submanifold in the setting of heterogeneous client data. We propose an algorithm that leverages stochastic Riemannian gradients and a manifold projection operator to improve computational efficiency, uses local updates to improve communication efficiency, and avoids client drift. Theoretically, we show that our proposed algorithm converges sub-linearly to a neighborhood of a first-order optimal solution by using a novel analysis that jointly exploits the manifold structure and properties of the loss functions. Numerical experiments demonstrate that our algorithm has significantly smaller computational and communication overhead than existing methods.
DafnyBench: A Benchmark for Formal Software Verification
Authors: Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark
Subjects: Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2406.08467
Pdf link: https://arxiv.org/pdf/2406.08467
Abstract We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

qiaoyuet / arxiv_daily

New submissions for Thursday, 13 June 2024 (showing 438 of 438 entries ) #120

Keyword: differential privacy

Label Smoothing Improves Machine Unlearning

DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report)

Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning

Keyword: privacy

An Effective Approach to Scramble Multiple Diagnostic Imageries Using Chaos-Based Cryptography

Guardians of Anonymity: Exploring Tactics to Combat Cyber Threats in Onion Routing Environments

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Adversarial Machine Unlearning

Label Smoothing Improves Machine Unlearning

Regularizing and Aggregating Clients with Class Distribution for Personalized Federated Learning

Small Scale Data-Free Knowledge Distillation

GENIU: A Restricted Data Access Unlearning for Imbalanced Data

Graph Transductive Defense: a Two-Stage Defense for Graph Membership Inference Attacks

Ents: An Efficient Three-party Training Framework for Decision Trees by Communication Optimization

DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report)

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey

A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

Metaverse Identity: Core Principles and Critical Challenges

Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning

Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding

GPT4Rec: Graph Prompt Tuning for Streaming Recommendation

Dataset Enhancement with Instance-Level Augmentations

A deep cut into Split Federated Self-supervised Learning

Designing Child-Centered Content Exposure and Moderation

Keyword: machine learning

Individual Packet Features are a Risk to Model Generalisation in ML-Based Intrusion Detection

A novel method for identifying rice seed purity based on hybrid machine learning algorithms

Equivariance via Minimal Frame Averaging for More Symmetries and Efficiency

When is an Embedding Model More Promising than Another?

Watching Swarm Dynamics from Above: A Framework for Advanced Object Tracking in Drone Videos

Adversarial Machine Unlearning

A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles

Sustainable self-supervised learning for speech representations

Diagnosing and fixing common problems in Bayesian optimization for molecule design

Loss Gradient Gaussian Width based Generalization and Optimization Guarantees

Unleashing the Power of Transfer Learning Model for Sophisticated Insect Detection: Revolutionizing Insect Classification

Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph Analysis

DualBind: A Dual-Loss Framework for Protein-Ligand Binding Affinity Prediction

Evolutionary Computation and Explainable AI: A Roadmap to Transparent Intelligent Systems

Scaling Manipulation Learning with Visual Kinematic Chain Prediction

Asymptotically Optimal Regret for Black-Box Predict-then-Optimize

A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects

Designing a Dashboard for Transparency and Control of Conversational AI

GENIU: A Restricted Data Access Unlearning for Imbalanced Data

Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis

Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning

Efficient Network Traffic Feature Sets for IoT Intrusion Detection

A novel approach to graph distinction through GENEOs and permutants

US College Net Price Prediction Comparing ML Regression Models

Balancing Molecular Information and Empirical Data in the Prediction of Physico-Chemical Properties

Learnable & Interpretable Model Combination in Dynamic Systems Modeling

Scalable Defect Detection via Traversal on Code Graph

Confidence Interval Estimation of Predictive Performance in the Context of AutoML

Underneath the Numbers: Quantitative and Qualitative Gender Fairness in LLMs for Depression Prediction

Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation

A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks

The Importance of Positional Encoding Initialization in Transformers for Relational Reasoning

Large Language Model(LLM) assisted End-to-End Network Health Management based on Multi-Scale Semanticization

A Survey of Pipeline Tools for Data Engineering

Continuous-Time Digital Twin with Analogue Memristive Neural Ordinary Differential Equation Solver

Improving Noise Robustness through Abstractions and its Impact on Machine Learning

ORES-Inspect: A technology probe for machine learning audits on enwiki

Nonconvex Federated Learning on Compact Smooth Submanifolds With Heterogeneous Data

DafnyBench: A Benchmark for Formal Software Verification