New submissions for Tuesday, 21 May 2024 (showing 583 of 583 entries)

Keyword: differential privacy

Sketches-based join size estimation under local differential privacy

Authors: Meifan Zhang, Xin Liu, Lihua Yin
Subjects: Databases (cs.DB); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.11419
Pdf link: https://arxiv.org/pdf/2405.11419
Abstract Join size estimation on sensitive data poses a risk of privacy leakage. Local differential privacy (LDP) is a solution to preserve privacy while collecting sensitive data, but it introduces significant noise when dealing with sensitive join attributes that have large domains. Employing probabilistic structures such as sketches is a way to handle large domains, but it leads to hash-collision errors. To achieve accurate estimations, it is necessary to reduce both the noise error and hash-collision error. To tackle the noise error caused by protecting sensitive join values with large domains, we introduce a novel algorithm called LDPJoinSketch for sketch-based join size estimation under LDP. Additionally, to address the inherent hash-collision errors in sketches under LDP, we propose an enhanced method called LDPJoinSketch+. It utilizes a frequency-aware perturbation mechanism that effectively separates high-frequency and low-frequency items without compromising privacy. The proposed methods satisfy LDP, and the estimation error is bounded. Experimental results show that our method outperforms existing methods, effectively enhancing the accuracy of join size estimation under LDP.
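The sketch-based estimation idea in the abstract above can be illustrated with a minimal, non-private Fast-AGMS style sketch. This is not the LDPJoinSketch algorithm: the LDP perturbation step is omitted entirely, and the names `JoinSketch`, `build_sketch`, `estimate_join_size`, and `exact_join_size` are hypothetical. It only shows how two sketches that share hash functions give a join size estimate, and why hash collisions are a source of error.

```python
import random
from collections import Counter
from statistics import median

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, larger than the toy key domain


class JoinSketch:
    """Fast-AGMS style sketch (assumption: keys are non-negative integers)."""

    def __init__(self, rows=5, width=64, seed=0):
        rng = random.Random(seed)
        self.rows, self.width = rows, width
        # One (bucket hash, sign hash) parameter set per row. Simplification:
        # linear hashes are used for both; the textbook analysis asks for a
        # 4-wise independent sign hash.
        self.params = [
            tuple(rng.randrange(1, PRIME) for _ in range(4)) for _ in range(rows)
        ]
        self.table = [[0] * width for _ in range(rows)]

    def _bucket_and_sign(self, row, key):
        a, b, c, d = self.params[row]
        bucket = ((a * key + b) % PRIME) % self.width
        sign = 1 if ((c * key + d) % PRIME) % 2 == 0 else -1
        return bucket, sign

    def add(self, key):
        for r in range(self.rows):
            bucket, sign = self._bucket_and_sign(r, key)
            self.table[r][bucket] += sign


def build_sketch(keys, seed=0):
    """Both join sides must use the same seed so they share hash functions."""
    sketch = JoinSketch(seed=seed)
    for key in keys:
        sketch.add(key)
    return sketch


def estimate_join_size(sk_a, sk_b):
    """Median over rows of the inner product of matching counter rows."""
    per_row = [
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(sk_a.table, sk_b.table)
    ]
    return median(per_row)


def exact_join_size(keys_a, keys_b):
    """Ground truth: sum over keys of freq_A(key) * freq_B(key)."""
    freq_a, freq_b = Counter(keys_a), Counter(keys_b)
    return sum(n * freq_b[k] for k, n in freq_a.items())
```

When two distinct keys land in the same bucket of a row, their signed counts mix and that row's inner product drifts from the true join size. This is the hash-collision error the abstract refers to; taking the median over rows dampens it but does not remove it.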
Securing Health Data on the Blockchain: A Differential Privacy and Federated Learning Framework
Authors: Daniel Commey, Sena Hounsinou, Garth V. Crosby
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11580
Pdf link: https://arxiv.org/pdf/2405.11580
Abstract This study proposes a framework to enhance privacy in Blockchain-based Internet of Things (BIoT) systems used in the healthcare sector. The framework addresses the challenge of leveraging health data for analytics while protecting patient privacy. To achieve this, the study integrates Differential Privacy (DP) with Federated Learning (FL) to protect sensitive health data collected by IoT nodes. The proposed framework utilizes dynamic personalization and adaptive noise distribution strategies to balance privacy and data utility. Additionally, blockchain technology ensures secure and transparent aggregation and storage of model updates. Experimental results on the SVHN dataset demonstrate that the proposed framework achieves strong privacy guarantees against various attack scenarios while maintaining high accuracy in health analytics tasks. For 15 rounds of federated learning with an epsilon value of 8.0, the model obtains an accuracy of 64.50%. The blockchain integration, utilizing Ethereum, Ganache, this http URL, and IPFS, exhibits an average transaction latency of around 6 seconds and consistent gas consumption across rounds, validating the practicality and feasibility of the proposed approach.
Decentralized Privacy Preservation for Critical Connections in Graphs
Authors: Conggai Li, Wei Ni, Ming Ding, Youyang Qu, Jianjun Chen, David Smith, Wenjie Zhang, Thierry Rakotoarivelo
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2405.11713
Pdf link: https://arxiv.org/pdf/2405.11713
Abstract Many real-world interconnections among entities can be characterized as graphs. Collecting local graph information with balanced privacy and data utility has garnered notable interest recently. This paper delves into the problem of identifying and protecting critical information of entity connections for individual participants in a graph based on cohesive subgraph searches. This problem has not been addressed in the literature. To address the problem, we propose to extract the critical connections of a queried vertex using a fortress-like cohesive subgraph model known as $p$-cohesion. A user's connections within a fortress are obfuscated when being released, to protect critical information about the user. Novel merit and penalty score functions are designed to measure each participant's critical connections in the minimal $p$-cohesion, facilitating effective identification of the connections. We further propose to preserve the privacy of a vertex enquired by only protecting its critical connections when responding to queries raised by data collectors. We prove that, under the decentralized differential privacy (DDP) mechanism, one's response satisfies $(\varepsilon, \delta)$-DDP when its critical connections are protected while the rest remains unperturbed. The effectiveness of our proposed method is demonstrated through extensive experiments on real-life graph datasets.
Keyword: privacy

Private Data Leakage in Federated Human Activity Recognition for Wearable Healthcare Devices
Authors: Kongyang Chen, Dongping Zhang, Bing Mi
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.10979
Pdf link: https://arxiv.org/pdf/2405.10979
Abstract Wearable wristbands and watches can be used for health monitoring, such as determining the user's activity status based on behavior and providing reasonable exercise recommendations. Obviously, the individual data perception and local computing capabilities of a single wearable device are limited, making it difficult to train a robust user behavior recognition model. Typically, joint modeling requires the collaboration of multiple wearable devices. An appropriate research approach is Federated Human Activity Recognition (HAR), which can train a global model without uploading users' local exercise data. Nevertheless, recent studies indicate that federated learning still faces serious data security and privacy issues. To the best of our knowledge, there is no existing research on membership information leakage in Federated HAR. Therefore, our study aims to investigate the joint modeling process of multiple wearable devices for user behavior recognition, with a focus on analyzing the privacy leakage issues of wearable data. In our system, we consider a federated learning architecture consisting of $N$ wearable device users and a parameter server. The parameter server distributes the initial model to each user, who independently perceives their motion sensor data, conducts local model training, and uploads it to the server. The server aggregates these local models until convergence. In the federated learning architecture, the server may be curious and seek to obtain privacy information about relevant users from the model parameters. Hence, we consider membership inference attacks based on malicious servers, which exploit differences in model generalization across different client data. Through experimentation deployed on five publicly available HAR datasets, we demonstrate that the accuracy of malicious server membership inference reaches 92%.
Learnable Privacy Neurons Localization in Language Models
Authors: Ruizhe Chen, Tianxiang Hu, Yang Feng, Zuozhu Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.10989
Pdf link: https://arxiv.org/pdf/2405.10989
Abstract Concerns about the tendency of Large Language Models (LLMs) to memorize and disclose private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate the privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks to localize specific neurons that account for the memorization of PII in LLMs through adversarial training. Our investigations discover that PII is memorized by a small subset of neurons across all layers, which shows the property of PII specificity. Furthermore, we propose to validate the potential in PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.
"What do you want from theory alone?" Experimenting with Tight Auditing of Differentially Private Synthetic Data Generation
Authors: Meenatchi Sundaram Muthu Selva Annamalai, Georgi Ganev, Emiliano De Cristofaro
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.10994
Pdf link: https://arxiv.org/pdf/2405.10994
Abstract Differentially private synthetic data generation (DP-SDG) algorithms are used to release datasets that are structurally and statistically similar to sensitive data while providing formal bounds on the information they leak. However, bugs in algorithms and implementations may cause the actual information leakage to be higher. This prompts the need to verify whether the theoretical guarantees of state-of-the-art DP-SDG implementations also hold in practice. We do so via a rigorous auditing process: we compute the information leakage via an adversary playing a distinguishing game and running membership inference attacks (MIAs). If the leakage observed empirically is higher than the theoretical bounds, we identify a DP violation; if it is non-negligibly lower, the audit is loose. We audit six DP-SDG implementations using different datasets and threat models and find that black-box MIAs commonly used against DP-SDGs are severely limited in power, yielding remarkably loose empirical privacy estimates. We then consider MIAs in stronger threat models, i.e., passive and active white-box, using both existing and newly proposed attacks. Overall, we find that, currently, we do not only need white-box MIAs but also worst-case datasets to tightly estimate the privacy leakage from DP-SDGs. Finally, we show that our automated auditing procedure finds both known DP violations (in 4 out of the 6 implementations) as well as a new one in the DPWGAN implementation that was successfully submitted to the NIST DP Synthetic Data Challenge. The source code needed to reproduce our experiments is available from this https URL.
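The comparison between empirical and theoretical leakage described in the abstract above can be sketched with the standard hypothesis-testing bound for $(\varepsilon, \delta)$-DP, which implies $\varepsilon \ge \ln\frac{1-\delta-\mathrm{FPR}}{\mathrm{FNR}}$ (and the same with FPR and FNR swapped) for any membership inference attack. The code below is an illustration under assumptions, not the authors' auditing procedure: it uses point estimates of the error rates where a real audit would use confidence bounds, and the function names and the `slack` threshold for calling an audit "loose" are hypothetical.

```python
import math


def empirical_epsilon(fpr, fnr, delta=0.0):
    """Lower bound on epsilon implied by an attack's false positive/negative rates."""
    candidates = []
    for numerator, denominator in ((1 - delta - fpr, fnr), (1 - delta - fnr, fpr)):
        if numerator <= 0:
            continue  # this direction of the bound gives no information
        if denominator == 0:
            candidates.append(math.inf)  # a perfect attack implies unbounded leakage
        else:
            candidates.append(math.log(numerator / denominator))
    return max(0.0, max(candidates, default=0.0))


def audit_verdict(fpr, fnr, theoretical_epsilon, delta=0.0, slack=0.5):
    """Classify an audit. `slack` is a hypothetical cut-off for 'non-negligibly lower'."""
    eps_emp = empirical_epsilon(fpr, fnr, delta)
    if eps_emp > theoretical_epsilon:
        return "violation"
    if theoretical_epsilon - eps_emp > slack:
        return "loose"
    return "tight"
```

For example, an attack with FPR = FNR = 0.1 implies an empirical epsilon of ln 9 (about 2.2), which would flag a violation for an implementation claiming epsilon = 1, while an attack no better than random guessing (FPR = FNR = 0.5) yields 0 and therefore a loose audit.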
NTTSuite: Number Theoretic Transform Benchmarks for Accelerating Encrypted Computation
Authors: Juran Ding, Yuanzhe Liu, Lingbin Sun, Brandon Reagen
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2405.11353
Pdf link: https://arxiv.org/pdf/2405.11353
Abstract Privacy concerns have thrust privacy-preserving computation into the spotlight. Homomorphic encryption (HE) is a cryptographic system that enables computation to occur directly on encrypted data, providing users with strong privacy (and security) guarantees while using the same services they enjoy today unprotected. While promising, HE has seen little adoption due to extremely high computational overheads, rendering it impractical. In this paper we develop a benchmark suite, named NTTSuite, to enable researchers to better address these overheads by studying the primary source of HE's slowdown: the number theoretic transform (NTT). NTTSuite constitutes seven unique NTT algorithms with support for CPUs (C++), GPUs (CUDA), and custom hardware (Catapult HLS). In addition, we propose optimizations to improve the performance of NTT running on FPGAs. We find our implementation outperforms the state-of-the-art by 30%.
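For readers unfamiliar with the number theoretic transform named in the abstract above, the following is a minimal textbook sketch of a recursive radix-2 NTT and its use for polynomial multiplication. It is not taken from NTTSuite and is not one of its seven algorithms; the modulus 998244353 with primitive root 3 is a common choice assumed here for illustration, and the function names are hypothetical.

```python
MOD = 998_244_353  # prime of the form 119 * 2^23 + 1, so power-of-two lengths up to 2^23 work
ROOT = 3           # a primitive root modulo MOD


def _transform(values, root):
    """Recursive Cooley-Tukey butterfly; `root` is a primitive len(values)-th root of unity."""
    n = len(values)
    if n == 1:
        return list(values)
    half_root = root * root % MOD
    even = _transform(values[0::2], half_root)
    odd = _transform(values[1::2], half_root)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % MOD
        out[k] = (even[k] + t) % MOD
        out[k + n // 2] = (even[k] - t) % MOD
        w = w * root % MOD
    return out


def ntt(values):
    n = len(values)
    if n == 0 or n & (n - 1) or (MOD - 1) % n:
        raise ValueError("length must be a power of two dividing MOD - 1")
    return _transform(values, pow(ROOT, (MOD - 1) // n, MOD))


def intt(values):
    n = len(values)
    if n == 0 or n & (n - 1) or (MOD - 1) % n:
        raise ValueError("length must be a power of two dividing MOD - 1")
    inverse_root = pow(pow(ROOT, (MOD - 1) // n, MOD), MOD - 2, MOD)
    n_inverse = pow(n, MOD - 2, MOD)
    return [x * n_inverse % MOD for x in _transform(values, inverse_root)]


def multiply(a, b):
    """Polynomial product modulo MOD via pointwise multiplication in the NTT domain."""
    result_len = len(a) + len(b) - 1
    size = 1
    while size < result_len:
        size *= 2
    fa = ntt(list(a) + [0] * (size - len(a)))
    fb = ntt(list(b) + [0] * (size - len(b)))
    product = intt([x * y % MOD for x, y in zip(fa, fb)])
    return product[:result_len]
```

As a usage check, `multiply([1, 2, 3], [4, 5])` corresponds to (1 + 2x + 3x^2)(4 + 5x) and returns `[4, 13, 22, 15]`. HE libraries rely on the same transform to multiply large polynomials, which is why the NTT dominates their running time.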
Sketches-based join size estimation under local differential privacy
Authors: Meifan Zhang, Xin Liu, Lihua Yin
Subjects: Databases (cs.DB); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.11419
Pdf link: https://arxiv.org/pdf/2405.11419
Abstract Join size estimation on sensitive data poses a risk of privacy leakage. Local differential privacy (LDP) is a solution to preserve privacy while collecting sensitive data, but it introduces significant noise when dealing with sensitive join attributes that have large domains. Employing probabilistic structures such as sketches is a way to handle large domains, but it leads to hash-collision errors. To achieve accurate estimations, it is necessary to reduce both the noise error and hash-collision error. To tackle the noise error caused by protecting sensitive join values with large domains, we introduce a novel algorithm called LDPJoinSketch for sketch-based join size estimation under LDP. Additionally, to address the inherent hash-collision errors in sketches under LDP, we propose an enhanced method called LDPJoinSketch+. It utilizes a frequency-aware perturbation mechanism that effectively separates high-frequency and low-frequency items without compromising privacy. The proposed methods satisfy LDP, and the estimation error is bounded. Experimental results show that our method outperforms existing methods, effectively enhancing the accuracy of join size estimation under LDP.
A GAN-Based Data Poisoning Attack Against Federated Learning Systems and Its Countermeasure
Authors: Wei Sun, Bo Gao, Ke Xiong, Yuwei Wang, Pingyi Fan, Khaled Ben Letaief
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2405.11440
Pdf link: https://arxiv.org/pdf/2405.11440
Abstract As a distributed machine learning paradigm, federated learning (FL) is collaboratively carried out on privately owned datasets but without direct data access. Although the original intention is to allay data privacy concerns, "available but not visible" data in FL potentially brings new security threats, particularly poisoning attacks that target such "not visible" local data. Initial attempts have been made to conduct data poisoning attacks against FL systems, but cannot be fully successful due to their high chance of causing statistical anomalies. To unleash the potential for truly "invisible" attacks and build a more deterrent threat model, in this paper, a new data poisoning attack model named VagueGAN is proposed, which can generate seemingly legitimate but noisy poisoned data by untraditionally taking advantage of generative adversarial network (GAN) variants. Capable of manipulating the quality of poisoned data on demand, VagueGAN enables a trade-off between attack effectiveness and stealthiness. Furthermore, a cost-effective countermeasure named Model Consistency-Based Defense (MCD) is proposed to identify GAN-poisoned data or models after finding out the consistency of GAN outputs. Extensive experiments on multiple datasets indicate that our attack method is generally much more stealthy as well as more effective in degrading FL performance with low complexity. Our defense method is also shown to be more competent in identifying GAN-poisoned data or models. The source codes are publicly available at this https URL.
Diffusion-Based Hierarchical Image Steganography
Authors: Youmin Xu, Xuanyu Zhang, Jiwen Yu, Chong Mou, Xiandong Meng, Jian Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2405.11523
Pdf link: https://arxiv.org/pdf/2405.11523
Abstract This paper introduces Hierarchical Image Steganography, a novel method that enhances the security and capacity of embedding multiple images into a single container using diffusion models. HIS assigns varying levels of robustness to images based on their importance, ensuring enhanced protection against manipulation. It adaptively exploits the robustness of the Diffusion Model alongside the reversibility of the Flow Model. The integration of Embed-Flow and Enhance-Flow improves embedding efficiency and image recovery quality, respectively, setting HIS apart from conventional multi-image steganography techniques. This innovative structure can autonomously generate a container image, thereby securely and efficiently concealing multiple images and text. Rigorous subjective and objective evaluations underscore our advantage in analytical resistance, robustness, and capacity, illustrating its expansive applicability in content safeguarding and privacy fortification.
Overcoming Data and Model Heterogeneities in Decentralized Federated Learning via Synthetic Anchors
Authors: Chun-Yin Huang, Kartik Srinivas, Xin Zhang, Xiaoxiao Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11525
Pdf link: https://arxiv.org/pdf/2405.11525
Abstract Conventional Federated Learning (FL) involves collaborative training of a global model while maintaining user data privacy. One of its branches, decentralized FL, is a serverless network that allows clients to own and optimize different local models separately, which results in saving management and communication resources. Despite the promising advancements in decentralized FL, it may reduce model generalizability due to lacking a global model. In this scenario, managing data and model heterogeneity among clients becomes a crucial problem, which poses a unique challenge that must be overcome: How can every client's local model learn generalizable representation in a decentralized manner? To address this challenge, we propose a novel Decentralized FL technique by introducing Synthetic Anchors, dubbed as DeSA. Based on the theory of domain adaptation and Knowledge Distillation (KD), we theoretically and empirically show that synthesizing global anchors based on raw data distribution facilitates mutual knowledge transfer. We further design two effective regularization terms for local training: 1) REG loss that regularizes the distribution of the client's latent embedding with the anchors and 2) KD loss that enables clients to learn from others. Through extensive experiments on diverse client data distributions, we showcase the effectiveness of DeSA in enhancing both inter- and intra-domain accuracy of each client.
Securing Health Data on the Blockchain: A Differential Privacy and Federated Learning Framework
Authors: Daniel Commey, Sena Hounsinou, Garth V. Crosby
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11580
Pdf link: https://arxiv.org/pdf/2405.11580
Abstract This study proposes a framework to enhance privacy in Blockchain-based Internet of Things (BIoT) systems used in the healthcare sector. The framework addresses the challenge of leveraging health data for analytics while protecting patient privacy. To achieve this, the study integrates Differential Privacy (DP) with Federated Learning (FL) to protect sensitive health data collected by IoT nodes. The proposed framework utilizes dynamic personalization and adaptive noise distribution strategies to balance privacy and data utility. Additionally, blockchain technology ensures secure and transparent aggregation and storage of model updates. Experimental results on the SVHN dataset demonstrate that the proposed framework achieves strong privacy guarantees against various attack scenarios while maintaining high accuracy in health analytics tasks. For 15 rounds of federated learning with an epsilon value of 8.0, the model obtains an accuracy of 64.50%. The blockchain integration, utilizing Ethereum, Ganache, this http URL, and IPFS, exhibits an average transaction latency of around 6 seconds and consistent gas consumption across rounds, validating the practicality and feasibility of the proposed approach.
Trust, Because You Can't Verify: Privacy and Security Hurdles in Education Technology Acquisition Practices
Authors: Easton Kelso, Ananta Soneji, Sazzadur Rahaman, Yan Soshitaishvili, Rakibul Hasan
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2405.11712
Pdf link: https://arxiv.org/pdf/2405.11712
Abstract The education technology (EdTech) landscape is expanding rapidly in higher education institutes (HEIs). This growth brings enormous complexity. Protecting the extensive data collected by these tools is crucial for HEIs. Privacy incidents of data breaches and misuses can have dire security and privacy consequences on the data subjects, particularly students, who are often compelled to use these tools. This urges an in-depth understanding of HEI and EdTech vendor dynamics, which is largely understudied. To address this gap, we conduct a semi-structured interview study with 13 participants who are in the EdTech leadership roles at seven HEIs. Our study uncovers the EdTech acquisition process in the HEI context, the consideration of security and privacy issues throughout that process, the pain points of HEI personnel in establishing adequate security and privacy protection mechanisms in service contracts, and their struggle in holding vendors accountable due to a lack of visibility into their system and power-asymmetry, among other reasons. We discuss certain observations about the status quo and conclude with recommendations to improve the situation.
Decentralized Privacy Preservation for Critical Connections in Graphs
Authors: Conggai Li, Wei Ni, Ming Ding, Youyang Qu, Jianjun Chen, David Smith, Wenjie Zhang, Thierry Rakotoarivelo
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2405.11713
Pdf link: https://arxiv.org/pdf/2405.11713
Abstract Many real-world interconnections among entities can be characterized as graphs. Collecting local graph information with balanced privacy and data utility has garnered notable interest recently. This paper delves into the problem of identifying and protecting critical information of entity connections for individual participants in a graph based on cohesive subgraph searches. This problem has not been addressed in the literature. To address the problem, we propose to extract the critical connections of a queried vertex using a fortress-like cohesive subgraph model known as $p$-cohesion. A user's connections within a fortress are obfuscated when being released, to protect critical information about the user. Novel merit and penalty score functions are designed to measure each participant's critical connections in the minimal $p$-cohesion, facilitating effective identification of the connections. We further propose to preserve the privacy of a vertex enquired by only protecting its critical connections when responding to queries raised by data collectors. We prove that, under the decentralized differential privacy (DDP) mechanism, one's response satisfies $(\varepsilon, \delta)$-DDP when its critical connections are protected while the rest remains unperturbed. The effectiveness of our proposed method is demonstrated through extensive experiments on real-life graph datasets.
Fed-Credit: Robust Federated Learning with Credibility Management
Authors: Jiayan Chen, Zhirong Qian, Tianhui Meng, Xitong Gao, Tian Wang, Weijia Jia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11758
Pdf link: https://arxiv.org/pdf/2405.11758
Abstract Aiming at privacy preservation, Federated Learning (FL) is an emerging machine learning approach enabling model training on decentralized devices or data sources. The learning mechanism of FL relies on aggregating parameter updates from individual clients. However, this process may pose a potential security risk due to the presence of malicious devices. Existing solutions are either costly due to the use of compute-intensive technology, or restrictive for reasons of strong assumptions such as the prior knowledge of the number of attackers and how they attack. Few methods consider both privacy constraints and uncertain attack scenarios. In this paper, we propose a robust FL approach based on the credibility management scheme, called Fed-Credit. Unlike previous studies, our approach does not require prior knowledge of the nodes and the data distribution. It maintains and employs a credibility set, which weighs the historical clients' contributions based on the similarity between the local models and global model, to adjust the global model update. The subtlety of Fed-Credit is that the time decay and attitudinal value factor are incorporated into the dynamic adjustment of the reputation weights and it boasts a computational complexity of O(n) (n is the number of the clients). We conducted extensive experiments on the MNIST and CIFAR-10 datasets under 5 types of attacks. The results exhibit superior accuracy and resilience against adversarial attacks, all while maintaining comparatively low computational complexity. Among these, on the Non-IID CIFAR-10 dataset, our algorithm exhibited performance enhancements of 19.5% and 14.5%, respectively, in comparison to the state-of-the-art algorithm when dealing with two types of data poisoning attacks.
FedCAda: Adaptive Client-Side Optimization for Accelerated and Stable Federated Learning
Authors: Liuzhi Zhou, Yu He, Kun Zhai, Xiang Liu, Sen Liu, Xingjun Ma, Guangnan Ye, Yu-Gang Jiang, Hongfeng Chai
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2405.11811
Pdf link: https://arxiv.org/pdf/2405.11811
Abstract Federated learning (FL) has emerged as a prominent approach for collaborative training of machine learning models across distributed clients while preserving data privacy. However, the quest to balance acceleration and stability becomes a significant challenge in FL, especially on the client-side. In this paper, we introduce FedCAda, an innovative federated client adaptive algorithm designed to tackle this challenge. FedCAda leverages the Adam algorithm to adjust the correction process of the first moment estimate $m$ and the second moment estimate $v$ on the client-side and aggregate adaptive algorithm parameters on the server-side, aiming to accelerate convergence speed and communication efficiency while ensuring stability and performance. Additionally, we investigate several algorithms incorporating different adjustment functions. This comparative analysis revealed that due to the limited information contained within client models from other clients during the initial stages of federated learning, more substantial constraints need to be imposed on the parameters of the adaptive algorithm. As federated learning progresses and clients gather more global information, FedCAda gradually diminishes the impact on adaptive parameters. These findings provide insights for enhancing the robustness and efficiency of algorithmic improvements. Through extensive experiments on computer vision (CV) and natural language processing (NLP) datasets, we demonstrate that FedCAda outperforms the state-of-the-art methods in terms of adaptability, convergence, stability, and overall performance. This work contributes to adaptive algorithms for federated learning, encouraging further exploration.
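As background for the abstract above, the first moment estimate $m$, the second moment estimate $v$, and their bias correction can be seen in a single step of the standard Adam update. The sketch below shows only textbook Adam for a scalar parameter; it does not implement FedCAda's adjusted correction functions or its server-side aggregation, which the abstract does not specify, and the function name `adam_step` is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (running mean of squares)
    m_hat = m / (1 - beta1 ** t)                # bias correction: m and v start at zero,
    v_hat = v / (1 - beta2 ** t)                # so early estimates are scaled back up
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

At the first step the corrected estimates equal the gradient and its square, so the update has magnitude close to `lr` regardless of the gradient scale. The correction terms `1 - beta1 ** t` and `1 - beta2 ** t` are the standard place where a client-side method could adjust how strongly early steps are constrained.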
Federated Learning with Incomplete Sensing Modalities
Authors: Adiba Orzikulova, Jaehyun Kwak, Jaemin Shin, Sung-Ju Lee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11828
Pdf link: https://arxiv.org/pdf/2405.11828
Abstract Many mobile sensing applications utilize data from various modalities, including motion and physiological sensors in mobile and wearable devices. Federated Learning (FL) is particularly suitable for these applications thanks to its privacy-preserving feature. However, challenges such as limited battery life, poor network conditions, and sensor malfunctions can restrict the use of all available modalities for local model training. Additionally, existing multimodal FL systems also struggle with scalability and efficiency as the number of modality sources increases. To address these issues, we introduce FLISM, a framework designed to enable multimodal FL with incomplete modalities. FLISM leverages a simulation technique to learn robust representations that can handle missing modalities and transfers model knowledge across clients with varying sets of modalities. The evaluation results using three real-world datasets and simulations demonstrate FLISM's effective balance between model performance and system efficiency. It shows an average improvement of 0.067 in F1-score, while also reducing communication (2.69x faster) and computational (2.28x more efficient) overheads compared to existing methods addressing incomplete modalities. Moreover, in simulated scenarios involving tasks with a larger number of modalities, FLISM achieves a significant speedup of 3.23x~85.10x in communication and 3.73x~32.29x in computational efficiency.
Information Leakage from Embedding in Large Language Models
Authors: Zhipeng Wang, Anda Cheng, Yinggui Wang, Lei Wang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.11916
Pdf link: https://arxiv.org/pdf/2405.11916
Abstract The widespread adoption of large language models (LLMs) has raised concerns regarding data privacy. This study aims to investigate the potential for privacy invasion through input reconstruction attacks, in which a malicious model provider could potentially recover user inputs from embeddings. We first propose two base methods to reconstruct original texts from a model's hidden states. We find that these two methods are effective in attacking the embeddings from shallow layers, but their effectiveness decreases when attacking embeddings from deeper layers. To address this issue, we then present Embed Parrot, a Transformer-based method, to reconstruct input from embeddings in deep layers. Our analysis reveals that Embed Parrot effectively reconstructs original inputs from the hidden states of ChatGLM-6B and Llama2-7B, showcasing stable performance across various token lengths and data distributions. To mitigate the risk of privacy breaches, we introduce a defense mechanism to deter exploitation of the embedding reconstruction process. Our findings emphasize the importance of safeguarding user privacy in distributed learning systems and contribute valuable insights to enhance the security protocols within such environments.
Data Augmentation for Text-based Person Retrieval Using Large Language Models
Authors: Zheng Li, Lijia Si, Caili Guo, Yang Yang, Qiushi Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2405.11971
Pdf link: https://arxiv.org/pdf/2405.11971
Abstract Text-based Person Retrieval (TPR) aims to retrieve person images that match the description given a text query. The performance improvement of the TPR model relies on high-quality data for supervised training. However, it is difficult to construct a large-scale, high-quality TPR dataset due to expensive annotation and privacy protection. Recently, Large Language Models (LLMs) have approached or even surpassed human performance on many NLP tasks, creating the possibility to expand high-quality TPR datasets. This paper proposes an LLM-based Data Augmentation (LLM-DA) method for TPR. LLM-DA uses LLMs to rewrite the text in the current TPR dataset, achieving high-quality expansion of the dataset concisely and efficiently. These rewritten texts are able to increase the diversity of vocabulary and sentence structure while retaining the original key concepts and semantic information. In order to alleviate the hallucinations of LLMs, LLM-DA introduces a Text Faithfulness Filter (TFF) to filter out unfaithful rewritten text. To balance the contributions of original text and augmented text, a Balanced Sampling Strategy (BSS) is proposed to control the proportion of original text and augmented text used for training. LLM-DA is a plug-and-play method that can be easily integrated into various TPR models. Comprehensive experiments on three TPR benchmarks show that LLM-DA can improve the retrieval performance of current TPR models.
A Stochastic Sampling Approach to Privacy
Authors: Chuanghong Weng, Ehsan Nekouei
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2405.11975
Pdf link: https://arxiv.org/pdf/2405.11975
Abstract This paper proposes an optimal stochastic sampling approach to privacy, in which a sensor observes a process that is correlated with private information. In our set-up, a sampler decides whether to keep or discard the sensor's observations. The kept samples are shared with an adversary who might attempt to infer the private process based on the sampler's output. The privacy leakage is captured by the mutual information between the private process and the sampler's output. We cast the optimal sampling design as an optimization problem with two objectives: (i) minimizing the reconstruction error of the observed process using the sampler's output, and (ii) reducing the privacy leakage. We first show that the optimal reconstruction policy is deterministic and can be obtained by solving a one-step optimization problem at each time step. We also derive the optimality equations of the privacy-sampler for a general class of processes via the dynamic decomposition method, and show that the sampler controls the adversary's belief about the private input. In addition, we propose a simplified design for linear Gaussian processes by restricting the sampling policy to a special collection. We show that the optimal reconstruction of the system state and the private process is similar to a Kalman filter in the linear Gaussian case, and that the objective of the sampler design problem can be expressed analytically in terms of a conditional mean and covariance matrix. Furthermore, we develop a numerical algorithm to optimize the sampling and reconstruction policies, wherein the policy gradient theorem for the optimal sampling design is derived based on the implicit function theorem. Finally, we verify our design and show its capabilities in state reconstruction, privacy protection and data size reduction via simulations.
Attribute-Based Authentication in Secure Group Messaging for Distributed Environments
Authors: David Soler (1), Carlos Dafonte (1), Manuel Fernández-Veiga (2), Ana Fernández Vilas (2), Francisco J. Nóvoa (1) ((1) CITIC, Universidade da Coruña, A Coruña, Spain, (2) atlanTTic, Universidade de Vigo, Vigo, Spain)
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.12042
Pdf link: https://arxiv.org/pdf/2405.12042
Abstract Messaging Layer Security (MLS) and its underlying Continuous Group Key Agreement (CGKA) protocol allow a group of users to share a cryptographic secret in a dynamic manner, such that the secret is modified upon member insertions and deletions. Although this flexibility makes MLS ideal for implementations in distributed environments, a number of issues need to be overcome. In particular, the use of digital certificates for authentication in a group goes against the group members' privacy. In this work we provide an alternative method of authentication in which the solicitors, instead of revealing their identity, only need to prove possession of certain attributes, dynamically defined by the group, to become a member. Instead of digital certificates, we employ Attribute-Based Credentials accompanied by Selective Disclosure in order to reveal the minimum required amount of information and to prevent attackers from linking the activity of a user across multiple groups. We formally define a CGKA variant named Attribute-Authenticated Continuous Group Key Agreement (AA-CGKA) and provide security proofs for its properties of Requirement Integrity, Unforgeability and Unlinkability. We also provide guidelines for integrating our construction into MLS.
Keyword: machine learning

Predictive Energy Management for Battery Electric Vehicles with Hybrid Models
Authors: Yu-Wen Huang, Christian Prehofer, William Lindskog, Ron Puts, Pietro Mosca, Göran Kauermann
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2405.10984
Pdf link: https://arxiv.org/pdf/2405.10984
Abstract This paper addresses the problem of predicting the energy consumption for the drivers of battery electric vehicles (BEVs). Besides the vehicle and powertrain dynamics, several external factors (e.g., weather) are shown to have a large impact on the energy consumption of a vehicle. It is therefore challenging to take all of those influencing variables into consideration. The proposed approach is based on a hybrid model that improves the prediction accuracy of the energy consumption of BEVs. The novelty of this approach is to combine a physics-based simulation model, which captures the basic vehicle and powertrain dynamics, with a data-driven model. The latter accounts for other external influencing factors neglected by the physical simulation model, using machine learning techniques such as generalized additive mixed models, random forests and boosting. The hybrid modeling method is evaluated with a real data set from TUM, and the hybrid models are shown to decrease the average prediction error from 40% for the pure physics model to 10%.
GraSS: Combining Graph Neural Networks with Expert Knowledge for SAT Solver Selection
Authors: Zhanguang Zhang, Didier Chetelat, Joseph Cotnareanu, Amur Ghose, Wenyi Xiao, Hui-Ling Zhen, Yingxue Zhang, Jianye Hao, Mark Coates, Mingxuan Yuan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11024
Pdf link: https://arxiv.org/pdf/2405.11024
Abstract Boolean satisfiability (SAT) problems are routinely solved by SAT solvers in real-life applications, yet solving time can vary drastically between solvers for the same instance. This has motivated research into machine learning models that can predict, for a given SAT instance, which solver to select among several options. Existing SAT solver selection methods all rely on some hand-picked instance features, which are costly to compute and ignore the structural information in SAT graphs. In this paper we present GraSS, a novel approach for automatic SAT solver selection based on tripartite graph representations of instances and a heterogeneous graph neural network (GNN) model. While GNNs have been previously adopted in other SAT-related tasks, they do not incorporate any domain-specific knowledge and ignore the runtime variation introduced by different clause orders. We enrich the graph representation with domain-specific decisions, such as novel node feature design, positional encodings for clauses in the graph, a GNN architecture tailored to our tripartite graphs and a runtime-sensitive loss function. Through extensive experiments, we demonstrate that this combination of raw representations and domain-specific choices leads to improvements in runtime for a pool of seven state-of-the-art solvers on both an industrial circuit design benchmark, and on instances from the 20-year Anniversary Track of the 2022 SAT Competition.
Safety in Graph Machine Learning: Threats and Safeguards
Authors: Song Wang, Yushun Dong, Binchi Zhang, Zihan Chen, Xingbo Fu, Yinhan He, Cong Shen, Chuxu Zhang, Nitesh V. Chawla, Jundong Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11034
Pdf link: https://arxiv.org/pdf/2405.11034
Abstract Graph Machine Learning (Graph ML) has witnessed substantial advancements in recent years. With their remarkable ability to process graph-structured data, Graph ML techniques have been extensively utilized across diverse applications, including critical domains like finance, healthcare, and transportation. Despite their societal benefits, recent research highlights significant safety concerns associated with the widespread use of Graph ML models. Lacking safety-focused designs, these models can produce unreliable predictions, demonstrate poor generalizability, and compromise data confidentiality. In high-stakes scenarios such as financial fraud detection, these vulnerabilities could jeopardize both individuals and society at large. Therefore, it is imperative to prioritize the development of safety-oriented Graph ML models to mitigate these risks and enhance public confidence in their applications. In this survey paper, we explore three critical aspects vital for enhancing safety in Graph ML: reliability, generalizability, and confidentiality. We categorize and analyze threats to each aspect under three headings: model threats, data threats, and attack threats. This novel taxonomy guides our review of effective strategies to protect against these threats. Our systematic review lays a groundwork for future research aimed at developing practical, safety-centered Graph ML models. Furthermore, we highlight the significance of safe Graph ML practices and suggest promising avenues for further investigation in this crucial area.
A Comparative Study of Garment Draping Techniques
Authors: Prerana Achar, Mayank Patel, Anushka Mulik, Neha Katre, Stevina Dias, Chirag Raman
Subjects: Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11056
Pdf link: https://arxiv.org/pdf/2405.11056
Abstract We present a comparative review that evaluates popular techniques for garment draping in 3D fashion design, virtual try-ons, and animations. A comparative study is performed between various methods for draping clothing over the human body. These include numerous models, such as physics-based and machine-learning-based techniques, collision handling, and more. Performance evaluations and trade-offs are discussed to support informed decision-making when choosing the most appropriate approach. These methods aim to accurately represent deformations and fine wrinkles of digital garments, taking data requirements and efficiency into account, to produce realistic results. The research can be insightful to researchers, designers, and developers in visualizing dynamic multi-layered 3D clothing.
Dynamic Embeddings with Task-Oriented prompting
Authors: Allmin Balloccu, Jack Zhang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11117
Pdf link: https://arxiv.org/pdf/2405.11117
Abstract This paper introduces Dynamic Embeddings with Task-Oriented prompting (DETOT), a novel approach aimed at improving the adaptability and efficiency of machine learning models by implementing a flexible embedding layer. Unlike traditional static embeddings [14], DETOT dynamically adjusts embeddings based on task-specific requirements and performance feedback, optimizing input data representation for individual tasks [4]. This method enhances both accuracy and computational performance by tailoring the representation layer to meet the unique needs of each task. The structure of DETOT is detailed, highlighting its task-specific adaptation, continuous feedback loop, and mechanisms for preventing overfitting. Empirical evaluations demonstrate its superiority over existing methods.
Enhancing Automata Learning with Statistical Machine Learning: A Network Security Case Study
Authors: Negin Ayoughi, Shiva Nejati, Mehrdad Sabetzadeh, Patricio Saavedra
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2405.11141
Pdf link: https://arxiv.org/pdf/2405.11141
Abstract Intrusion detection systems are crucial for network security. Verification of these systems is complicated by various factors, including the heterogeneity of network platforms and the continuously changing landscape of cyber threats. In this paper, we use automata learning to derive state machines from network-traffic data with the objective of supporting behavioural verification of intrusion detection systems. The most innovative aspect of our work is addressing the inability to directly apply existing automata learning techniques to network-traffic data due to the numeric nature of such data. Specifically, we use interpretable machine learning (ML) to partition numeric ranges into intervals that strongly correlate with a system's decisions regarding intrusion detection. These intervals are subsequently used to abstract numeric ranges before automata learning. We apply our ML-enhanced automata learning approach to a commercial network intrusion detection system developed by our industry partner, RabbitRun Technologies. Our approach results in an average 67.5% reduction in the number of states and transitions of the learned state machines, while achieving an average 28% improvement in accuracy compared to using expertise-based numeric data abstraction. Furthermore, the resulting state machines help practitioners in verifying system-level security requirements and exploring previously unknown system behaviours through model checking and temporal query checking. We make our implementation and experimental data available online.
Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines
Authors: Chaokun Chang, Eric Lo, Chunxiao Ye
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11191
Pdf link: https://arxiv.org/pdf/2405.11191
Abstract Machine learning inference pipelines commonly encountered in data science and industries often require real-time responsiveness due to their user-facing nature. However, meeting this requirement becomes particularly challenging when certain input features require aggregating a large volume of data online. Recent literature on interpretable machine learning reveals that most machine learning models exhibit a notable degree of resilience to variations in input. This suggests that machine learning models can effectively accommodate approximate input features with minimal discernible impact on accuracy. In this paper, we introduce Biathlon, a novel ML serving system that leverages the inherent resilience of models and determines the optimal degree of approximation for each aggregation feature. This approach enables maximum speedup while ensuring a guaranteed bound on accuracy loss. We evaluate Biathlon on real pipelines from both industry applications and data science competitions, demonstrating its ability to meet real-time latency requirements by achieving 5.3x to 16.6x speedup with almost no accuracy loss.
Trustworthy Actionable Perturbations
Authors: Jesse Friedbaum, Sudarshan Adiga, Ravi Tandon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2405.11195
Pdf link: https://arxiv.org/pdf/2405.11195
Abstract Counterfactuals, or modified inputs that lead to a different outcome, are an important tool for understanding the logic used by machine learning classifiers and how to change an undesirable classification. Even if a counterfactual changes a classifier's decision, however, it may not affect the true underlying class probabilities, i.e. the counterfactual may act like an adversarial attack and "fool" the classifier. We propose a new framework for creating modified inputs that change the true underlying probabilities in a beneficial way, which we call Trustworthy Actionable Perturbations (TAP). This includes a novel verification procedure to ensure that TAP change the true class probabilities instead of acting adversarially. Our framework also includes new cost, reward, and goal definitions that are better suited to effectuating change in the real world. We present PAC-learnability results for our verification procedure and theoretically analyze our new method for measuring reward. We also develop a methodology for creating TAP and compare our results to those achieved by previous counterfactual methods.
OTLP: Output Thresholding Using Mixed Integer Linear Programming
Authors: Baran Koseoglu, Luca Traverso, Mohammed Topiwalla, Egor Kraev, Zoltan Szopory
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11230
Pdf link: https://arxiv.org/pdf/2405.11230
Abstract Output thresholding is the technique of searching for the best threshold to use during inference for any classifier that can produce probability estimates on training and test datasets. It is particularly useful in highly imbalanced classification problems, where the default threshold does not account for the imbalance in class distributions and fails to give the best performance. This paper proposes OTLP, a thresholding framework using mixed integer linear programming that is model agnostic and can support different objective functions and different sets of constraints for a diverse set of problems, including both balanced and imbalanced classification problems. It is particularly useful in real-world applications where theoretical thresholding techniques cannot address the product-related requirements and complexity of the applications that utilize machine learning models. We evaluate the usefulness of the framework on the Credit Card Fraud Detection dataset.
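To make the idea of output thresholding concrete, here is a minimal sketch that picks the threshold maximizing F1 by a brute-force sweep over the observed scores. This is not OTLP's mixed integer linear programming formulation, which the abstract does not spell out; the function names, the F1 objective, and the toy scores and labels are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def f1_at_threshold(probs: List[float], labels: List[int], threshold: float) -> float:
    """F1 score when every example with prob >= threshold is predicted positive."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs: List[float], labels: List[int]) -> Tuple[float, float]:
    """Sweep every distinct score as a candidate threshold (illustrative, not MILP)."""
    best_t, best_f1 = 0.5, f1_at_threshold(probs, labels, 0.5)
    for t in sorted(set(probs)):
        score = f1_at_threshold(probs, labels, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy imbalanced-style example: positives receive low scores, so 0.5 finds none.
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40]
labels = [0, 0, 0, 1, 0, 1]
threshold, f1 = best_threshold(probs, labels)
```

On this toy data the default threshold of 0.5 predicts no positives at all, while the sweep selects a lower cut-off; a constrained formulation such as the one the abstract describes would additionally let product requirements enter as constraints.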
Strided Difference Bound Matrices
Authors: Arjun Pitchanathan, Albert Cohen, Oleksandr Zinenko, Tobias Grosser
Subjects: Symbolic Computation (cs.SC); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2405.11244
Pdf link: https://arxiv.org/pdf/2405.11244
Abstract A wide range of symbolic analysis and optimization problems can be formalized using polyhedra. Sub-classes of polyhedra, also known as sub-polyhedral domains, are sought for their lower space and time complexity. We introduce the Strided Difference Bound Matrix (SDBM) domain, which represents a sweet spot in the context of optimizing compilers. Its expressiveness and efficient algorithms are particularly well suited to the construction of machine learning compilers. We present decision algorithms, abstract domain operators and computational complexity proofs for SDBM. We also conduct an empirical study with the MLIR compiler framework to validate the domain's practical applicability. We characterize a sub-class of SDBMs that frequently occurs in practice, and demonstrate even faster algorithms on this sub-class.
Few-Shot API Attack Anomaly Detection in a Classification-by-Retrieval Framework
Authors: Udi Aharon, Ran Dubin, Amit Dvir, Chen Hajaj
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2405.11247
Pdf link: https://arxiv.org/pdf/2405.11247
Abstract Application Programming Interface (API) attacks refer to the unauthorized or malicious use of APIs, which are often exploited to gain access to sensitive data or manipulate online systems for illicit purposes. Identifying actors that deceitfully utilize an API poses a demanding problem. Although there have been notable advancements and contributions in the field of API security, there still remains a significant challenge when dealing with attackers who use novel approaches that don't match the well-known payloads commonly seen in attacks. Also, attackers may exploit standard functionalities in unconventional manners and with objectives surpassing their intended boundaries. This means API security needs to be more sophisticated and dynamic than ever, with advanced computational intelligence methods, such as machine learning models that can quickly identify and respond to anomalous behavior. In response to these challenges, we propose a novel few-shot anomaly detection framework, named FT-ANN. This framework is composed of two parts: First, we train a dedicated generic language model for API based on FastText embedding. Next, we use Approximate Nearest Neighbor search in a classification-by-retrieval approach. Our framework enables the development of a lightweight model that can be trained with minimal examples per class or even a model capable of classifying multiple classes. The results show that our framework effectively improves API attack detection accuracy compared to various baselines.
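The classification-by-retrieval step can be sketched as follows: a query embedding is assigned the label of its most similar stored example. This sketch uses exact brute-force cosine search rather than an Approximate Nearest Neighbor index, and hand-written three-dimensional vectors rather than FastText embeddings; the labels and vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_by_retrieval(
    query: List[float], index: List[Tuple[List[float], str]]
) -> Tuple[str, float]:
    """Return the label of the stored example most similar to the query."""
    best_label, best_sim = "", -2.0  # cosine similarity is always >= -1
    for vector, label in index:
        sim = cosine(query, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Hypothetical few-shot index: a handful of labeled embeddings per class.
index = [
    ([1.0, 0.0, 0.1], "benign"),
    ([0.9, 0.1, 0.0], "benign"),
    ([0.0, 1.0, 0.2], "sql_injection"),
    ([0.1, 0.0, 1.0], "path_traversal"),
]
label, similarity = classify_by_retrieval([0.05, 0.9, 0.3], index)
```

Adding a new class in this scheme only requires appending labeled vectors to the index, which is one reason retrieval-based classifiers suit few-shot settings.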
PlantTracing: Tracing Arabidopsis Thaliana Apex with CenterTrack
Authors: Yuanzhe Liu, Yixiang Mao, Yao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2405.11351
Pdf link: https://arxiv.org/pdf/2405.11351
Abstract This work applies an encoder-decoder-based machine learning network to detect and track the motion and growth of the flowering stem apex of Arabidopsis thaliana. Building on CenterTrack, a machine learning back-end network, we trained a model on ten labeled time-lapse videos and tested it on three videos.
An Opportunistically Parallel Lambda Calculus for Performant Composition of Large Language Models
Authors: Stephen Mell, Steve Zdancewic, Osbert Bastani
Subjects: Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2405.11361
Pdf link: https://arxiv.org/pdf/2405.11361
Abstract Large language models (LLMs) have shown impressive results on a wide range of tasks. However, they have limitations, such as hallucinating facts and struggling with arithmetic. Recent work has addressed these issues with sophisticated decoding techniques. However, performant decoding, particularly for sophisticated techniques, relies crucially on parallelization and batching, which are difficult for developers. We make two observations: 1) existing approaches are high-level domain-specific languages for gluing expensive black-box calls, but are not general or compositional; 2) LLM programs are essentially pure (all effects commute). Guided by these observations, we develop a novel, general-purpose lambda calculus for automatically parallelizing a wide range of LLM interactions, without user intervention. The key difference versus standard lambda calculus is a novel "opportunistic" evaluation strategy, which steps independent parts of a program in parallel, dispatching black-box external calls as eagerly as possible, even while data-independent parts of the program are waiting for their own external calls to return. To maintain the simplicity of the language and to ensure uniformity of opportunistic evaluation, control-flow and looping constructs are implemented in-language, via Church encodings. We implement this approach in a framework called EPIC, which is embedded in, and interoperates closely with, Python. We demonstrate its versatility and performance with three case studies drawn from the machine learning literature: Tree-of-Thoughts (LLMs embedded in classic search procedures), nested tool use, and constrained decoding. Our experiments show that opportunistic evaluation offers a $1.5\times$ to $4.8\times$ speedup over sequential evaluation, while still allowing practitioners to write straightforward and composable programs, without any manual parallelism or batching.
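The benefit of dispatching independent external calls eagerly can be illustrated with plain asyncio. This is not EPIC and does not use its API, which the abstract does not describe; `fake_llm_call` is a hypothetical stand-in for an expensive black-box call, and the parallelism here is written by hand, which is precisely the manual step the abstract says opportunistic evaluation removes.

```python
import asyncio
from typing import List


async def fake_llm_call(name: str, log: List[str], delay: float = 0.01) -> str:
    """Hypothetical stand-in for an expensive black-box call; logs start and end."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)
    log.append(f"end {name}")
    return f"answer({name})"


async def sequential(log: List[str]) -> str:
    # The second call is not dispatched until the first has returned.
    a = await fake_llm_call("a", log)
    b = await fake_llm_call("b", log)
    return a + " | " + b


async def eager(log: List[str]) -> str:
    # The two calls are data-independent, so both are dispatched before waiting.
    a, b = await asyncio.gather(fake_llm_call("a", log), fake_llm_call("b", log))
    return a + " | " + b


seq_log: List[str] = []
eager_log: List[str] = []
seq_result = asyncio.run(sequential(seq_log))
eager_result = asyncio.run(eager(eager_log))
```

Both versions produce the same result, but in the eager log both calls start before either finishes, so total latency is close to that of one call rather than the sum of two.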
Preparing for Black Swans: The Antifragility Imperative for Machine Learning
Authors: Ming Jin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2405.11397
Pdf link: https://arxiv.org/pdf/2405.11397
Abstract Operating safely and reliably despite continual distribution shifts is vital for high-stakes machine learning applications. This paper builds upon the transformative concept of "antifragility" introduced by Taleb (2014) as a constructive design paradigm to not just withstand but benefit from volatility. We formally define antifragility in the context of online decision making as dynamic regret's strictly concave response to environmental variability, revealing limitations of current approaches focused on resisting rather than benefiting from nonstationarity. Our contribution lies in proposing potential computational pathways for engineering antifragility, grounding the concept in online learning theory and drawing connections to recent advancements in areas such as meta-learning, safe exploration, continual learning, multi-objective/quality-diversity optimization, and foundation models. By identifying promising mechanisms and future research directions, we aim to put antifragility on a rigorous theoretical foundation in machine learning. We further emphasize the need for clear guidelines, risk assessment frameworks, and interdisciplinary collaboration to ensure responsible application.
Review of deep learning models for crypto price prediction: implementation and evaluation
Authors: Jingyang Wu, Xinyi Zhang, Fangyixuan Huang, Haochen Zhou, Rohtiash Chandra
Subjects: Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2405.11431
Pdf link: https://arxiv.org/pdf/2405.11431
Abstract There has been much interest in accurate cryptocurrency price forecast models among investors and researchers. Deep learning models are prominent machine learning techniques that have transformed various fields and have shown potential for finance and economics. Although various deep learning models have been explored for cryptocurrency price forecasting, it is not clear which models are suitable due to high market volatility. In this study, we review the literature on deep learning for cryptocurrency price forecasting and evaluate novel deep learning models for cryptocurrency stock price prediction. Our deep learning models include variants of long short-term memory (LSTM) recurrent neural networks, variants of convolutional neural networks (CNNs), and the Transformer model. We evaluate univariate and multivariate approaches for multi-step-ahead prediction of cryptocurrency close prices. Our results show that the univariate LSTM model variants perform best for cryptocurrency predictions. We also carry out volatility analysis on the four cryptocurrencies, which reveals significant fluctuations in their prices throughout the COVID-19 pandemic. Additionally, we investigate the prediction accuracy of two scenarios identified by different training sets for the models. First, we use the pre-COVID-19 datasets to model cryptocurrency close-price forecasting during the early period of COVID-19. Second, we use data from the COVID-19 period to predict prices for 2023 to 2024.
A GAN-Based Data Poisoning Attack Against Federated Learning Systems and Its Countermeasure
Authors: Wei Sun, Bo Gao, Ke Xiong, Yuwei Wang, Pingyi Fan, Khaled Ben Letaief
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2405.11440
Pdf link: https://arxiv.org/pdf/2405.11440
Abstract As a distributed machine learning paradigm, federated learning (FL) is collaboratively carried out on privately owned datasets but without direct data access. Although the original intention is to allay data privacy concerns, "available but not visible" data in FL potentially brings new security threats, particularly poisoning attacks that target such "not visible" local data. Initial attempts have been made to conduct data poisoning attacks against FL systems, but they cannot be fully successful due to their high chance of causing statistical anomalies. To unleash the potential for truly "invisible" attacks and build a more deterrent threat model, this paper proposes a new data poisoning attack model named VagueGAN, which can generate seemingly legitimate but noisy poisoned data by taking advantage of generative adversarial network (GAN) variants in an unconventional way. Capable of manipulating the quality of poisoned data on demand, VagueGAN enables a trade-off between attack effectiveness and stealthiness. Furthermore, a cost-effective countermeasure named Model Consistency-Based Defense (MCD) is proposed to identify GAN-poisoned data or models after finding out the consistency of GAN outputs. Extensive experiments on multiple datasets indicate that our attack method is generally much more stealthy as well as more effective in degrading FL performance with low complexity. Our defense method is also shown to be more competent in identifying GAN-poisoned data or models. The source code is publicly available at this https URL.
NetMamba: Efficient Network Traffic Classification via Pre-training Unidirectional Mamba
Authors: Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Youjian Zhao, Yong Cui
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2405.11449
Pdf link: https://arxiv.org/pdf/2405.11449
Abstract Network traffic classification is a crucial research area aiming to enhance service quality, streamline network management, and bolster cybersecurity. To address the growing complexity of transmission encryption techniques, various machine learning and deep learning methods have been proposed. However, existing approaches encounter two main challenges. Firstly, they struggle with model inefficiency due to the quadratic complexity of the widely used Transformer architecture. Secondly, they suffer from unreliable traffic representation because of discarding important byte information while retaining unwanted biases. To address these challenges, we propose NetMamba, an efficient linear-time state space model equipped with a comprehensive traffic representation scheme. We replace the Transformer with our specially selected and improved Mamba architecture for the networking field to address efficiency issues. In addition, we design a scheme for traffic representation, which is used to extract valid information from massive traffic while removing biased information. Evaluation experiments on six public datasets encompassing three main classification tasks showcase NetMamba's superior classification performance compared to state-of-the-art baselines. It achieves up to 4.83% higher accuracy and 4.64% higher F1 score on encrypted traffic classification tasks. Additionally, NetMamba demonstrates excellent efficiency, improving inference speed by 2.24 times while maintaining comparably low memory usage. Furthermore, NetMamba exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. To the best of our knowledge, NetMamba is the first model to tailor the Mamba architecture for networking.
Error Analysis of Three-Layer Neural Network Trained with PGD for Deep Ritz Method
Authors: Yuling Jiao, Yanming Lai, Yang Wang
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2405.11451
Pdf link: https://arxiv.org/pdf/2405.11451
Abstract Machine learning is a rapidly advancing field with diverse applications across various domains. One prominent area of research is the utilization of deep learning techniques for solving partial differential equations (PDEs). In this work, we specifically focus on employing a three-layer tanh neural network within the framework of the deep Ritz method (DRM) to solve second-order elliptic equations with three different types of boundary conditions. We perform projected gradient descent (PGD) to train the three-layer network and we establish its global convergence. To the best of our knowledge, we are the first to provide a comprehensive error analysis of using overparameterized networks to solve PDE problems, as our analysis simultaneously includes estimates for approximation error, generalization error, and optimization error. We present an error bound in terms of the sample size $n$, and our work provides guidance on how to set the network depth, width, step size, and number of iterations for the projected gradient descent algorithm. Importantly, our assumptions in this work are classical and we do not require any additional assumptions on the solution of the equation. This ensures the broad applicability and generality of our results.
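For readers unfamiliar with projected gradient descent, the following toy sketch shows the basic iteration: take a gradient step, then project back onto the feasible set. It minimizes a two-dimensional quadratic over the Euclidean unit ball and is not the paper's training procedure for a three-layer network; the objective, constraint set, step size, and iteration count are assumptions chosen so that the result is easy to verify by hand.

```python
import math
from typing import Callable, List


def project_onto_ball(x: List[float], radius: float) -> List[float]:
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def projected_gradient_descent(
    grad: Callable[[List[float]], List[float]],
    x0: List[float],
    radius: float,
    step: float,
    iters: int,
) -> List[float]:
    """Repeat: gradient step, then projection back onto the feasible ball."""
    x = project_onto_ball(x0, radius)
    for _ in range(iters):
        g = grad(x)
        x = project_onto_ball([xi - step * gi for xi, gi in zip(x, g)], radius)
    return x


# Toy objective f(x) = ||x - target||^2, whose unconstrained minimizer lies
# outside the unit ball, so the constraint is active at the solution.
target = [2.0, 2.0]


def grad_f(x: List[float]) -> List[float]:
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


x_star = projected_gradient_descent(grad_f, [0.0, 0.0], radius=1.0, step=0.1, iters=200)
```

The constrained minimizer is the point of the unit ball closest to the target, namely $(1/\sqrt{2}, 1/\sqrt{2})$, which the iteration reaches after a few steps.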
Comparisons Are All You Need for Optimizing Smooth Functions
Authors: Chenyi Zhang, Tongyang Li
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2405.11454
Pdf link: https://arxiv.org/pdf/2405.11454
Abstract When optimizing machine learning models, there are various scenarios where gradient computations are challenging or even infeasible. Furthermore, in reinforcement learning (RL), preference-based RL that only compares between options has wide applications, including reinforcement learning with human feedback in large language models. In this paper, we systematically study optimization of a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$ only assuming an oracle that compares function values at two points and tells which is larger. When $f$ is convex, we give two algorithms using $\tilde{O}(n/\epsilon)$ and $\tilde{O}(n^{2})$ comparison queries to find an $\epsilon$-optimal solution, respectively. When $f$ is nonconvex, our algorithm uses $\tilde{O}(n/\epsilon^2)$ comparison queries to find an $\epsilon$-approximate stationary point. All these results match the best-known zeroth-order algorithms with function evaluation queries in $n$ dependence, thus suggesting that comparisons are all you need for optimizing smooth functions using derivative-free methods. In addition, we also give an algorithm for escaping saddle points and reaching an $\epsilon$-second order stationary point of a nonconvex $f$, using $\tilde{O}(n^{1.5}/\epsilon^{2.5})$ comparison queries.
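To make the comparison-oracle setting concrete, here is a small sketch in which the optimizer never sees function values, only the answer to "is f(x) smaller than f(y)?". The optimizer shown is a plain coordinate pattern search chosen for brevity; it is not one of the paper's algorithms and carries none of their query-complexity guarantees. The names `make_comparison_oracle` and `comparison_search` and the test function are hypothetical.

```python
from typing import Callable, List

Vector = List[float]
Oracle = Callable[[Vector, Vector], bool]


def make_comparison_oracle(f: Callable[[Vector], float]) -> Oracle:
    """Wrap f so that callers only learn which of two points has the smaller value."""
    def is_smaller(x: Vector, y: Vector) -> bool:
        return f(x) < f(y)
    return is_smaller


def comparison_search(
    is_smaller: Oracle,
    x0: Vector,
    step: float = 1.0,
    tol: float = 1e-6,
    max_rounds: int = 10_000,
) -> Vector:
    """Coordinate pattern search driven only by pairwise comparisons."""
    x = list(x0)
    for _ in range(max_rounds):
        if step < tol:
            break
        moved = False
        for i in range(len(x)):
            for delta in (step, -step):
                candidate = list(x)
                candidate[i] += delta
                # Accept the move only if the oracle says it is strictly better.
                if is_smaller(candidate, x):
                    x = candidate
                    moved = True
                    break
        if not moved:
            # No probe improved on x at this scale, so refine the step.
            step /= 2.0
    return x


# Hypothetical smooth convex test function with minimizer (1.3, -2.7).
def quadratic(x: Vector) -> float:
    return (x[0] - 1.3) ** 2 + (x[1] + 2.7) ** 2


oracle = make_comparison_oracle(quadratic)
minimizer = comparison_search(oracle, [0.0, 0.0])
print(minimizer)
```

Each probe costs one comparison query, so counting calls to `oracle` gives the quantity that the bounds in the abstract are stated in.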
Machine Learning & Wi-Fi: Unveiling the Path Towards AI/ML-Native IEEE 802.11 Networks
Authors: Francesc Wilhelmi, Szymon Szott, Katarzyna Kosek-Szott, Boris Bellalta
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11504
Pdf link: https://arxiv.org/pdf/2405.11504
Abstract Artificial intelligence (AI) and machine learning (ML) are nowadays mature technologies considered essential for driving the evolution of future communications systems. Simultaneously, Wi-Fi technology has constantly evolved over the past three decades and incorporated new features generation after generation, thus gaining in complexity. As such, researchers have observed that AI/ML functionalities may be required to address the upcoming Wi-Fi challenges that will be otherwise difficult to solve with traditional approaches. This paper discusses the role of AI/ML in current and future Wi-Fi networks and depicts the ways forward. A roadmap towards AI/ML-native Wi-Fi, key challenges, standardization efforts, and major enablers are also discussed. An exemplary use case is provided to showcase the potential of AI/ML in Wi-Fi at different adoption stages.
On the Convergence of No-Regret Dynamics in Information Retrieval Games with Proportional Ranking Functions
Authors: Omer Madmon, Idan Pipano, Itamar Reinman, Moshe Tennenholtz
Subjects: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2405.11517
Pdf link: https://arxiv.org/pdf/2405.11517
Abstract Publishers who publish their content on the web act strategically, in a behavior that can be modeled within the online learning framework. Regret, a central concept in machine learning, serves as a canonical measure for assessing the performance of learning agents within this framework. We prove that any proportional content ranking function with a concave activation function induces games in which no-regret learning dynamics converge. Moreover, for proportional ranking functions, we prove the equivalence of the concavity of the activation function, the social concavity of the induced games and the concavity of the induced games. We also study the empirical trade-offs between publishers' and users' welfare, under different choices of the activation function, using a state-of-the-art no-regret dynamics algorithm. Furthermore, we demonstrate how the choice of the ranking function and changes in the ecosystem structure affect these welfare measures, as well as the dynamics' convergence rate.
Global Convergence of Decentralized Retraction-Free Optimization on the Stiefel Manifold
Authors: Youbang Sun, Shixiang Chen, Alfredo Garcia, Shahin Shahrampour
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2405.11590
Pdf link: https://arxiv.org/pdf/2405.11590
Abstract Many classical and modern machine learning algorithms require solving optimization tasks under orthogonal constraints. Solving these tasks often requires calculating retraction-based gradient descent updates on the corresponding Riemannian manifold, which can be computationally expensive. Recently, Ablin et al. proposed an infeasible retraction-free algorithm, which is significantly more efficient. In this paper, we study the decentralized non-convex optimization task over a network of agents on the Stiefel manifold with retraction-free updates. We propose the Decentralized Retraction-Free Gradient Tracking (DRFGT) algorithm, and show that DRFGT exhibits an ergodic $\mathcal{O}(1/K)$ convergence rate, the same rate of convergence as the centralized, retraction-based methods. We also provide numerical experiments demonstrating that DRFGT performs on par with the state-of-the-art retraction-based methods with substantially reduced computational overhead.
How to integrate cloud service, data analytic and machine learning technique to reduce cyber risks associated with the modern cloud based infrastructure
Authors: Upakar Bhatta
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2405.11601
Pdf link: https://arxiv.org/pdf/2405.11601
Abstract The combination of cloud technology, machine learning, and data visualization techniques allows hybrid enterprise networks to hold massive volumes of data and provide employees and customers easy access to these cloud data. These massive collections of complex data sets are facing security challenges. While cloud platforms are more vulnerable to security threats and traditional security technologies are unable to cope with the rapid data explosion in cloud platforms, machine learning powered security solutions and data visualization techniques are playing instrumental roles in detecting security threats and data breaches and in automatically finding software vulnerabilities. The purpose of this paper is to present some of the widely used cloud services, machine learning techniques and data visualization approaches and demonstrate how to integrate cloud services, data analytics and machine learning techniques that can be used to detect and reduce cyber risks associated with the modern cloud-based infrastructure. In this paper I applied a supervised machine learning classifier to design a model based on the well-known UNSW-NB15 dataset to predict the network behavior metrics and demonstrated how data analytics techniques can be integrated to visualize network traffic.
Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection
Authors: Abdulla Al-Subaiey, Mohammed Al-Thani, Naser Abdullah Alam, Kaniz Fatema Antora, Amith Khandakar, SM Ashfaq Uz Zaman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11619
Pdf link: https://arxiv.org/pdf/2405.11619
Abstract Phishing emails continue to pose a significant threat, causing financial losses and security breaches. This study addresses limitations in existing research, such as reliance on proprietary datasets and lack of real-world application, by proposing a high-performance machine learning model for email classification. Utilizing the largest comprehensive public dataset available, the model achieves an F1 score of 0.99 and is designed for deployment within relevant applications. Additionally, Explainable AI (XAI) is integrated to enhance user trust. This research offers a practical and highly accurate solution, contributing to the fight against phishing by empowering users with a real-time web-based application for phishing email detection.
Movie Revenue Prediction using Machine Learning Models
Authors: Vikranth Udandarao, Pratyush Gupta
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11651
Pdf link: https://arxiv.org/pdf/2405.11651
Abstract In the contemporary film industry, accurately predicting a movie's earnings is paramount for maximizing profitability. This project aims to develop a machine learning model for predicting movie earnings based on input features like the movie name, the MPAA rating of the movie, the genre of the movie, the year of release of the movie, the IMDb Rating, the votes by the watchers, the director, the writer and the leading cast, the country of production of the movie, the budget of the movie, the production company and the runtime of the movie. Through a structured methodology involving data collection, preprocessing, analysis, model selection, evaluation, and improvement, a robust predictive model is constructed. Linear Regression, Decision Trees, Random Forest Regression, Bagging, XGBoosting and Gradient Boosting have been trained and tested. Model improvement strategies include hyperparameter tuning and cross-validation. The resulting model offers promising accuracy and generalization, facilitating informed decision-making in the film industry to maximize profits.
Interpretable Machine Learning Enhances Disease Prognosis: Applications on COVID-19 and Onward
Authors: Ke Ma, Jinzhi Shen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11672
Pdf link: https://arxiv.org/pdf/2405.11672
Abstract In response to the COVID-19 pandemic, the integration of interpretable machine learning techniques has garnered significant attention, offering transparent and understandable insights crucial for informed clinical decision making. This literature review delves into the applications of interpretable machine learning in predicting the prognosis of respiratory diseases, particularly focusing on COVID-19 and its implications for future research and clinical practice. We reviewed various machine learning models that are not only capable of incorporating existing clinical domain knowledge but also have the learning capability to explore new information from the data. These models and experiences not only aid in managing the current crisis but also hold promise for addressing future disease outbreaks. By harnessing interpretable machine learning, healthcare systems can enhance their preparedness and response capabilities, thereby improving patient outcomes and mitigating the impact of respiratory diseases in the years to come.
Learning Regularities from Data using Spiking Functions: A Theory
Authors: Canlin Zhang, Xiuwen Liu
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2405.11684
Pdf link: https://arxiv.org/pdf/2405.11684
Abstract Deep neural networks trained in an end-to-end manner are proven to be efficient in a wide range of machine learning tasks. However, there is one drawback of end-to-end learning: The learned features and information are implicitly represented in neural network parameters, which cannot be used as regularities, concepts or knowledge to explicitly represent the data probability distribution. To resolve this issue, we propose in this paper a new machine learning theory, which mathematically defines what regularities are. Briefly, regularities are concise representations of the non-random features, or 'non-randomness' in the data probability distribution. Combining with information theory, we claim that regularities can also be regarded as a small amount of information encoding a large amount of information. Our theory is based on spiking functions. That is, if a function can react to, or spike on specific data samples more frequently than random noise inputs, we say that such a function discovers non-randomness from the data distribution, and encodes the non-randomness into regularities. Our theory also discusses applying multiple spiking functions to the same data distribution. In this process, we claim that the 'best' regularities, or the optimal spiking functions, are those that can capture the largest amount of information from the data distribution, and then encode the captured information in the most concise way. Theorems and hypotheses are provided to describe mathematically what the 'best' regularities and optimal spiking functions are. Finally, we propose a machine learning approach, which can potentially obtain the optimal spiking functions regarding the given dataset in practice.
Contactless Polysomnography: What Radio Waves Tell Us about Sleep
Authors: Hao He, Chao Li, Wolfgang Ganglberger, Kaileigh Gallagher, Rumen Hristov, Michail Ouroutzoglou, Haoqi Sun, Jimeng Sun, Brandon Westover, Dina Katabi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2405.11739
Pdf link: https://arxiv.org/pdf/2405.11739
Abstract The ability to assess sleep at home, capture sleep stages, and detect the occurrence of apnea (without on-body sensors) simply by analyzing the radio waves bouncing off people's bodies while they sleep is quite powerful. Such a capability would allow for longitudinal data collection in patients' homes, informing our understanding of sleep and its interaction with various diseases and their therapeutic responses, both in clinical trials and routine care. In this article, we develop an advanced machine learning algorithm for passively monitoring sleep and nocturnal breathing from radio waves reflected off people while asleep. Validation results in comparison with the gold standard (i.e., polysomnography) (n=849) demonstrate that the model captures the sleep hypnogram (with an accuracy of 81% for 30-second epochs categorized into Wake, Light Sleep, Deep Sleep, or REM), detects sleep apnea (AUROC = 0.88), and measures the patient's Apnea-Hypopnea Index (ICC=0.95; 95% CI = [0.93, 0.97]). Notably, the model exhibits equitable performance across race, sex, and age. Moreover, the model uncovers informative interactions between sleep stages and a range of diseases including neurological, psychiatric, cardiovascular, and immunological disorders. These findings not only hold promise for clinical practice and interventional trials but also underscore the significance of sleep as a fundamental component in understanding and managing various diseases.
Fed-Credit: Robust Federated Learning with Credibility Management
Authors: Jiayan Chen, Zhirong Qian, Tianhui Meng, Xitong Gao, Tian Wang, Weijia Jia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11758
Pdf link: https://arxiv.org/pdf/2405.11758
Abstract Aiming at privacy preservation, Federated Learning (FL) is an emerging machine learning approach enabling model training on decentralized devices or data sources. The learning mechanism of FL relies on aggregating parameter updates from individual clients. However, this process may pose a potential security risk due to the presence of malicious devices. Existing solutions are either costly due to the use of compute-intensive technology, or restrictive for reasons of strong assumptions such as the prior knowledge of the number of attackers and how they attack. Few methods consider both privacy constraints and uncertain attack scenarios. In this paper, we propose a robust FL approach based on the credibility management scheme, called Fed-Credit. Unlike previous studies, our approach does not require prior knowledge of the nodes and the data distribution. It maintains and employs a credibility set, which weighs the historical clients' contributions based on the similarity between the local models and global model, to adjust the global model update. The subtlety of Fed-Credit is that the time decay and attitudinal value factor are incorporated into the dynamic adjustment of the reputation weights and it boasts a computational complexity of O(n) (n is the number of the clients). We conducted extensive experiments on the MNIST and CIFAR-10 datasets under 5 types of attacks. The results exhibit superior accuracy and resilience against adversarial attacks, all while maintaining comparatively low computational complexity. Among these, on the Non-IID CIFAR-10 dataset, our algorithm exhibited performance enhancements of 19.5% and 14.5%, respectively, in comparison to the state-of-the-art algorithm when dealing with two types of data poisoning attacks.
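The general idea of similarity-driven, time-decayed client weighting can be sketched as follows. This is only an illustration of the concept described above, not the Fed-Credit update rule, which the abstract does not specify: the cosine similarity measure, the exponential decay formula, the clipping of negative similarity, and all function names are assumptions introduced here.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors, or 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def update_credibility(old: float, similarity: float, decay: float = 0.5) -> float:
    """Blend past credibility with the current round (assumed exponential decay).

    Negative similarity is clipped to zero so that a client pointing away
    from the reference direction earns no credit for this round.
    """
    return decay * old + (1.0 - decay) * max(similarity, 0.0)


def aggregate(updates: List[Vector], credibility: List[float]) -> Vector:
    """Credibility-weighted average of client updates."""
    total = sum(credibility)
    if total == 0.0:
        weights = [1.0 / len(updates)] * len(updates)
    else:
        weights = [c / total for c in credibility]
    dim = len(updates[0])
    return [sum(w * u[j] for w, u in zip(weights, updates)) for j in range(dim)]


# Toy round: two clients agree with the reference direction, one opposes it.
reference = [1.0, 0.0]  # hypothetical stand-in for the global model's direction
client_updates = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
credibility = [1.0, 1.0, 1.0]

credibility = [
    update_credibility(c, cosine_similarity(u, reference))
    for c, u in zip(credibility, client_updates)
]
global_update = aggregate(client_updates, credibility)
print(credibility, global_update)
```

Each client costs one similarity computation and one weight update per round, which is consistent with a per-round cost linear in the number of clients.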
Uncertainty of interpretability in Landslide Susceptibility Mapping: A Comparative Analysis of Statistical, Machine Learning, and Deep Learning Models
Authors: Cheng Chen, Lei Fan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11762
Pdf link: https://arxiv.org/pdf/2405.11762
Abstract Landslide susceptibility mapping (LSM) is crucial for identifying high-risk areas and informing prevention strategies. This study investigates the interpretability of statistical, machine learning (ML), and deep learning (DL) models in predicting landslide susceptibility. This is achieved by incorporating various relevant interpretation methods and two types of input factors: a comprehensive set of 19 contributing factors that are statistically relevant to landslides, as well as a dedicated set of 9 triggering factors directly associated with triggering landslides. Given that model performance is a crucial metric in LSM, our investigations into interpretability naturally involve assessing and comparing LSM accuracy across different models considered. In our investigation, the convolutional neural network model achieved the highest accuracy (0.8447 with 19 factors; 0.8048 with 9 factors), while Extreme Gradient Boosting and Support Vector Machine also demonstrated strong predictive capabilities, outperforming conventional statistical models. These findings indicate that DL and sophisticated ML algorithms can effectively capture the complex relationships between input factors and landslide occurrence. However, the interpretability of predictions varied among different models, particularly when using the broader set of 19 contributing factors. Explanation methods like SHAP, LIME, and DeepLIFT also led to variations in interpretation results. Using a comprehensive set of 19 contributing factors improved prediction accuracy but introduced complexities and inconsistency in model interpretations. Focusing on a dedicated set of 9 triggering factors sacrificed some predictive power but enhanced interpretability, as evidenced by more consistent key factors identified across various models and alignment with the findings of field investigation reports....
From SHAP Scores to Feature Importance Scores
Authors: Olivier Letoffe, Xuanxiang Huang, Nicholas Asher, Joao Marques-Silva
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11766
Pdf link: https://arxiv.org/pdf/2405.11766
Abstract A central goal of eXplainable Artificial Intelligence (XAI) is to assign relative importance to the features of a Machine Learning (ML) model given some prediction. The importance of this task of explainability by feature attribution is illustrated by the ubiquitous recent use of tools such as SHAP and LIME. Unfortunately, the exact computation of feature attributions, using the game-theoretical foundation underlying SHAP and LIME, can yield manifestly unsatisfactory results that are tantamount to reporting misleading relative feature importance. Recent work targeted rigorous feature attribution, by studying axiomatic aggregations of features based on logic-based definitions of explanations by feature selection. This paper shows that there is an essential relationship between feature attribution and a priori voting power, and that those recently proposed axiomatic aggregations represent a few instantiations of the range of power indices studied in the past. Furthermore, it remains unclear how some of the most widely used power indices might be exploited as feature importance scores (FISs), i.e. the use of power indices in XAI, and which of these indices would be the best suited for the purposes of XAI by feature attribution, namely in terms of not producing results that could be deemed as unsatisfactory. This paper proposes novel desirable properties that FISs should exhibit. In addition, the paper also proposes novel FISs exhibiting the proposed properties. Finally, the paper conducts a rigorous analysis of the best-known power indices in terms of the proposed properties.
LSEnet: Lorentz Structural Entropy Neural Network for Deep Graph Clustering
Authors: Li Sun, Zhenhao Huang, Hao Peng, Yujie Wang, Chunyang Liu, Philip S. Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11801
Pdf link: https://arxiv.org/pdf/2405.11801
Abstract Graph clustering is a fundamental problem in machine learning. Deep learning methods have achieved state-of-the-art results in recent years, but they still cannot work without predefined cluster numbers. This limitation motivates us to pose a more challenging problem of graph clustering with an unknown number of clusters. We propose to address this problem from a fresh perspective of graph information theory (i.e., structural information). In the literature, structural information has not yet been introduced to deep clustering, and its classic definition falls short of discrete formulation and modeling node features. In this work, we first formulate a differentiable structural information (DSI) in the continuous realm, accompanied by several theoretical results. By minimizing DSI, we construct the optimal partitioning tree where densely connected nodes in the graph tend to have the same assignment, revealing the cluster structure. DSI is also theoretically presented as a new graph clustering objective, not requiring the predefined cluster number. Furthermore, we design a neural LSEnet in the Lorentz model of hyperbolic space, where we integrate node features into structural information via manifold-valued graph convolution. Extensive empirical results on real graphs show the superiority of our approach.
FedCAda: Adaptive Client-Side Optimization for Accelerated and Stable Federated Learning
Authors: Liuzhi Zhou, Yu He, Kun Zhai, Xiang Liu, Sen Liu, Xingjun Ma, Guangnan Ye, Yu-Gang Jiang, Hongfeng Chai
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2405.11811
Pdf link: https://arxiv.org/pdf/2405.11811
Abstract Federated learning (FL) has emerged as a prominent approach for collaborative training of machine learning models across distributed clients while preserving data privacy. However, the quest to balance acceleration and stability becomes a significant challenge in FL, especially on the client-side. In this paper, we introduce FedCAda, an innovative federated client adaptive algorithm designed to tackle this challenge. FedCAda leverages the Adam algorithm to adjust the correction process of the first moment estimate $m$ and the second moment estimate $v$ on the client-side and aggregate adaptive algorithm parameters on the server-side, aiming to accelerate convergence speed and communication efficiency while ensuring stability and performance. Additionally, we investigate several algorithms incorporating different adjustment functions. This comparative analysis revealed that due to the limited information contained within client models from other clients during the initial stages of federated learning, more substantial constraints need to be imposed on the parameters of the adaptive algorithm. As federated learning progresses and clients gather more global information, FedCAda gradually diminishes the impact on adaptive parameters. These findings provide insights for enhancing the robustness and efficiency of algorithmic improvements. Through extensive experiments on computer vision (CV) and natural language processing (NLP) datasets, we demonstrate that FedCAda outperforms the state-of-the-art methods in terms of adaptability, convergence, stability, and overall performance. This work contributes to adaptive algorithms for federated learning, encouraging further exploration.
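As background for the moment estimates $m$ and $v$ mentioned above, the sketch below shows one step of the standard Adam optimizer with its usual bias correction. It is the textbook baseline only; the adjusted correction functions that FedCAda studies are not given in the abstract and are not reproduced here. The function name `adam_step` and the scalar setting are choices made for illustration.

```python
import math
from typing import Tuple


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 0.1,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """One standard Adam update for a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v being initialized at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from param = 1.0 with gradient 2.0 and zero-initialized moments.
new_param, new_m, new_v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(new_param, new_m, new_v)
```

On the first step the corrected estimates equal the gradient and its square, so the update has magnitude close to `lr` regardless of the gradient scale. Early steps are where the correction matters most.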
Towards Graph Contrastive Learning: A Survey and Beyond
Authors: Wei Ju, Yifan Wang, Yifang Qin, Zhengyang Mao, Zhiping Xiao, Junyu Luo, Junwei Yang, Yiyang Gu, Dongjie Wang, Qingqing Long, Siyu Yi, Xiao Luo, Ming Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2405.11868
Pdf link: https://arxiv.org/pdf/2405.11868
Abstract In recent years, deep learning on graphs has achieved remarkable success in various domains. However, the reliance on annotated graph data remains a significant bottleneck due to its prohibitive cost and time-intensive nature. To address this challenge, self-supervised learning (SSL) on graphs has gained increasing attention and has made significant progress. SSL enables machine learning models to produce informative representations from unlabeled graph data, reducing the reliance on expensive labeled data. While SSL on graphs has witnessed widespread adoption, one critical component, Graph Contrastive Learning (GCL), has not been thoroughly investigated in the existing literature. Thus, this survey aims to fill this gap by offering a dedicated survey on GCL. We provide a comprehensive overview of the fundamental principles of GCL, including data augmentation strategies, contrastive modes, and contrastive optimization objectives. Furthermore, we explore the extensions of GCL to other aspects of data-efficient graph learning, such as weakly supervised learning, transfer learning, and related scenarios. We also discuss practical applications spanning domains such as drug discovery, genomics analysis, recommender systems, and finally outline the challenges and potential future directions in this field.
A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus
Authors: Eduard Poesina, Cornelia Caragea, Radu Tudor Ionescu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11877
Pdf link: https://arxiv.org/pdf/2405.11877
Abstract Natural language inference (NLI), the task of recognizing the entailment relationship in sentence pairs, is an actively studied topic serving as a proxy for natural language understanding. Despite the relevance of the task in building conversational agents and improving text classification, machine translation and other NLP tasks, to the best of our knowledge, there is no publicly available NLI corpus for the Romanian language. To this end, we introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs, which are obtained via distant supervision, and 6K validation and test sentence pairs, which are manually annotated with the correct labels. We conduct experiments with multiple machine learning methods based on distant learning, ranging from shallow models based on word embeddings to transformer-based neural networks, to establish a set of competitive baselines. Furthermore, we improve on the best model by employing a new curriculum learning strategy based on data cartography. Our dataset and code to reproduce the baselines are available at this https URL.
Out-of-Distribution Detection with a Single Unconditional Diffusion Model
Authors: Alvin Heng, Alexandre H. Thiery, Harold Soh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2405.11881
Pdf link: https://arxiv.org/pdf/2405.11881
Abstract Out-of-distribution (OOD) detection is a critical task in machine learning that seeks to identify abnormal samples. Traditionally, unsupervised methods utilize a deep generative model for OOD detection. However, such approaches necessitate a different model when evaluating abnormality against a new distribution. With the emergence of foundational generative models, this paper explores whether a single generalist model can also perform OOD detection across diverse tasks. To that end, we introduce our method, Diffusion Paths (DiffPath), in this work. DiffPath proposes to utilize a single diffusion model originally trained to perform unconditional generation for OOD detection. Specifically, we introduce a novel technique of measuring the rate-of-change and curvature of the diffusion paths connecting samples to the standard normal. Extensive experiments show that with a single model, DiffPath outperforms prior work on a variety of OOD tasks involving different distributions. Our code is publicly available at this https URL.
Ensemble and Mixture-of-Experts DeepONets For Operator Learning
Authors: Ramansh Sharma, Varun Shankar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11907
Pdf link: https://arxiv.org/pdf/2405.11907
Abstract We present a novel deep operator network (DeepONet) architecture for operator learning, the ensemble DeepONet, that allows for enriching the trunk network of a single DeepONet with multiple distinct trunk networks. This trunk enrichment allows for greater expressivity and generalization capabilities over a range of operator learning problems. We also present a spatial mixture-of-experts (MoE) DeepONet trunk network architecture that utilizes a partition-of-unity (PoU) approximation to promote spatial locality and model sparsity in the operator learning problem. We first prove that both the ensemble and PoU-MoE DeepONets are universal approximators. We then demonstrate that ensemble DeepONets containing a trunk ensemble of a standard trunk, the PoU-MoE trunk, and/or a proper orthogonal decomposition (POD) trunk can achieve 2-4x lower relative $\ell_2$ errors than standard DeepONets and POD-DeepONets on both standard and challenging new operator learning problems involving partial differential equations (PDEs) in two and three dimensions. Our new PoU-MoE formulation provides a natural way to incorporate spatial locality and model sparsity into any neural network architecture, while our new ensemble DeepONet provides a powerful and general framework for incorporating basis enrichment in scientific machine learning architectures for operator learning.
On Efficient and Statistical Quality Estimation for Data Annotation
Authors: Jan-Christoph Klie, Rahul Nair, Juan Haladjian, Marc Kirchner
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2405.11919
Pdf link: https://arxiv.org/pdf/2405.11919
Abstract Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect. But checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; sizes are chosen mostly without justification or regard to statistical power and, more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs money that could be better spent, for instance on more annotations. Therefore, we first describe in detail how to use confidence intervals for finding the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes up to 50% while providing the same statistical guarantees.
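A rough illustration of sizing an inspection sample from a confidence interval: the sketch below uses the textbook normal-approximation (Wald) interval for a proportion, where the sample size is the smallest $n$ with $z^2 p(1-p)/n \le E^2$ for margin of error $E$. The abstract does not say which interval construction the paper uses, so this formula, the function name `min_sample_size`, and the default worst-case error rate of 0.5 are assumptions.

```python
import math
from statistics import NormalDist


def min_sample_size(
    margin: float,
    confidence: float = 0.95,
    expected_error_rate: float = 0.5,
) -> int:
    """Smallest n whose Wald interval half-width is at most `margin`.

    `expected_error_rate` is a prior guess of the annotation error rate;
    0.5 is the worst case and yields the largest sample size.
    """
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    p = expected_error_rate
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


worst_case = min_sample_size(margin=0.05, confidence=0.95)
with_prior = min_sample_size(margin=0.05, confidence=0.95, expected_error_rate=0.1)
print(worst_case, with_prior)
```

A lower prior guess for the error rate shrinks the required sample, which is one reason a justified sample size can be much smaller than an arbitrary one. The Wald interval is known to behave poorly for rates near 0 or 1, so this sketch should be read as a first approximation.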
Data Contamination Calibration for Black-box LLMs
Authors: Wentao Ye, Jiaqi Hu, Liyao Li, Haobo Wang, Gang Chen, Junbo Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.11930
Pdf link: https://arxiv.org/pdf/2405.11930
Abstract The rapid advancement of Large Language Models (LLMs) is tightly associated with the expansion of training data size. However, unchecked ultra-large-scale training sets introduce a series of potential risks such as data contamination, i.e., benchmark data being used for training. In this work, we propose a holistic method named Polarized Augment Calibration (PAC), along with a new to-be-released dataset, to detect contaminated data and diminish the contamination effect. PAC extends the popular MIA (Membership Inference Attack) from the machine learning community by forming a more global target for detecting training data, in order to clarify invisible training data. As a pioneering work, PAC is plug-and-play and can be integrated with most (if not all) current white- and black-box LLMs. In extensive experiments, PAC outperforms existing methods by at least 4.5% on data contamination detection across more than 4 dataset formats and more than 10 base LLMs. Besides, our application in real-world scenarios highlights the prominent presence of contamination and related issues.
Safe by Design Autonomous Driving Systems
Authors: Marius Bozga, Joseph Sifakis
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2405.11995
Pdf link: https://arxiv.org/pdf/2405.11995
Abstract Developing safe autonomous driving systems is a major scientific and technical challenge. Existing AI-based end-to-end solutions do not offer the necessary safety guarantees, while traditional systems engineering approaches are defeated by the complexity of the problem. Currently, there is an increasing interest in hybrid design solutions, integrating machine learning components, when necessary, while using model-based components for goal management and planning. We study a method for building safe by design autonomous driving systems, based on the assumption that the capability to drive boils down to the coordinated execution of a given set of driving operations. The assumption is substantiated by a compositionality result considering that autopilots are dynamic systems receiving a small number of types of vistas as input, each vista defining a free space in its neighborhood. It is shown that safe driving for each type of vista in the corresponding free space implies safe driving for any possible scenario under some easy-to-check conditions concerning the transition between vistas. The designed autopilot comprises distinct control policies, one per type of vista, articulated in two consecutive phases. The first phase consists of carefully managing a potentially risky situation by virtually reducing speed, while the second phase consists of exiting the situation by accelerating. For their predictions, the designed autopilots use simple functions characterizing the acceleration and deceleration capabilities of the vehicles. They cover the main driving operations, including entering a main road, overtaking, crossing intersections protected by traffic lights or signals, and driving on freeways. The results presented reinforce the case for hybrid solutions that incorporate mathematically elegant and robust decision methods that are safe by design.
Energy-Efficient Federated Edge Learning with Streaming Data: A Lyapunov Optimization Approach
Authors: Chung-Hsuan Hu, Zheng Chen, Erik G. Larsson
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2405.12046
Pdf link: https://arxiv.org/pdf/2405.12046
Abstract Federated learning (FL) has received significant attention in recent years for its advantages in efficient training of machine learning models across distributed clients without disclosing user-sensitive data. Specifically, in federated edge learning (FEEL) systems, the time-varying nature of wireless channels introduces inevitable system dynamics in the communication process, thereby affecting training latency and energy consumption. In this work, we further consider a streaming data scenario where new training data samples are randomly generated over time at edge devices. Our goal is to develop a dynamic scheduling and resource allocation algorithm to address the inherent randomness in data arrivals and resource availability under long-term energy constraints. To achieve this, we formulate a stochastic network optimization problem and use the Lyapunov drift-plus-penalty framework to obtain a dynamic resource management design. Our proposed algorithm makes adaptive decisions on device scheduling, computational capacity adjustment, and allocation of bandwidth and transmit power in every round. We provide convergence analysis for the considered setting with heterogeneous data and time-varying objective functions, which supports the rationale behind our proposed scheduling design. The effectiveness of our scheme is verified through simulation results, demonstrating improved learning performance and energy efficiency as compared to baseline schemes.
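The drift-plus-penalty pattern mentioned above can be sketched generically as follows. This is not the paper's algorithm: the abstract does not give its queue definitions or objective, so the virtual energy queue, the per-round candidate actions, and the names `update_virtual_queue`, `pick_action`, and the trade-off weight `v` are all illustrative assumptions.

```python
from typing import Sequence, Tuple


def update_virtual_queue(queue: float, energy_used: float,
                         energy_budget: float) -> float:
    """Virtual queue for a long-term average energy constraint.

    The queue grows when a round spends more than the per-round budget
    and drains (never below zero) when it spends less.
    """
    return max(queue + energy_used - energy_budget, 0.0)


def pick_action(options: Sequence[Tuple[float, float]],
                queue: float, v: float) -> int:
    """Return the index of the option minimizing v * penalty + queue * energy.

    Each option is a hypothetical (penalty, energy) pair for one round,
    e.g. a learning-loss proxy and the energy the action would consume.
    """
    scores = [v * penalty + queue * energy for penalty, energy in options]
    return min(range(len(options)), key=scores.__getitem__)


if __name__ == "__main__":
    options = [(1.0, 5.0), (3.0, 1.0)]  # (penalty, energy), made-up values
    queue = 0.0
    for round_index in range(5):
        choice = pick_action(options, queue, v=1.0)
        queue = update_virtual_queue(queue, options[choice][1], energy_budget=2.0)
        print(round_index, choice, queue)
```

A larger `v` favors the penalty term (learning performance in this toy setup), while a growing queue pushes decisions toward low-energy actions until the budget violation is worked off.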
GAN-GRID: A Novel Generative Attack on Smart Grid Stability Prediction
Authors: Emad Efatinasab, Alessandro Brighente, Mirco Rampazzo, Nahal Azadi, Mauro Conti
Subjects: Cryptography and Security (cs.CR); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2405.12076
Pdf link: https://arxiv.org/pdf/2405.12076
Abstract The smart grid represents a pivotal innovation in modernizing the electricity sector, offering an intelligent, digitalized energy network capable of optimizing energy delivery from source to consumer. It hence represents the backbone of the energy sector of a nation. Due to its central role, the availability of the smart grid is paramount, and it is hence necessary to have in-depth control of its operations and safety. To this aim, researchers developed multiple solutions to assess the smart grid's stability and guarantee that it operates in a safe state. Artificial intelligence and machine learning algorithms have proven to be effective measures to accurately predict the smart grid's stability. Despite the presence of known adversarial attacks and potential solutions, there currently exists no standardized measure to protect smart grids against this threat, leaving them open to new adversarial attacks. In this paper, we propose GAN-GRID, a novel adversarial attack targeting the stability prediction system of a smart grid, tailored to real-world constraints. Our findings reveal that an adversary armed solely with the stability model's output, devoid of data or model knowledge, can craft data classified as stable with an Attack Success Rate (ASR) of 0.99. Also, by manipulating authentic data and sensor values, the attacker can amplify grid issues, potentially undetected due to a compromised stability prediction system. These results underscore the imperative of fortifying smart grid security mechanisms against adversarial manipulation to uphold system stability and reliability.
Channel Balance Interpolation in the Lightning Network via Machine Learning
Authors: Vincent, Emanuele Rossi, Vikash Singh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.12087
Pdf link: https://arxiv.org/pdf/2405.12087
Abstract The Bitcoin Lightning Network is a Layer 2 payment protocol that addresses Bitcoin's scalability by facilitating quick and cost-effective transactions through payment channels. This research explores the feasibility of using machine learning models to interpolate channel balances within the network, which can be used for optimizing the network's pathfinding algorithms. While there has been much exploration in balance probing and multipath payment protocols, predicting channel balances using solely node and channel features remains an uncharted area. This paper evaluates the performance of several machine learning models against two heuristic baselines and investigates the predictive capabilities of various features. Our model performs favorably in experimental evaluation, outperforming by 10% an equal-split baseline in which both edges are assigned half of the channel capacity.
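The equal-split baseline described above is simple enough to sketch directly. The abstract does not say which error metric the comparison uses, so the mean absolute error here is an assumption, and the function names and balance values are hypothetical.

```python
from typing import Sequence, Tuple


def equal_split_baseline(capacity: float) -> Tuple[float, float]:
    """Assign half of the channel capacity to each direction of the channel."""
    half = capacity / 2
    return half, half


def mean_absolute_error(predicted: Sequence[float],
                        actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and actual balances."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    capacity = 1_000_000  # satoshis, made-up example channel
    true_balances = (800_000, 200_000)  # hypothetical ground truth
    guess = equal_split_baseline(capacity)
    print(guess, mean_absolute_error(guess, true_balances))
```

Any learned model has to beat this guess to be useful for pathfinding, which is why it serves as a natural reference point.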
An Active Learning Framework with a Class Balancing Strategy for Time Series Classification
Authors: Shemonto Das
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.12122
Pdf link: https://arxiv.org/pdf/2405.12122
Abstract Training machine learning models for classification tasks often requires labeling numerous samples, which is costly and time-consuming, especially in time series analysis. This research investigates Active Learning (AL) strategies to reduce the amount of labeled data needed for effective time series classification. Traditional AL techniques cannot control the selection of instances per class for labeling, leading to potential bias in classification performance and instance selection, particularly in imbalanced time series datasets. To address this, we propose a novel class-balancing instance selection algorithm integrated with standard AL strategies. Our approach aims to select more instances from classes with fewer labeled examples, thereby addressing imbalance in time series datasets. We demonstrate the effectiveness of our AL framework in selecting informative data samples for two distinct domains of tactile texture recognition and industrial fault detection. In robotics, our method achieves high-performance texture categorization while significantly reducing labeled training data requirements to 70%. We also evaluate the impact of different sliding window time intervals on robotic texture classification using AL strategies. In synthetic fiber manufacturing, we adapt AL techniques to address the challenge of fault classification, aiming to minimize data annotation cost and time for industries. We also address real-life class imbalances in the multiclass industrial anomalous dataset using our class-balancing instance algorithm integrated with AL strategies. Overall, this thesis highlights the potential of our AL framework across these two distinct domains.
Alzheimer's Magnetic Resonance Imaging Classification Using Deep and Meta-Learning Models
Authors: Nida Nasir, Muneeb Ahmed, Neda Afreen, Mustafa Sameer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2405.12126
Pdf link: https://arxiv.org/pdf/2405.12126
Abstract Deep learning, a cutting-edge machine learning approach, outperforms traditional machine learning in identifying intricate structures in complex high-dimensional data, particularly in the domain of healthcare. This study focuses on classifying Magnetic Resonance Imaging (MRI) data for Alzheimer's disease (AD) by leveraging deep learning techniques characterized by state-of-the-art CNNs. Brain imaging techniques such as MRI have enabled the measurement of pathophysiological brain changes related to Alzheimer's disease. Alzheimer's disease is the leading cause of dementia in the elderly, and it is an irreversible brain illness that causes gradual cognitive function disorder. In this paper, we first train several benchmark deep models individually and then use an ensembling approach to combine multiple CNNs, aiming for higher recall and accuracy. The model's effectiveness is evaluated using various methods, including stacking, majority voting, and the combination of models with high recall values. Majority voting performs better than the alternative modelling approaches, as it typically reduces the variance in the predictions. We report a test accuracy of 90% with a precision score of 0.90 and a recall score of 0.89 in our proposed approach. In the future, this study can be extended to incorporate other types of medical data, including signals, images, and other data. The same or alternative datasets can be used with additional classifiers, neural networks, and AI techniques to enhance Alzheimer's detection.
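As an illustration of the majority-voting step, the following minimal sketch combines hard class predictions from several classifiers. It is not the authors' implementation: the function name `majority_vote`, the label encoding, and the tie-breaking rule (the label seen first among the tied ones wins) are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(predictions: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Combine per-model label predictions by majority vote.

    `predictions[m][i]` is the label model m assigns to sample i.
    Ties go to the label encountered first, i.e. the earliest model's vote
    among the tied labels (assumed tie-breaking rule).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(model_preds) != n_samples for model_preds in predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


if __name__ == "__main__":
    # Three hypothetical CNNs voting on three scans (0 = control, 1 = AD).
    votes = [
        [0, 1, 1],
        [0, 1, 0],
        [1, 1, 0],
    ]
    print(majority_vote(votes))
```

Using an odd number of models on a binary task avoids ties altogether, which keeps the tie-breaking rule from influencing the result.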
Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models
Authors: Tong Zeng, Daniel E. Acuna
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2405.12206
Pdf link: https://arxiv.org/pdf/2405.12206
Abstract Scientists learn early on how to cite scientific sources to support their claims. Sometimes, however, scientists have challenges determining where a citation should be situated -- or, even worse, fail to cite a source altogether. Automatically detecting sentences that need a citation (i.e., citation worthiness) could solve both of these issues, leading to more robust and well-constructed scientific arguments. Previous researchers have applied machine learning to this task but have used small datasets and models that do not take advantage of recent algorithmic developments such as attention mechanisms in deep learning. We hypothesize that we can develop significantly accurate deep learning architectures that learn from large supervised datasets constructed from open access publications. In this work, we propose a Bidirectional Long Short-Term Memory (BiLSTM) network with attention mechanism and contextual information to detect sentences that need citations. We also produce a new, large dataset (PMOA-CITE) based on PubMed Open Access Subset, which is orders of magnitude larger than previous datasets. Our experiments show that our architecture achieves state-of-the-art performance on the standard ACL-ARC dataset ($F_{1}=0.507$) and exhibits high performance ($F_{1}=0.856$) on the new PMOA-CITE. Moreover, we show that it can transfer learning across these datasets. We further use interpretable models to illuminate how specific language is used to promote and inhibit citations. We discover that sections and surrounding sentences are crucial for our improved predictions. We further examined purported mispredictions of the model, and uncovered systematic human mistakes in citation behavior and source data. This opens the door for our model to check documents during pre-submission and pre-archival procedures. We make this new dataset, the code, and a web-based tool available to the community.
