Showing new listings for Friday, 27 September 2024

Keyword: differential privacy

Immersion and Invariance-based Coding for Privacy-Preserving Federated Learning

Authors: Haleh Hayati, Carlos Murguia, Nathan van de Wouw
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17201
Pdf link: https://arxiv.org/pdf/2409.17201
Abstract Federated learning (FL) has emerged as a method to preserve privacy in collaborative distributed learning. In FL, clients train AI models directly on their devices rather than sharing data with a centralized server, which can pose privacy risks. However, it has been shown that despite FL's partial protection of local data privacy, information about clients' data can still be inferred from shared model updates during training. In recent years, several privacy-preserving approaches have been developed to mitigate this privacy leakage in FL, though they often provide privacy at the cost of model performance or system efficiency. Balancing these trade-offs presents a significant challenge in implementing FL schemes. In this manuscript, we introduce a privacy-preserving FL framework that combines differential privacy and system immersion tools from control theory. The core idea is to treat the optimization algorithms used in standard FL schemes (e.g., gradient-based algorithms) as a dynamical system that we seek to immerse into a higher-dimensional system (referred to as the target optimization algorithm). The target algorithm's dynamics are designed such that, first, the model parameters of the original algorithm are immersed in its parameters; second, it operates on distorted parameters; and third, it converges to an encoded version of the true model parameters from the original algorithm. These encoded parameters can then be decoded at the server to retrieve the original model parameters. We demonstrate that the proposed privacy-preserving scheme can be tailored to offer any desired level of differential privacy for both local and global model parameters, while maintaining the same accuracy and convergence rate as standard FL algorithms.
KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation
Authors: Anantaa Kotal, Anupam Joshi
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17315
Pdf link: https://arxiv.org/pdf/2409.17315
Abstract The integration of privacy measures, including differential privacy techniques, ensures a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model's capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.
Fully Dynamic Graph Algorithms with Edge Differential Privacy
Authors: Sofya Raskhodnikova, Teresa Anna Steiner
Subjects: Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17623
Pdf link: https://arxiv.org/pdf/2409.17623
Abstract We study differentially private algorithms for analyzing graphs in the challenging setting of continual release with fully dynamic updates, where edges are inserted and deleted over time, and the algorithm is required to update the solution at every time step. Previous work has presented differentially private algorithms for many graph problems that can handle insertions only or deletions only (called partially dynamic algorithms) and obtained some hardness results for the fully dynamic setting. The only algorithms in the latter setting were for the edge count, given by Fichtenberger, Henzinger, and Ost (ESA 21), and for releasing the values of all graph cuts, given by Fichtenberger, Henzinger, and Upadhyay (ICML 23). We provide the first differentially private and fully dynamic graph algorithms for several other fundamental graph statistics (including the triangle count, the number of connected components, the size of the maximum matching, and the degree histogram), analyze their error and show strong lower bounds on the error for all algorithms in this setting. We study two variants of edge differential privacy for fully dynamic graph algorithms: event-level and item-level. We give upper and lower bounds on the error of both event-level and item-level fully dynamic algorithms for several fundamental graph problems. No fully dynamic algorithms that are private at the item-level (the more stringent of the two notions) were known before. In the case of item-level privacy, for several problems, our algorithms match our lower bounds.
Slowly Scaling Per-Record Differential Privacy
Authors: Brian Finley, Anthony M Caruso, Justin C Doty, Ashwin Machanavajjhala, Mikaela R Meyer, David Pujol, William Sexton, Zachary Terner
Subjects: Subjects: Cryptography and Security (cs.CR); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2409.18118
Pdf link: https://arxiv.org/pdf/2409.18118
Abstract We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
Keyword: privacy

Immersion and Invariance-based Coding for Privacy-Preserving Federated Learning
Authors: Haleh Hayati, Carlos Murguia, Nathan van de Wouw
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17201
Pdf link: https://arxiv.org/pdf/2409.17201
Abstract Federated learning (FL) has emerged as a method to preserve privacy in collaborative distributed learning. In FL, clients train AI models directly on their devices rather than sharing data with a centralized server, which can pose privacy risks. However, it has been shown that despite FL's partial protection of local data privacy, information about clients' data can still be inferred from shared model updates during training. In recent years, several privacy-preserving approaches have been developed to mitigate this privacy leakage in FL, though they often provide privacy at the cost of model performance or system efficiency. Balancing these trade-offs presents a significant challenge in implementing FL schemes. In this manuscript, we introduce a privacy-preserving FL framework that combines differential privacy and system immersion tools from control theory. The core idea is to treat the optimization algorithms used in standard FL schemes (e.g., gradient-based algorithms) as a dynamical system that we seek to immerse into a higher-dimensional system (referred to as the target optimization algorithm). The target algorithm's dynamics are designed such that, first, the model parameters of the original algorithm are immersed in its parameters; second, it operates on distorted parameters; and third, it converges to an encoded version of the true model parameters from the original algorithm. These encoded parameters can then be decoded at the server to retrieve the original model parameters. We demonstrate that the proposed privacy-preserving scheme can be tailored to offer any desired level of differential privacy for both local and global model parameters, while maintaining the same accuracy and convergence rate as standard FL algorithms.
SHEATH: Defending Horizontal Collaboration for Distributed CNNs against Adversarial Noise
Authors: Muneeba Asif, Mohammad Kumail Kazmi, Mohammad Ashiqur Rahman, Syed Rafay Hasan, Soamar Homsi
Subjects: Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2409.17279
Pdf link: https://arxiv.org/pdf/2409.17279
Abstract As edge computing and the Internet of Things (IoT) expand, horizontal collaboration (HC) emerges as a distributed data processing solution for resource-constrained devices. In particular, a convolutional neural network (CNN) model can be deployed on multiple IoT devices, allowing distributed inference execution for image recognition while ensuring model and data privacy. Yet, this distributed architecture remains vulnerable to adversaries who want to make subtle alterations that impact the model, even if they lack access to the entire model. Such vulnerabilities can have severe implications for various sectors, including healthcare, military, and autonomous systems. However, security solutions for these vulnerabilities have not been explored. This paper presents a novel framework for Secure Horizontal Edge with Adversarial Threat Handling (SHEATH) to detect adversarial noise and eliminate its effect on CNN inference by recovering the original feature maps. Specifically, SHEATH aims to address vulnerabilities without requiring complete knowledge of the CNN model in HC edge architectures based on sequential partitioning. It ensures data and model integrity, offering security against adversarial attacks in diverse HC environments. Our evaluations demonstrate SHEATH's adaptability and effectiveness across diverse CNN configurations.
Investigating Privacy Attacks in the Gray-Box Setting to Enhance Collaborative Learning Schemes
Authors: Federico Mazzone, Ahmad Al Badawi, Yuriy Polyakov, Maarten Everts, Florian Hahn, Andreas Peter
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17283
Pdf link: https://arxiv.org/pdf/2409.17283
Abstract The notion that collaborative machine learning can ensure privacy by just withholding the raw data is widely acknowledged to be flawed. Over the past seven years, the literature has revealed several privacy attacks that enable adversaries to extract information about a model's training dataset by exploiting access to model parameters during or after training. In this work, we study privacy attacks in the gray-box setting, where the attacker has only limited access - in terms of view and actions - to the model. The findings of our investigation provide new insights for the development of privacy-preserving collaborative learning solutions. We deploy SmartCryptNN, a framework that tailors homomorphic encryption to protect the portions of the model posing higher privacy risks. Our solution offers a trade-off between privacy and efficiency, which varies based on the extent and selection of the model components we choose to protect. We explore it on dense neural networks, where through extensive evaluation of diverse datasets and architectures, we uncover instances where a favorable sweet spot in the trade-off can be achieved by safeguarding only a single layer of the network. In one of such instances, our approach trains ~4 times faster compared to fully encrypted solutions, while reducing membership leakage by 17.8 times compared to plaintext solutions.
Blockchain-Enabled Variational Information Bottleneck for Data Extraction Based on Mutual Information in Internet of Vehicles
Authors: Cui Zhang, Wenjun Zhang, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Khaled B. Letaief
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17287
Pdf link: https://arxiv.org/pdf/2409.17287
Abstract The Internet of Vehicles (IoV) network can address the issue of limited computing resources and data processing capabilities of individual vehicles, but it also brings the risk of privacy leakage to vehicle users. Applying blockchain technology can establish secure data links within the IoV, solving the problems of insufficient computing resources for each vehicle and the security of data transmission over the network. However, with the development of the IoV, the amount of data interaction between multiple vehicles and between vehicles and base stations, roadside units, etc., is continuously increasing. There is a need to further reduce the interaction volume, and intelligent data compression is key to solving this problem. The VIB technique facilitates the training of encoding and decoding models, substantially diminishing the volume of data that needs to be transmitted. This paper introduces an innovative approach that integrates blockchain with VIB, referred to as BVIB, designed to lighten computational workloads and reinforce the security of the network. We first construct a new network framework by separating the encoding and decoding networks to address the computational burden issue, and then propose a new algorithm to enhance the security of IoV networks. We also discuss the impact of the data extraction rate on system latency to determine the most suitable data extraction rate. An experimental framework combining Python and C++ has been established to substantiate the efficacy of our BVIB approach. Comprehensive simulation studies indicate that the BVIB consistently excels in comparison to alternative foundational methodologies.
KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation
Authors: Anantaa Kotal, Anupam Joshi
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17315
Pdf link: https://arxiv.org/pdf/2409.17315
Abstract The integration of privacy measures, including differential privacy techniques, ensures a provable privacy guarantee for the synthetic data. However, challenges arise for Generative Deep Learning models when tasked with generating realistic data, especially in critical domains such as Cybersecurity and Healthcare. Generative Models optimized for continuous data struggle to model discrete and non-Gaussian features that have domain constraints. Challenges increase when the training datasets are limited and not diverse. In such cases, generative models create synthetic data that repeats sensitive features, which is a privacy risk. Moreover, generative models face difficulties comprehending attribute constraints in specialized domains. This leads to the generation of unrealistic data that impacts downstream accuracy. To address these issues, this paper proposes a novel model, KIPPS, that infuses Domain and Regulatory Knowledge from Knowledge Graphs into Generative Deep Learning models for enhanced Privacy Preserving Synthetic data generation. The novel framework augments the training of generative models with supplementary context about attribute values and enforces domain constraints during training. This added guidance enhances the model's capacity to generate realistic and domain-compliant synthetic data. The proposed model is evaluated on real-world datasets, specifically in the domains of Cybersecurity and Healthcare, where domain constraints and rules add to the complexity of the data. Our experiments evaluate the privacy resilience and downstream accuracy of the model against benchmark methods, demonstrating its effectiveness in addressing the balance between privacy preservation and data accuracy in complex domains.
Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
Authors: Haodong Li, Hao Lu, Ying-Cong Chen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2409.17316
Pdf link: https://arxiv.org/pdf/2409.17316
Abstract Remote photoplethysmography (rPPG) is gaining prominence for its non-invasive approach to monitoring physiological signals using only cameras. Despite its promise, the adaptability of rPPG models to new, unseen domains is hindered due to the environmental sensitivity of physiological signals. To address this, we pioneer the Test-Time Adaptation (TTA) in rPPG, enabling the adaptation of pre-trained models to the target domain during inference, sidestepping the need for annotations or source data due to privacy considerations. Particularly, utilizing only the user's face video stream as the accessible target domain data, the rPPG model is adjusted by tuning on each single instance it encounters. However, 1) TTA algorithms are designed predominantly for classification tasks, ill-suited in regression tasks such as rPPG due to inadequate supervision. 2) Tuning pre-trained models in a single-instance manner introduces variability and instability, posing challenges to effectively filtering domain-relevant from domain-irrelevant features while simultaneously preserving the learned information. To overcome these challenges, we present Bi-TTA, a novel expert knowledge-based Bidirectional Test-Time Adapter framework. Specifically, leveraging two expert-knowledge priors for providing self-supervision, our Bi-TTA primarily comprises two modules: a prospective adaptation (PA) module using sharpness-aware minimization to eliminate domain-irrelevant noise, enhancing the stability and efficacy during the adaptation process, and a retrospective stabilization (RS) module to dynamically reinforce crucial learned model parameters, averting performance degradation caused by overfitting or catastrophic forgetting. To this end, we established a large-scale benchmark for rPPG tasks under TTA protocol. The experimental results demonstrate the significant superiority of our approach over the state-of-the-art.
Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD
Authors: Jie Hu, Yi-Ting Ma, Do Young Eun
Subjects: Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2409.17499
Pdf link: https://arxiv.org/pdf/2409.17499
Abstract Distributed learning is essential to train machine learning algorithms across heterogeneous agents while maintaining data privacy. We conduct an asymptotic analysis of Unified Distributed SGD (UD-SGD), exploring a variety of communication patterns, including decentralized SGD and local SGD within Federated Learning (FL), as well as the increasing communication interval in the FL setting. In this study, we assess how different sampling strategies, such as i.i.d. sampling, shuffling, and Markovian sampling, affect the convergence speed of UD-SGD by considering the impact of agent dynamics on the limiting covariance matrix as described in the Central Limit Theorem (CLT). Our findings not only support existing theories on linear speedup and asymptotic network independence, but also theoretically and empirically show how efficient sampling strategies employed by individual agents contribute to overall convergence in UD-SGD. Simulations reveal that a few agents using highly efficient sampling can achieve or surpass the performance of the majority employing moderately improved strategies, providing new insights beyond traditional analyses focusing on the worst-performing agent.
BioZero: An Efficient and Privacy-Preserving Decentralized Biometric Authentication Protocol on Open Blockchain
Authors: Junhao Lai, Taotao Wang, Shengli Zhang, Qing Yang, Soung Chang Liew
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17509
Pdf link: https://arxiv.org/pdf/2409.17509
Abstract Digital identity plays a vital role in enabling secure access to resources and services in the digital world. Traditional identity authentication methods, such as password-based and biometric authentications, have limitations in terms of security, privacy, and scalability. Decentralized authentication approaches leveraging blockchain technology have emerged as a promising solution. However, existing decentralized authentication methods often rely on indirect identity verification (e.g. using passwords or digital signatures as authentication credentials) and face challenges such as Sybil attacks. In this paper, we propose BioZero, an efficient and privacy-preserving decentralized biometric authentication protocol that can be implemented on open blockchain. BioZero leverages Pedersen commitment and homomorphic computation to protect user biometric privacy while enabling efficient verification. We enhance the protocol with non-interactive homomorphic computation and employ zero-knowledge proofs for secure on-chain verification. The unique aspect of BioZero is that it is fully decentralized and can be executed by blockchain smart contracts in a very efficient way. We analyze the security of BioZero and validate its performance through a prototype implementation. The results demonstrate the effectiveness, efficiency, and security of BioZero in decentralized authentication scenarios. Our work contributes to the advancement of decentralized identity authentication using biometrics.
Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis
Authors: Lixi Zhou, Lei Yu, Jia Zou, Hong Min
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17535
Pdf link: https://arxiv.org/pdf/2409.17535
Abstract Protecting sensitive information in diagnostic data such as logs, is a critical concern in the industrial software diagnosis and debugging process. While there are many tools developed to automatically redact the logs for identifying and removing sensitive information, they have severe limitations which can cause either over redaction and loss of critical diagnostic information (false positives), or disclosure of sensitive information (false negatives), or both. To address the problem, in this paper, we argue for a source code analysis approach for log redaction. To identify a log message containing sensitive information, our method locates the corresponding log statement in the source code with logger code augmentation, and checks if the log statement outputs data from sensitive sources by using the data flow graph built from the source code. Appropriate redaction rules are further applied depending on the sensitiveness of the data sources to preserve the privacy information in the logs. We conducted experimental evaluation and comparison with other popular baselines. The results demonstrate that our approach can significantly improve the detection precision of the sensitive information and reduce both false positives and negatives.
On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy
Authors: Saber Malekmohammadi, Golnoosh Farnadi
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2409.17538
Pdf link: https://arxiv.org/pdf/2409.17538
Abstract A significant approach in natural language processing involves large-scale pre-training on general domain data followed by adaptation to specific tasks or domains. As models grow in size, full fine-tuning all parameters becomes increasingly impractical. To address this, some methods for low-rank task adaptation of language models have been proposed, e.g. LoRA and FLoRA. These methods keep the pre-trained model weights fixed and incorporate trainable low-rank decomposition matrices into some layers of the transformer architecture, called adapters. This approach significantly reduces the number of trainable parameters required for downstream tasks compared to full fine-tuning all parameters. In this work, we look at low-rank adaptation from the lens of data privacy. We show theoretically that the low-rank adaptation used in LoRA and FLoRA is equivalent to injecting some random noise into the batch gradients w.r.t the adapter parameters coming from their full fine-tuning, and we quantify the variance of the injected noise. By establishing a Berry-Esseen type bound on the total variation distance between the noise distribution and a Gaussian distribution with the same variance, we show that the dynamics of LoRA and FLoRA are very close to differentially private full fine-tuning the adapters, which suggests that low-rank adaptation implicitly provides privacy w.r.t the fine-tuning data. Finally, using Johnson-Lindenstrauss lemma, we show that when augmented with gradient clipping, low-rank adaptation is almost equivalent to differentially private full fine-tuning adapters with a fixed noise scale.
ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition
Authors: Shen Li, Jianqing Xu, Jiaying Wu, Miao Xiong, Ailin Deng, Jiazhen Ji, Yuge Huang, Wenjie Feng, Shouhong Ding, Bryan Hooi
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2409.17576
Pdf link: https://arxiv.org/pdf/2409.17576
Abstract Synthetic face recognition (SFR) aims to generate synthetic face datasets that mimic the distribution of real face data, which allows for training face recognition models in a privacy-preserving manner. Despite the remarkable potential of diffusion models in image generation, current diffusion-based SFR models struggle with generalization to real-world faces. To address this limitation, we outline three key objectives for SFR: (1) promoting diversity across identities (inter-class diversity), (2) ensuring diversity within each identity by injecting various facial attributes (intra-class diversity), and (3) maintaining identity consistency within each identity group (intra-class identity preservation). Inspired by these goals, we introduce a diffusion-fueled SFR model termed $\text{ID}^3$. $\text{ID}^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances. Theoretically, we show that minimizing this loss is equivalent to maximizing the lower bound of an adjusted conditional log-likelihood over ID-preserving data. This equivalence motivates an ID-preserving sampling algorithm, which operates over an adjusted gradient vector field, enabling the generation of fake face recognition datasets that approximate the distribution of real-world faces. Extensive experiments across five challenging benchmarks validate the advantages of $\text{ID}^3$.
Expanding Perspectives on Data Privacy: Insights from Rural Togo
Authors: Zoe Kahn, Meyebinesso Farida Carelle Pere, Emily Aiken, Nitin Kohli, Joshua E. Blumenstock
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2409.17578
Pdf link: https://arxiv.org/pdf/2409.17578
Abstract Passively collected "big" data sources are increasingly used to inform critical development policy decisions in low- and middle-income countries. While prior work highlights how such approaches may reveal sensitive information, enable surveillance, and centralize power, less is known about the corresponding privacy concerns, hopes, and fears of the people directly impacted by these policies -- people sometimes referred to as experiential experts. To understand the perspectives of experiential experts, we conducted semi-structured interviews with people living in rural villages in Togo shortly after an entirely digital cash transfer program was launched that used machine learning and mobile phone metadata to determine program eligibility. This paper documents participants' privacy concerns surrounding the introduction of big data approaches in development policy. We find that the privacy concerns of our experiential experts differ from those raised by privacy and development domain experts. To facilitate a more robust and constructive account of privacy, we discuss implications for policies and designs that take seriously the privacy concerns raised by both experiential experts and domain experts.
Multimodal Banking Dataset: Understanding Client Needs through Event Sequences
Authors: Mollaev Dzhambulat, Alexander Kostin, Postnova Maria, Ivan Karpukhin, Ivan A Kireev, Gleb Gusev, Andrey Savchenko
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17587
Pdf link: https://arxiv.org/pdf/2409.17587
Abstract Financial organizations collect a huge amount of data about clients that typically has a temporal (sequential) structure and is collected from various sources (modalities). Due to privacy issues, there are no large-scale open-source multimodal datasets of event sequences, which significantly limits the research in this area. In this paper, we present the industrial-scale publicly available multimodal banking dataset, MBD, that contains more than 1.5M corporate clients with several modalities: 950M bank transactions, 1B geo position events, 5M embeddings of dialogues with technical support and monthly aggregated purchases of four bank's products. All entries are properly anonymized from real proprietary bank data. Using this dataset, we introduce a novel benchmark with two business tasks: campaigning (purchase prediction in the next month) and matching of clients. We provide numerical results that demonstrate the superiority of our multi-modal baselines over single-modal techniques for each task. As a result, the proposed dataset can open new perspectives and facilitate the future development of practically important large-scale multimodal algorithms for event sequences. HuggingFace Link: this https URL Github Link: this https URL
Fully Dynamic Graph Algorithms with Edge Differential Privacy
Authors: Sofya Raskhodnikova, Teresa Anna Steiner
Subjects: Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17623
Pdf link: https://arxiv.org/pdf/2409.17623
Abstract We study differentially private algorithms for analyzing graphs in the challenging setting of continual release with fully dynamic updates, where edges are inserted and deleted over time, and the algorithm is required to update the solution at every time step. Previous work has presented differentially private algorithms for many graph problems that can handle insertions only or deletions only (called partially dynamic algorithms) and obtained some hardness results for the fully dynamic setting. The only algorithms in the latter setting were for the edge count, given by Fichtenberger, Henzinger, and Ost (ESA 21), and for releasing the values of all graph cuts, given by Fichtenberger, Henzinger, and Upadhyay (ICML 23). We provide the first differentially private and fully dynamic graph algorithms for several other fundamental graph statistics (including the triangle count, the number of connected components, the size of the maximum matching, and the degree histogram), analyze their error and show strong lower bounds on the error for all algorithms in this setting. We study two variants of edge differential privacy for fully dynamic graph algorithms: event-level and item-level. We give upper and lower bounds on the error of both event-level and item-level fully dynamic algorithms for several fundamental graph problems. No fully dynamic algorithms that are private at the item-level (the more stringent of the two notions) were known before. In the case of item-level privacy, for several problems, our algorithms match our lower bounds.
AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure
Authors: Xi Chen, Zhiyang Zhang, Fangkai Yang, Xiaoting Qin, Chao Du, Xi Cheng, Hangxin Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Subjects: Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2409.17642
Pdf link: https://arxiv.org/pdf/2409.17642
Abstract Large language model (LLM)-based AI delegates are increasingly utilized to act on behalf of users, assisting them with a wide range of tasks through conversational interfaces. Despite their advantages, concerns arise regarding the potential risk of privacy leaks, particularly in scenarios involving social interactions. While existing research has focused on protecting privacy by limiting the access of AI delegates to sensitive user information, many social scenarios require disclosing private details to achieve desired outcomes, necessitating a balance between privacy protection and disclosure. To address this challenge, we conduct a pilot study to investigate user preferences for AI delegates across various social relations and task scenarios, and then propose a novel AI delegate system that enables privacy-conscious self-disclosure. Our user study demonstrates that the proposed AI delegate strategically protects privacy, pioneering its use in diverse and dynamic social interactions.
MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
Authors: Giandomenico Cornacchia, Giulio Zizzo, Kieran Fraser, Muhammad Zaid Hamed, Ambrish Rawat, Mark Purcell
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17699
Pdf link: https://arxiv.org/pdf/2409.17699
Abstract The proliferation of Large Language Models (LLMs) in diverse applications underscores the pressing need for robust security measures to thwart potential jailbreak attacks. These attacks exploit vulnerabilities within LLMs, endanger data integrity and user privacy. Guardrails serve as crucial protective mechanisms against such threats, but existing models often fall short in terms of both detection accuracy, and computational efficiency. This paper advocates for the significance of jailbreak attack prevention on LLMs, and emphasises the role of input guardrails in safeguarding these models. We introduce MoJE (Mixture of Jailbreak Expert), a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails. By employing simple linguistic statistical techniques, MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference. Through rigorous experimentation, MoJE demonstrates superior performance capable of detecting 90% of the attacks without compromising benign prompts, enhancing LLMs security against jailbreak attacks.
Demystifying Privacy in 5G Stand Alone Networks
Authors: Stavros Eleftherakis, Timothy Otim, Giuseppe Santaromita, Almudena Diaz Zayas, Domenico Giustiniano, Nicolas Kourtellis
Subjects: Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2409.17700
Pdf link: https://arxiv.org/pdf/2409.17700
Abstract Ensuring user privacy remains critical in mobile networks, particularly with the rise of connected devices and denser 5G infrastructure. Privacy concerns have persisted across 2G, 3G, and 4G/LTE networks. Recognizing these concerns, the 3rd Generation Partnership Project (3GPP) has made privacy enhancements in 5G Release 15. However, the extent of operator adoption remains unclear, especially as most networks operate in 5G Non Stand Alone (NSA) mode, relying on 4G Core Networks. This study provides the first qualitative and experimental comparison between 5G NSA and Stand Alone (SA) in real operator networks, focusing on privacy enhancements addressing top eight pre-5G attacks based on recent academic literature. Additionally, it evaluates the privacy levels of OpenAirInterface (OAI), a leading open-source software for 5G, against real network deployments for the same attacks. The analysis reveals two new 5G privacy vulnerabilities, underscoring the need for further research and stricter standards.
TADAR: Thermal Array-based Detection and Ranging for Privacy-Preserving Human Sensing
Authors: Xie Zhang, Chenshu Wu
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2409.17742
Pdf link: https://arxiv.org/pdf/2409.17742
Abstract Human sensing has gained increasing attention in various applications. Among the available technologies, visual images offer high accuracy, while sensing on the RF spectrum preserves privacy, creating a conflict between imaging resolution and privacy preservation. In this paper, we explore thermal array sensors as an emerging modality that strikes an excellent resolution-privacy balance for ubiquitous sensing. To this end, we present TADAR, the first multi-user Thermal Array-based Detection and Ranging system that estimates the inherently missing range information, extending thermal array outputs from 2D thermal pixels to 3D depths and empowering them as a promising modality for ubiquitous privacy-preserving human sensing. We prototype TADAR using a single commodity thermal array sensor and conduct extensive experiments in different indoor environments. Our results show that TADAR achieves a mean F1 score of 88.8% for multi-user detection and a mean accuracy of 32.0 cm for multi-user ranging, which further improves to 20.1 cm for targets located within 3 m. We conduct two case studies on fall detection and occupancy estimation to showcase the potential applications of TADAR. We hope TADAR will inspire the vast community to explore new directions of thermal array sensing, beyond wireless and acoustic sensing. TADAR is open-sourced on GitHub: this https URL.
Privacy for Quantum Annealing. Attack on Spin Reversal Transformations in the case of cryptanalysis
Authors: Mateusz Leśniak, Michał Wroński
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17744
Pdf link: https://arxiv.org/pdf/2409.17744
Abstract This paper demonstrates that applying spin reversal transformations (SRT), commonly known as a sufficient method for privacy enhancing in problems solved using quantum annealing, does not guarantee privacy for all possible problems. We show how to recover the original problem from the Ising problem obtained using SRT when the resulting problem in Ising form represents the algebraic attack on the $E_0$ stream cipher. A small example is used to illustrate how to retrieve the original problem from the one transformed by SRT. Moreover, it is shown that our method is efficient even for full-scale problems.
Byzantine-Robust Aggregation for Securing Decentralized Federated Learning
Authors: Diego Cajaraville-Aboy, Ana Fernández-Vilas, Rebeca P. Díaz-Redondo, Manuel Fernández-Veiga
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17754
Pdf link: https://arxiv.org/pdf/2409.17754
Abstract Federated Learning (FL) emerges as a distributed machine learning approach that addresses privacy concerns by training AI models locally on devices. Decentralized Federated Learning (DFL) extends the FL paradigm by eliminating the central server, thereby enhancing scalability and robustness through the avoidance of a single point of failure. However, DFL faces significant challenges in optimizing security, as most Byzantine-robust algorithms proposed in the literature are designed for centralized scenarios. In this paper, we present a novel Byzantine-robust aggregation algorithm to enhance the security of Decentralized Federated Learning environments, coined WFAgg. This proposal handles the adverse conditions and strength robustness of dynamic decentralized topologies at the same time by employing multiple filters to identify and mitigate Byzantine attacks. Experimental results demonstrate the effectiveness of the proposed algorithm in maintaining model accuracy and convergence in the presence of various Byzantine attack scenarios, outperforming state-of-the-art centralized Byzantine-robust aggregation schemes (such as Multi-Krum or Clustering). These algorithms are evaluated on an IID image classification problem in both centralized and decentralized scenarios.
Federated Learning under Attack: Improving Gradient Inversion for Batch of Images
Authors: Luiz Leite, Yuri Santo, Bruno L. Dalmazo, André Riker
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17767
Pdf link: https://arxiv.org/pdf/2409.17767
Abstract Federated Learning (FL) has emerged as a machine learning approach able to preserve the privacy of user's data. Applying FL, clients train machine learning models on a local dataset and a central server aggregates the learned parameters coming from the clients, training a global machine learning model without sharing user's data. However, the state-of-the-art shows several approaches to promote attacks on FL systems. For instance, inverting or leaking gradient attacks can find, with high precision, the local dataset used during the training phase of the FL. This paper presents an approach, called Deep Leakage from Gradients with Feedback Blending (DLG-FB), which is able to improve the inverting gradient attack, considering the spatial correlation that typically exists in batches of images. The performed evaluation shows an improvement of 19.18% and 48,82% in terms of attack success rate and the number of iterations per attacked image, respectively.
Implementing a Nordic-Baltic Federated Health Data Network: a case report
Authors: Taridzo Chomutare, Aleksandar Babic, Laura-Maria Peltonen, Silja Elunurm, Peter Lundberg, Arne Jönsson, Emma Eneling, Ciprian-Virgil Gerstenberger, Troels Siggaard, Raivo Kolde, Oskar Jerdhaf, Martin Hansson, Alexandra Makhlysheva, Miroslav Muzny, Erik Ylipää, Søren Brunak, Hercules Dalianis
Subjects: Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17865
Pdf link: https://arxiv.org/pdf/2409.17865
Abstract Background: Centralized collection and processing of healthcare data across national borders pose significant challenges, including privacy concerns, data heterogeneity and legal barriers. To address some of these challenges, we formed an interdisciplinary consortium to develop a feder-ated health data network, comprised of six institutions across five countries, to facilitate Nordic-Baltic cooperation on secondary use of health data. The objective of this report is to offer early insights into our experiences developing this network. Methods: We used a mixed-method ap-proach, combining both experimental design and implementation science to evaluate the factors affecting the implementation of our network. Results: Technically, our experiments indicate that the network functions without significant performance degradation compared to centralized simu-lation. Conclusion: While use of interdisciplinary approaches holds a potential to solve challeng-es associated with establishing such collaborative networks, our findings turn the spotlight on the uncertain regulatory landscape playing catch up and the significant operational costs.
Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection
Authors: Andrea Toaiari, Vittorio Murino, Marco Cristani, Cigdem Beyan
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2409.17886
Pdf link: https://arxiv.org/pdf/2409.17886
Abstract Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person's face, thus promoting privacy preservation in various application contexts. The code is available at this https URL.
Adaptive Stream Processing on Edge Devices through Active Inference
Authors: Boris Sedlak, Victor Casamayor Pujol, Andrea Morichetta, Praveen Kumar Donta, Schahram Dustdar
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2409.17937
Pdf link: https://arxiv.org/pdf/2409.17937
Abstract The current scenario of IoT is witnessing a constant increase on the volume of data, which is generated in constant stream, calling for novel architectural and logical solutions for processing it. Moving the data handling towards the edge of the computing spectrum guarantees better distribution of load and, in principle, lower latency and better privacy. However, managing such a structure is complex, especially when requirements, also referred to Service Level Objectives (SLOs), specified by applications' owners and infrastructure managers need to be ensured. Despite the rich number of proposals of Machine Learning (ML) based management solutions, researchers and practitioners yet struggle to guarantee long-term prediction and control, and accurate troubleshooting. Therefore, we present a novel ML paradigm based on Active Inference (AIF) -- a concept from neuroscience that describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise. We implement it and evaluate it in a heterogeneous real stream processing use case, where an AIF-based agent continuously optimizes the fulfillment of three SLOs for three autonomous driving services running on multiple devices. The agent used causal knowledge to gradually develop an understanding of how its actions are related to requirements fulfillment, and which configurations to favor. Through this approach, our agent requires up to thirty iterations to converge to the optimal solution, showing the capability of offering accurate results in a short amount of time. Furthermore, thanks to AIF and its causal structures, our method guarantees full transparency on the decision making, making the interpretation of the results and the troubleshooting effortless.
Slowly Scaling Per-Record Differential Privacy
Authors: Brian Finley, Anthony M Caruso, Justin C Doty, Ashwin Machanavajjhala, Mikaela R Meyer, David Pujol, William Sexton, Zachary Terner
Subjects: Subjects: Cryptography and Security (cs.CR); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2409.18118
Pdf link: https://arxiv.org/pdf/2409.18118
Abstract We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
Keyword: machine learning

Neural Network Architecture Search Enabled Wide-Deep Learning (NAS-WD) for Spatially Heterogenous Property Awared Chicken Woody Breast Classification and Hardness Regression
Authors: Chaitanya Pallerla, Yihong Feng, Casey M. Owens, Ramesh Bahadur Bist, Siavash Mahmoudi, Pouya Sohrabipour, Amirreza Davar, Dongyi Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2409.17210
Pdf link: https://arxiv.org/pdf/2409.17210
Abstract Due to intensive genetic selection for rapid growth rates and high broiler yields in recent years, the global poultry industry has faced a challenging problem in the form of woody breast (WB) conditions. This condition has caused significant economic losses as high as $200 million annually, and the root cause of WB has yet to be identified. Human palpation is the most common method of distinguishing a WB from others. However, this method is time-consuming and subjective. Hyperspectral imaging (HSI) combined with machine learning algorithms can evaluate the WB conditions of fillets in a non-invasive, objective, and high-throughput manner. In this study, 250 raw chicken breast fillet samples (normal, mild, severe) were taken, and spatially heterogeneous hardness distribution was first considered when designing HSI processing models. The study not only classified the WB levels from HSI but also built a regression model to correlate the spectral information with sample hardness data. To achieve a satisfactory classification and regression model, a neural network architecture search (NAS) enabled a wide-deep neural network model named NAS-WD, which was developed. In NAS-WD, NAS was first used to automatically optimize the network architecture and hyperparameters. The classification results show that NAS-WD can classify the three WB levels with an overall accuracy of 95%, outperforming the traditional machine learning model, and the regression correlation between the spectral data and hardness was 0.75, which performs significantly better than traditional regression models.
Collaborative Comic Generation: Integrating Visual Narrative Theories with AI Models for Enhanced Creativity
Authors: Yi-Chun Chen, Arnav Jhala
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17263
Pdf link: https://arxiv.org/pdf/2409.17263
Abstract This study presents a theory-inspired visual narrative generative system that integrates conceptual principles-comic authoring idioms-with generative and language models to enhance the comic creation process. Our system combines human creativity with AI models to support parts of the generative process, providing a collaborative platform for creating comic content. These comic-authoring idioms, derived from prior human-created image sequences, serve as guidelines for crafting and refining storytelling. The system translates these principles into system layers that facilitate comic creation through sequential decision-making, addressing narrative elements such as panel composition, story tension changes, and panel transitions. Key contributions include integrating machine learning models into the human-AI cooperative comic generation process, deploying abstract narrative theories into AI-driven comic creation, and a customizable tool for narrative-driven image sequences. This approach improves narrative elements in generated image sequences and engages human creativity in an AI-generative process of comics. We open-source the code at this https URL.
AAPM: Large Language Model Agent-based Asset Pricing Models
Authors: Junyan Cheng, Peter Chin
Subjects: Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2409.17266
Pdf link: https://arxiv.org/pdf/2409.17266
Abstract In this study, we propose a novel asset pricing approach, LLM Agent-based Asset Pricing Models (AAPM), which fuses qualitative discretionary investment analysis from LLM agents and quantitative manual financial economic factors to predict excess asset returns. The experimental results show that our approach outperforms machine learning-based asset pricing baselines in portfolio optimization and asset pricing errors. Specifically, the Sharpe ratio and average $|\alpha|$ for anomaly portfolios improved significantly by 9.6\% and 10.8\% respectively. In addition, we conducted extensive ablation studies on our model and analysis of the data to reveal further insights into the proposed method.
Model aggregation: minimizing empirical variance outperforms minimizing empirical error
Authors: Théo Bourdais, Houman Owhadi
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2409.17267
Pdf link: https://arxiv.org/pdf/2409.17267
Abstract Whether deterministic or stochastic, models can be viewed as functions designed to approximate a specific quantity of interest. We propose a data-driven framework that aggregates predictions from diverse models into a single, more accurate output. This aggregation approach exploits each model's strengths to enhance overall accuracy. It is non-intrusive - treating models as black-box functions - model-agnostic, requires minimal assumptions, and can combine outputs from a wide range of models, including those from machine learning and numerical solvers. We argue that the aggregation process should be point-wise linear and propose two methods to find an optimal aggregate: Minimal Error Aggregation (MEA), which minimizes the aggregate's prediction error, and Minimal Variance Aggregation (MVA), which minimizes its variance. While MEA is inherently more accurate when correlations between models and the target quantity are perfectly known, Minimal Empirical Variance Aggregation (MEVA), an empirical version of MVA - consistently outperforms Minimal Empirical Error Aggregation (MEEA), the empirical counterpart of MEA, when these correlations must be estimated from data. The key difference is that MEVA constructs an aggregate by estimating model errors, while MEEA treats the models as features for direct interpolation of the quantity of interest. This makes MEEA more susceptible to overfitting and poor generalization, where the aggregate may underperform individual models during testing. We demonstrate the versatility and effectiveness of our framework in various applications, such as data science and partial differential equations, showing how it successfully integrates traditional solvers with machine learning models to improve both robustness and accuracy.
Building Real-time Awareness of Out-of-distribution in Trajectory Prediction for Autonomous Vehicles
Authors: Tongfei (Felicia)Guo, Taposh Banerjee, Rui Liu, Lili Su
Subjects: Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17277
Pdf link: https://arxiv.org/pdf/2409.17277
Abstract Trajectory prediction describes the motions of surrounding moving obstacles for an autonomous vehicle; it plays a crucial role in enabling timely decision-making, such as collision avoidance and trajectory replanning. Accurate trajectory planning is the key to reliable vehicle deployments in open-world environment, where unstructured obstacles bring in uncertainties that are impossible to fully capture by training data. For traditional machine learning tasks, such uncertainties are often addressed reasonably well via methods such as continual learning. On the one hand, naively applying those methods to trajectory prediction can result in continuous data collection and frequent model updates, which can be resource-intensive. On the other hand, the predicted trajectories can be far away from the true trajectories, leading to unsafe decision-making. In this paper, we aim to establish real-time awareness of out-of-distribution in trajectory prediction for autonomous vehicles. We focus on the challenging and practically relevant setting where the out-of-distribution is deceptive, that is, the one not easily detectable by human intuition. Drawing on the well-established techniques of sequential analysis, we build real-time awareness of out-of-distribution by monitoring prediction errors using the quickest change point detection (QCD). Our solutions are lightweight and can handle the occurrence of out-of-distribution at any time during trajectory prediction inference. Experimental results on multiple real-world datasets using a benchmark trajectory prediction model demonstrate the effectiveness of our methods.
Investigating Privacy Attacks in the Gray-Box Setting to Enhance Collaborative Learning Schemes
Authors: Federico Mazzone, Ahmad Al Badawi, Yuriy Polyakov, Maarten Everts, Florian Hahn, Andreas Peter
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2409.17283
Pdf link: https://arxiv.org/pdf/2409.17283
Abstract The notion that collaborative machine learning can ensure privacy by just withholding the raw data is widely acknowledged to be flawed. Over the past seven years, the literature has revealed several privacy attacks that enable adversaries to extract information about a model's training dataset by exploiting access to model parameters during or after training. In this work, we study privacy attacks in the gray-box setting, where the attacker has only limited access - in terms of view and actions - to the model. The findings of our investigation provide new insights for the development of privacy-preserving collaborative learning solutions. We deploy SmartCryptNN, a framework that tailors homomorphic encryption to protect the portions of the model posing higher privacy risks. Our solution offers a trade-off between privacy and efficiency, which varies based on the extent and selection of the model components we choose to protect. We explore it on dense neural networks, where through extensive evaluation of diverse datasets and architectures, we uncover instances where a favorable sweet spot in the trade-off can be achieved by safeguarding only a single layer of the network. In one of such instances, our approach trains ~4 times faster compared to fully encrypted solutions, while reducing membership leakage by 17.8 times compared to plaintext solutions.
Scalable quality control on processing of large diffusion-weighted and structural magnetic resonance imaging datasets
Authors: Michael E. Kim, Chenyu Gao, Karthik Ramadass, Praitayini Kanakaraj, Nancy R. Newlin, Gaurav Rudravaram, Kurt G. Schilling, Blake E. Dewey, David A. Bennett, Sid OBryant, Robert C. Barber, Derek Archer, Timothy J. Hohman, Shunxing Bao, Zhiyuan Li, Bennett A. Landman, Nazirah Mohd Khairi, The Alzheimers Disease Neuroimaging Initiative, The HABSHD Study Team
Subjects: Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2409.17286
Pdf link: https://arxiv.org/pdf/2409.17286
Abstract Proper quality control (QC) is time consuming when working with large-scale medical imaging datasets, yet necessary, as poor-quality data can lead to erroneous conclusions or poorly trained machine learning models. Most efforts to reduce data QC time rely on outlier detection, which cannot capture every instance of algorithm failure. Thus, there is a need to visually inspect every output of data processing pipelines in a scalable manner. We design a QC pipeline that allows for low time cost and effort across a team setting for a large database of diffusion weighted and structural magnetic resonance images. Our proposed method satisfies the following design criteria: 1.) a consistent way to perform and manage quality control across a team of researchers, 2.) quick visualization of preprocessed data that minimizes the effort and time spent on the QC process without compromising the condition or caliber of the QC, and 3.) a way to aggregate QC results across pipelines and datasets that can be easily shared. In addition to meeting these design criteria, we also provide information on what a successful output should be and common occurrences of algorithm failures for various processing pipelines. Our method reduces the time spent on QC by a factor of over 20 when compared to naively opening outputs in an image viewer and demonstrate how it can facilitate aggregation and sharing of QC results within a team. While researchers must spend time on robust visual QC of data, there are mechanisms by which the process can be streamlined and efficient.
The poison of dimensionality
Authors: Lê-Nguyên Hoang
Subjects: Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2409.17328
Pdf link: https://arxiv.org/pdf/2409.17328
Abstract This paper advances the understanding of how the size of a machine learning model affects its vulnerability to poisoning, despite state-of-the-art defenses. Given isotropic random honest feature vectors and the geometric median (or clipped mean) as the robust gradient aggregator rule, we essentially prove that, perhaps surprisingly, linear and logistic regressions with $D \geq 169 H^2/P^2$ parameters are subject to arbitrary model manipulation by poisoners, where $H$ and $P$ are the numbers of honestly labeled and poisoned data points used for training. Our experiments go on exposing a fundamental tradeoff between augmenting model expressivity and increasing the poisoners' attack surface, on both synthetic data, and on MNIST & FashionMNIST data for linear classifiers with random features. We also discuss potential implications for source-based learning and neural nets.
Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Authors: Ruiquan Huang, Yingbin Liang, Jing Yang
Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2409.17335
Pdf link: https://arxiv.org/pdf/2409.17335
Abstract Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data, especially in next-token prediction (NTP) tasks. However, the theoretical understanding of their performance in NTP is limited, with existing studies focusing mainly on asymptotic performance. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer consisting of a self-attention module followed by a feed-forward layer. We first characterize the essential structural properties of training datasets for NTP using a mathematical framework based on partial orders. Then, we design a two-stage training algorithm, where the pre-processing stage for training the feed-forward layer and the main stage for training the attention layer exhibit fast convergence performance. Specifically, both layers converge sub-linearly to the direction of their corresponding max-margin solutions. We also show that the cross-entropy loss enjoys a linear convergence rate. Furthermore, we show that the trained transformer presents non-trivial prediction ability with dataset shift, which sheds light on the remarkable generalization performance of transformers. Our analysis technique involves the development of novel properties on the attention gradient and further in-depth analysis of how these properties contribute to the convergence of the training process. Our experiments further validate our theoretical findings.
The Technology of Outrage: Bias in Artificial Intelligence
Authors: Will Bridewell, Paul F. Bello, Selmer Bringsjord
Subjects: Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17336
Pdf link: https://arxiv.org/pdf/2409.17336
Abstract Artificial intelligence and machine learning are increasingly used to offload decision making from people. In the past, one of the rationales for this replacement was that machines, unlike people, can be fair and unbiased. Evidence suggests otherwise. We begin by entertaining the ideas that algorithms can replace people and that algorithms cannot be biased. Taken as axioms, these statements quickly lead to absurdity. Spurred on by this result, we investigate the slogans more closely and identify equivocation surrounding the word 'bias.' We diagnose three forms of outrage-intellectual, moral, and political-that are at play when people react emotionally to algorithmic bias. Then we suggest three practical approaches to addressing bias that the AI community could take, which include clarifying the language around bias, developing new auditing methods for intelligent systems, and building certain capabilities into these systems. We conclude by offering a moral regarding the conversations about algorithmic bias that may transfer to other areas of artificial intelligence.
data2lang2vec: Data Driven Typological Features Completion
Authors: Hamidreza Amirzadeh, Sadegh Jafari, Anika Harju, Rob van der Goot
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2409.17373
Pdf link: https://arxiv.org/pdf/2409.17373
Abstract Language typology databases enhance multi-lingual Natural Language Processing (NLP) by improving model adaptability to diverse linguistic structures. The widely-used lang2vec toolkit integrates several such databases, but its coverage remains limited at 28.9\%. Previous work on automatically increasing coverage predicts missing values based on features from other languages or focuses on single features, we propose to use textual data for better-informed feature prediction. To this end, we introduce a multi-lingual Part-of-Speech (POS) tagger, achieving over 70\% accuracy across 1,749 languages, and experiment with external statistical features and a variety of machine learning algorithms. We also introduce a more realistic evaluation setup, focusing on likely to be missing typology features, and show that our approach outperforms previous work in both setups.
AI Enabled Neutron Flux Measurement and Virtual Calibration in Boiling Water Reactors
Authors: Anirudh Tunga, Jordan Heim, Michael Mueterthies, Thomas Gruenwald, Jonathan Nistor
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17405
Pdf link: https://arxiv.org/pdf/2409.17405
Abstract Accurately capturing the three dimensional power distribution within a reactor core is vital for ensuring the safe and economical operation of the reactor, compliance with Technical Specifications, and fuel cycle planning (safety, control, and performance evaluation). Offline (that is, during cycle planning and core design), a three dimensional neutronics simulator is used to estimate the reactor's power, moderator, void, and flow distributions, from which margin to thermal limits and fuel exposures can be approximated. Online, this is accomplished with a system of local power range monitors (LPRMs) designed to capture enough neutron flux information to infer the full nodal power distribution. Certain problems with this process, ranging from measurement and calibration to the power adaption process, pose challenges to operators and limit the ability to design reload cores economically (e.g., engineering in insufficient margin or more margin than required). Artificial intelligence (AI) and machine learning (ML) are being used to solve the problems to reduce maintenance costs, improve the accuracy of online local power measurements, and decrease the bias between offline and online power distributions, thereby leading to a greater ability to design safe and economical reload cores. We present ML models trained from two deep neural network (DNN) architectures, SurrogateNet and LPRMNet, that demonstrate a testing error of 1 percent and 3 percent, respectively. Applications of these models can include virtual sensing capability for bypassed or malfunctioning LPRMs, on demand virtual calibration of detectors between successive calibrations, highly accurate nuclear end of life determinations for LPRMs, and reduced bias between measured and predicted power distributions within the core.
Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD
Authors: Jie Hu, Yi-Ting Ma, Do Young Eun
Subjects: Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2409.17499
Pdf link: https://arxiv.org/pdf/2409.17499
Abstract Distributed learning is essential to train machine learning algorithms across heterogeneous agents while maintaining data privacy. We conduct an asymptotic analysis of Unified Distributed SGD (UD-SGD), exploring a variety of communication patterns, including decentralized SGD and local SGD within Federated Learning (FL), as well as the increasing communication interval in the FL setting. In this study, we assess how different sampling strategies, such as i.i.d. sampling, shuffling, and Markovian sampling, affect the convergence speed of UD-SGD by considering the impact of agent dynamics on the limiting covariance matrix as described in the Central Limit Theorem (CLT). Our findings not only support existing theories on linear speedup and asymptotic network independence, but also theoretically and empirically show how efficient sampling strategies employed by individual agents contribute to overall convergence in UD-SGD. Simulations reveal that a few agents using highly efficient sampling can achieve or surpass the performance of the majority employing moderately improved strategies, providing new insights beyond traditional analyses focusing on the worst-performing agent.
Functional Classification of Spiking Signal Data Using Artificial Intelligence Techniques: A Review
Authors: Danial Sharifrazi, Nouman Javed, Javad Hassannataj Joloudari, Roohallah Alizadehsani, Prasad N. Paradkar, Ru-San Tan, U. Rajendra Acharya, Asim Bhatti
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2409.17516
Pdf link: https://arxiv.org/pdf/2409.17516
Abstract Human brain neuron activities are incredibly significant nowadays. Neuronal behavior is assessed by analyzing signal data such as electroencephalography (EEG), which can offer scientists valuable information about diseases and human-computer interaction. One of the difficulties researchers confront while evaluating these signals is the existence of large volumes of spike data. Spikes are some considerable parts of signal data that can happen as a consequence of vital biomarkers or physical issues such as electrode movements. Hence, distinguishing types of spikes is important. From this spot, the spike classification concept commences. Previously, researchers classified spikes manually. The manual classification was not precise enough as it involves extensive analysis. Consequently, Artificial Intelligence (AI) was introduced into neuroscience to assist clinicians in classifying spikes correctly. This review discusses the importance and use of AI in spike classification, focusing on the recognition of neural activity noises. The task is divided into three main components: preprocessing, classification, and evaluation. Existing methods are introduced and their importance is determined. The review also highlights the need for more efficient algorithms. The primary goal is to provide a perspective on spike classification for future research and provide a comprehensive understanding of the methodologies and issues involved. The review organizes materials in the spike classification field for future studies. In this work, numerous studies were extracted from different databases. The PRISMA-related research guidelines were then used to choose papers. Then, research studies based on spike classification using machine learning and deep learning approaches with effective preprocessing were selected.
Expanding Perspectives on Data Privacy: Insights from Rural Togo
Authors: Zoe Kahn, Meyebinesso Farida Carelle Pere, Emily Aiken, Nitin Kohli, Joshua E. Blumenstock
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2409.17578
Pdf link: https://arxiv.org/pdf/2409.17578
Abstract Passively collected "big" data sources are increasingly used to inform critical development policy decisions in low- and middle-income countries. While prior work highlights how such approaches may reveal sensitive information, enable surveillance, and centralize power, less is known about the corresponding privacy concerns, hopes, and fears of the people directly impacted by these policies -- people sometimes referred to as experiential experts. To understand the perspectives of experiential experts, we conducted semi-structured interviews with people living in rural villages in Togo shortly after an entirely digital cash transfer program was launched that used machine learning and mobile phone metadata to determine program eligibility. This paper documents participants' privacy concerns surrounding the introduction of big data approaches in development policy. We find that the privacy concerns of our experiential experts differ from those raised by privacy and development domain experts. To facilitate a more robust and constructive account of privacy, we discuss implications for policies and designs that take seriously the privacy concerns raised by both experiential experts and domain experts.
Provable Performance Guarantees of Copy Detection Patterns
Authors: Joakim Tutt, Slava Voloshynovskiy
Subjects: Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2409.17649
Pdf link: https://arxiv.org/pdf/2409.17649
Abstract Copy Detection Patterns (CDPs) are crucial elements in modern security applications, playing a vital role in safeguarding industries such as food, pharmaceuticals, and cosmetics. Current performance evaluations of CDPs predominantly rely on empirical setups using simplistic metrics like Hamming distances or Pearson correlation. These methods are often inadequate due to their sensitivity to distortions, degradation, and their limitations to stationary statistics of printing and imaging. Additionally, machine learning-based approaches suffer from distribution biases and fail to generalize to unseen counterfeit samples. Given the critical importance of CDPs in preventing counterfeiting, including the counterfeit vaccines issue highlighted during the COVID-19 pandemic, there is an urgent need for provable performance guarantees across various criteria. This paper aims to establish a theoretical framework to derive optimal criteria for the analysis, optimization, and future development of CDP authentication technologies, ensuring their reliability and effectiveness in diverse security scenarios.
Optimal Memorization Capacity of Transformers
Authors: Tokio Kajitsuka, Issei Sato
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17677
Pdf link: https://arxiv.org/pdf/2409.17677
Abstract Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with $\tilde{O}(\sqrt{N})$ parameters in a next-token prediction setting for $N$ input sequences of length $n$, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length $n$ owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that $\tilde{O}(\sqrt{nN})$ parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.
Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets
Authors: Yasaman Haghbin, Hadi Moradi, Reshad Hosseini
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17685
Pdf link: https://arxiv.org/pdf/2409.17685
Abstract One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models is dependent on the quantity of the training dataset. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson's disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to baseline, GN and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.
Recent advances in interpretable machine learning using structure-based protein representations
Authors: Luiz Felipe Vecchietti, Minji Lee, Begench Hangeldiyev, Hyunkyu Jung, Hahnbeom Park, Tae-Kyun Kim, Meeyoung Cha, Ho Min Kim
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17726
Pdf link: https://arxiv.org/pdf/2409.17726
Abstract Recent advancements in machine learning (ML) are transforming the field of structural biology. For example, AlphaFold, a groundbreaking neural network for protein structure prediction, has been widely adopted by researchers. The availability of easy-to-use interfaces and interpretable outcomes from the neural network architecture, such as the confidence scores used to color the predicted structures, have made AlphaFold accessible even to non-ML experts. In this paper, we present various methods for representing protein 3D structures from low- to high-resolution, and show how interpretable ML methods can support tasks such as predicting protein structures, protein function, and protein-protein interactions. This survey also emphasizes the significance of interpreting and visualizing ML-based inference for structure-based protein representations that enhance interpretability and knowledge discovery. Developing such interpretable approaches promises to further accelerate fields including drug development and protein design.
Byzantine-Robust Aggregation for Securing Decentralized Federated Learning
Authors: Diego Cajaraville-Aboy, Ana Fernández-Vilas, Rebeca P. Díaz-Redondo, Manuel Fernández-Veiga
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17754
Pdf link: https://arxiv.org/pdf/2409.17754
Abstract Federated Learning (FL) emerges as a distributed machine learning approach that addresses privacy concerns by training AI models locally on devices. Decentralized Federated Learning (DFL) extends the FL paradigm by eliminating the central server, thereby enhancing scalability and robustness through the avoidance of a single point of failure. However, DFL faces significant challenges in optimizing security, as most Byzantine-robust algorithms proposed in the literature are designed for centralized scenarios. In this paper, we present a novel Byzantine-robust aggregation algorithm to enhance the security of Decentralized Federated Learning environments, coined WFAgg. This proposal handles the adverse conditions and strength robustness of dynamic decentralized topologies at the same time by employing multiple filters to identify and mitigate Byzantine attacks. Experimental results demonstrate the effectiveness of the proposed algorithm in maintaining model accuracy and convergence in the presence of various Byzantine attack scenarios, outperforming state-of-the-art centralized Byzantine-robust aggregation schemes (such as Multi-Krum or Clustering). These algorithms are evaluated on an IID image classification problem in both centralized and decentralized scenarios.
Federated Learning under Attack: Improving Gradient Inversion for Batch of Images
Authors: Luiz Leite, Yuri Santo, Bruno L. Dalmazo, André Riker
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2409.17767
Pdf link: https://arxiv.org/pdf/2409.17767
Abstract Federated Learning (FL) has emerged as a machine learning approach able to preserve the privacy of user's data. Applying FL, clients train machine learning models on a local dataset and a central server aggregates the learned parameters coming from the clients, training a global machine learning model without sharing user's data. However, the state-of-the-art shows several approaches to promote attacks on FL systems. For instance, inverting or leaking gradient attacks can find, with high precision, the local dataset used during the training phase of the FL. This paper presents an approach, called Deep Leakage from Gradients with Feedback Blending (DLG-FB), which is able to improve the inverting gradient attack, considering the spatial correlation that typically exists in batches of images. The performed evaluation shows an improvement of 19.18% and 48,82% in terms of attack success rate and the number of iterations per attacked image, respectively.
Predicting the Stay Length of Patients in Hospitals using Convolutional Gated Recurrent Deep Learning Model
Authors: Mehdi Neshat, Michael Phipps, Chris A. Browne, Nicole T. Vargas, Seyedali Mirjalili
Subjects: Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17786
Pdf link: https://arxiv.org/pdf/2409.17786
Abstract Predicting hospital length of stay (LoS) stands as a critical factor in shaping public health strategies. This data serves as a cornerstone for governments to discern trends, patterns, and avenues for enhancing healthcare delivery. In this study, we introduce a robust hybrid deep learning model, a combination of Multi-layer Convolutional (CNNs) deep learning, Gated Recurrent Units (GRU), and Dense neural networks, that outperforms 11 conventional and state-of-the-art Machine Learning (ML) and Deep Learning (DL) methodologies in accurately forecasting inpatient hospital stay duration. Our investigation delves into the implementation of this hybrid model, scrutinising variables like geographic indicators tied to caregiving institutions, demographic markers encompassing patient ethnicity, race, and age, as well as medical attributes such as the CCS diagnosis code, APR DRG code, illness severity metrics, and hospital stay duration. Statistical evaluations reveal the pinnacle LoS accuracy achieved by our proposed model (CNN-GRU-DNN), which averages at 89% across a 10-fold cross-validation test, surpassing LSTM, BiLSTM, GRU, and Convolutional Neural Networks (CNNs) by 19%, 18.2%, 18.6%, and 7%, respectively. Accurate LoS predictions not only empower hospitals to optimise resource allocation and curb expenses associated with prolonged stays but also pave the way for novel strategies in hospital stay management. This avenue holds promise for catalysing advancements in healthcare research and innovation, inspiring a new era of precision-driven healthcare practices.
Bias Assessment and Data Drift Detection in Medical Image Analysis: A Survey
Authors: Andrea Prenner, Bernhard Kainz
Subjects: Subjects: Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2409.17800
Pdf link: https://arxiv.org/pdf/2409.17800
Abstract Machine Learning (ML) models have gained popularity in medical imaging analysis given their expert level performance in many medical domains. To enhance the trustworthiness, acceptance, and regulatory compliance of medical imaging models and to facilitate their integration into clinical settings, we review and categorise methods for ensuring ML reliability, both during development and throughout the model's lifespan. Specifically, we provide an overview of methods assessing models' inner-workings regarding bias encoding and detection of data drift for disease classification models. Additionally, to evaluate the severity in case of a significant drift, we provide an overview of the methods developed for classifier accuracy estimation in case of no access to ground truth labels. This should enable practitioners to implement methods ensuring reliable ML deployment and consistent prediction performance over time.
Sentiment Analysis of ML Projects: Bridging Emotional Intelligence and Code Quality
Authors: Md Shoaib Ahmed, Dongyoung Park, Nasir U. Eisty
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2409.17885
Pdf link: https://arxiv.org/pdf/2409.17885
Abstract This study explores the intricate relationship between sentiment analysis (SA) and code quality within machine learning (ML) projects, illustrating how the emotional dynamics of developers affect the technical and functional attributes of software projects. Recognizing the vital role of developer sentiments, this research employs advanced sentiment analysis techniques to scrutinize affective states from textual interactions such as code comments, commit messages, and issue discussions within high-profile ML projects. By integrating a comprehensive dataset of popular ML repositories, this analysis applies a blend of rule-based, machine learning, and hybrid sentiment analysis methodologies to systematically quantify sentiment scores. The emotional valence expressed by developers is then correlated with a spectrum of code quality indicators, including the prevalence of bugs, vulnerabilities, security hotspots, code smells, and duplication instances. Findings from this study distinctly illustrate that positive sentiments among developers are strongly associated with superior code quality metrics manifested through reduced bugs and lower incidence of code smells. This relationship underscores the importance of fostering positive emotional environments to enhance productivity and code craftsmanship. Conversely, the analysis reveals that negative sentiments correlate with an uptick in code issues, particularly increased duplication and heightened security risks, pointing to the detrimental effects of adverse emotional conditions on project health.
A multi-source data power load forecasting method using attention mechanism-based parallel cnn-gru
Authors: Chao Min, Yijia Wang, Bo Zhang, Xin Ma, Junyi Cui
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17889
Pdf link: https://arxiv.org/pdf/2409.17889
Abstract Accurate power load forecasting is crucial for improving energy efficiency and ensuring power supply quality. Considering the power load forecasting problem involves not only dynamic factors like historical load variations but also static factors such as climate conditions that remain constant over specific periods. From the model-agnostic perspective, this paper proposes a parallel structure network to extract important information from both dynamic and static data. Firstly, based on complexity learning theory, it is demonstrated that models integrated through parallel structures exhibit superior generalization abilities compared to individual base learners. Additionally, the higher the independence between base learners, the stronger the generalization ability of the parallel structure model. This suggests that the structure of machine learning models inherently contains significant information. Building on this theoretical foundation, a parallel convolutional neural network (CNN)-gate recurrent unit (GRU) attention model (PCGA) is employed to address the power load forecasting issue, aiming to effectively integrate the influences of dynamic and static features. The CNN module is responsible for capturing spatial characteristics from static data, while the GRU module captures long-term dependencies in dynamic time series data. The attention layer is designed to focus on key information from the spatial-temporal features extracted by the parallel CNN-GRU. To substantiate the advantages of the parallel structure model in extracting and integrating multi-source information, a series of experiments are conducted.
Designing Short-Stage CDC-XPUFs: Balancing Reliability, Cost, and Security in IoT Devices
Authors: Gaoxiang Li, Yu Zhuang
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17902
Pdf link: https://arxiv.org/pdf/2409.17902
Abstract The rapid expansion of Internet of Things (IoT) devices demands robust and resource-efficient security solutions. Physically Unclonable Functions (PUFs), which generate unique cryptographic keys from inherent hardware variations, offer a promising approach. However, traditional PUFs like Arbiter PUFs (APUFs) and XOR Arbiter PUFs (XOR-PUFs) are susceptible to machine learning (ML) and reliability-based attacks. In this study, we investigate Component-Differentially Challenged XOR-PUFs (CDC-XPUFs), a less explored variant, to address these vulnerabilities. We propose an optimized CDC-XPUF design that incorporates a pre-selection strategy to enhance reliability and introduces a novel lightweight architecture to reduce hardware overhead. Rigorous testing demonstrates that our design significantly lowers resource consumption, maintains strong resistance to ML attacks, and improves reliability, effectively mitigating reliability-based attacks. These results highlight the potential of CDC-XPUFs as a secure and efficient candidate for widespread deployment in resource-constrained IoT systems.
Intelligent Energy Management: Remaining Useful Life Prediction and Charging Automation System Comprised of Deep Learning and the Internet of Things
Authors: Biplov Paneru, Bishwash Paneru, DP Sharma Mainali
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2409.17931
Pdf link: https://arxiv.org/pdf/2409.17931
Abstract Remaining Useful Life (RUL) of battery is an important parameter to know the battery's remaining life and need for recharge. The goal of this research project is to develop machine learning-based models for the battery RUL dataset. Different ML models are developed to classify the RUL of the vehicle, and the IoT (Internet of Things) concept is simulated for automating the charging system and managing any faults aligning. The graphs plotted depict the relationship between various vehicle parameters using the Blynk IoT platform. Results show that the catboost, Multi-Layer Perceptron (MLP), Gated Recurrent Unit (GRU), and hybrid model developed could classify RUL into three classes with 99% more accuracy. The data is fed using the tkinter GUI for simulating artificial intelligence (AI)-based charging, and with a pyserial backend, data can be entered into the Esp-32 microcontroller for making charge discharge possible with the model's predictions. Also, with an IoT system, the charging can be disconnected, monitored, and analyzed for automation. The results show that an accuracy of 99% can be obtained on models MLP, catboost model and similar accuracy on GRU model can be obtained, and finally relay-based triggering can be made by prediction through the model used for automating the charging and energy-saving mechanism. By showcasing an exemplary Blynk platform-based monitoring and automation phenomenon, we further present innovative ways of monitoring parameters and automating the system.
Adaptive Stream Processing on Edge Devices through Active Inference
Authors: Boris Sedlak, Victor Casamayor Pujol, Andrea Morichetta, Praveen Kumar Donta, Schahram Dustdar
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2409.17937
Pdf link: https://arxiv.org/pdf/2409.17937
Abstract The current scenario of IoT is witnessing a constant increase on the volume of data, which is generated in constant stream, calling for novel architectural and logical solutions for processing it. Moving the data handling towards the edge of the computing spectrum guarantees better distribution of load and, in principle, lower latency and better privacy. However, managing such a structure is complex, especially when requirements, also referred to Service Level Objectives (SLOs), specified by applications' owners and infrastructure managers need to be ensured. Despite the rich number of proposals of Machine Learning (ML) based management solutions, researchers and practitioners yet struggle to guarantee long-term prediction and control, and accurate troubleshooting. Therefore, we present a novel ML paradigm based on Active Inference (AIF) -- a concept from neuroscience that describes how the brain constantly predicts and evaluates sensory information to decrease long-term surprise. We implement it and evaluate it in a heterogeneous real stream processing use case, where an AIF-based agent continuously optimizes the fulfillment of three SLOs for three autonomous driving services running on multiple devices. The agent used causal knowledge to gradually develop an understanding of how its actions are related to requirements fulfillment, and which configurations to favor. Through this approach, our agent requires up to thirty iterations to converge to the optimal solution, showing the capability of offering accurate results in a short amount of time. Furthermore, thanks to AIF and its causal structures, our method guarantees full transparency on the decision making, making the interpretation of the results and the troubleshooting effortless.
Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods
Authors: Richard Yue, John E. Ortega
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2409.17939
Pdf link: https://arxiv.org/pdf/2409.17939
Abstract Translation memories (TMs) are the backbone for professional translation tools called computer-aided translation (CAT) tools. In order to perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the desired segment to translate (s'). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s'. After locating two similar segments, the CAT tool will present parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that will automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original with the idea in mind that it will be the translation of s'. Most FMR techniques use machine translation as a way of "repairing" those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques that are based on machine learning approaches such as Word2Vec. BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, for some cases, better results than neural machine translation for translating anchored words from French to English.
MMDVS-LF: A Multi-Modal Dynamic-Vision-Sensor Line Following Dataset
Authors: Felix Resch, Mónika Farsang, Radu Grosu
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2409.18038
Pdf link: https://arxiv.org/pdf/2409.18038
Abstract Dynamic Vision Sensors (DVS), offer a unique advantage in control applications, due to their high temporal resolution, and asynchronous event-based data. Still, their adoption in machine learning algorithms remains limited. To address this gap, and promote the development of models that leverage the specific characteristics of DVS data, we introduce the Multi-Modal Dynamic-Vision-Sensor Line Following dataset (MMDVS-LF). This comprehensive dataset, is the first to integrate multiple sensor modalities, including DVS recordings, RGB video, odometry, and Inertial Measurement Unit (IMU) data, from a small-scale standardized vehicle. Additionally, the dataset includes eye-tracking and demographic data of drivers performing a Line Following task on a track. With its diverse range of data, MMDVS-LF opens new opportunities for developing deep learning algorithms, and conducting data science projects across various domains, supporting innovation in autonomous systems and control applications.
Next-Gen Software Engineering: AI-Assisted Big Models
Authors: Ina K. Schieferdecker
Subjects: Subjects: Software Engineering (cs.SE); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2409.18048
Pdf link: https://arxiv.org/pdf/2409.18048
Abstract The effectiveness of model-driven software engineering (MDSE) has been demonstrated in the context of complex software; however, it has not been widely adopted due to the requisite efforts associated with model development and maintenance, as well as the specific modelling competencies required for MDSE. Concurrently, artificial intelligence (AI) methods, particularly machine learning (ML) methods, have demonstrated considerable abilities when applied to the huge code bases accessible on open-source coding platforms. The so-called big code provides the basis for significant advances in empirical software engineering, as well as in the automation of coding processes and improvements in software quality with the use of AI. The objective of this paper is to facilitate a synthesis between these two significant domains of software engineering (SE), namely models and AI in SE. The paper provides an overview of the current status of AI-assisted software engineering. In light of the aforementioned considerations, a vision of AI-assisted Big Models in SE is put forth, with the aim of capitalising on the advantages inherent to both approaches in the context of software development. Finally, the new paradigm of pair modelling in MDSE is proposed.
Explaining Explaining
Authors: Sergei Nirenburg, Marjorie McShane, Kenneth W. Goodman, Sanjay Oruganti
Subjects: Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2409.18052
Pdf link: https://arxiv.org/pdf/2409.18052
Abstract Explanation is key to people having confidence in high-stakes AI systems. However, machine-learning-based systems - which account for almost all current AI - can't explain because they are usually black boxes. The explainable AI (XAI) movement hedges this problem by redefining "explanation". The human-centered explainable AI (HCXAI) movement identifies the explanation-oriented needs of users but can't fulfill them because of its commitment to machine learning. In order to achieve the kinds of explanations needed by real people operating in critical domains, we must rethink how to approach AI. We describe a hybrid approach to developing cognitive agents that uses a knowledge-based infrastructure supplemented by data obtained through machine learning when applicable. These agents will serve as assistants to humans who will bear ultimate responsibility for the decisions and actions of the human-robot team. We illustrate the explanatory potential of such agents using the under-the-hood panels of a demonstration system in which a team of simulated robots collaborates on a search task assigned by a human.

qiaoyuet / arxiv_daily

Showing new listings for Friday, 27 September 2024 #170

Keyword: differential privacy

Immersion and Invariance-based Coding for Privacy-Preserving Federated Learning

KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation

Fully Dynamic Graph Algorithms with Edge Differential Privacy

Slowly Scaling Per-Record Differential Privacy

Keyword: privacy

Immersion and Invariance-based Coding for Privacy-Preserving Federated Learning

SHEATH: Defending Horizontal Collaboration for Distributed CNNs against Adversarial Noise

Investigating Privacy Attacks in the Gray-Box Setting to Enhance Collaborative Learning Schemes

Blockchain-Enabled Variational Information Bottleneck for Data Extraction Based on Mutual Information in Internet of Vehicles

KIPPS: Knowledge infusion in Privacy Preserving Synthetic Data Generation

Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement

Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD

BioZero: An Efficient and Privacy-Preserving Decentralized Biometric Authentication Protocol on Open Blockchain

Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis

On the Implicit Relation Between Low-Rank Adaptation and Differential Privacy

ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition

Expanding Perspectives on Data Privacy: Insights from Rural Togo

Multimodal Banking Dataset: Understanding Client Needs through Event Sequences

Fully Dynamic Graph Algorithms with Edge Differential Privacy

AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure

MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks

Demystifying Privacy in 5G Stand Alone Networks

TADAR: Thermal Array-based Detection and Ranging for Privacy-Preserving Human Sensing

Privacy for Quantum Annealing. Attack on Spin Reversal Transformations in the case of cryptanalysis

Byzantine-Robust Aggregation for Securing Decentralized Federated Learning

Federated Learning under Attack: Improving Gradient Inversion for Batch of Images

Implementing a Nordic-Baltic Federated Health Data Network: a case report

Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection

Adaptive Stream Processing on Edge Devices through Active Inference

Slowly Scaling Per-Record Differential Privacy

Keyword: machine learning

Neural Network Architecture Search Enabled Wide-Deep Learning (NAS-WD) for Spatially Heterogenous Property Awared Chicken Woody Breast Classification and Hardness Regression

Collaborative Comic Generation: Integrating Visual Narrative Theories with AI Models for Enhanced Creativity

AAPM: Large Language Model Agent-based Asset Pricing Models

Model aggregation: minimizing empirical variance outperforms minimizing empirical error

Building Real-time Awareness of Out-of-distribution in Trajectory Prediction for Autonomous Vehicles

Investigating Privacy Attacks in the Gray-Box Setting to Enhance Collaborative Learning Schemes

Scalable quality control on processing of large diffusion-weighted and structural magnetic resonance imaging datasets

The poison of dimensionality

Non-asymptotic Convergence of Training Transformers for Next-token Prediction

The Technology of Outrage: Bias in Artificial Intelligence

data2lang2vec: Data Driven Typological Features Completion

AI Enabled Neutron Flux Measurement and Virtual Calibration in Boiling Water Reactors

Does Worst-Performing Agent Lead the Pack? Analyzing Agent Dynamics in Unified Distributed SGD

Functional Classification of Spiking Signal Data Using Artificial Intelligence Techniques: A Review

Expanding Perspectives on Data Privacy: Insights from Rural Togo

Provable Performance Guarantees of Copy Detection Patterns

Optimal Memorization Capacity of Transformers

Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets

Recent advances in interpretable machine learning using structure-based protein representations

Byzantine-Robust Aggregation for Securing Decentralized Federated Learning

Federated Learning under Attack: Improving Gradient Inversion for Batch of Images

Predicting the Stay Length of Patients in Hospitals using Convolutional Gated Recurrent Deep Learning Model

Bias Assessment and Data Drift Detection in Medical Image Analysis: A Survey

Sentiment Analysis of ML Projects: Bridging Emotional Intelligence and Code Quality

A multi-source data power load forecasting method using attention mechanism-based parallel cnn-gru

Designing Short-Stage CDC-XPUFs: Balancing Reliability, Cost, and Security in IoT Devices

Intelligent Energy Management: Remaining Useful Life Prediction and Charging Automation System Comprised of Deep Learning and the Internet of Things

Adaptive Stream Processing on Edge Devices through Active Inference

Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods

MMDVS-LF: A Multi-Modal Dynamic-Vision-Sensor Line Following Dataset

Next-Gen Software Engineering: AI-Assisted Big Models

Explaining Explaining