Keyword: privacy
Tangent differential privacy
Authors: Lexing Ying
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
Differential privacy is a framework for protecting the identity of individual data points in the decision-making process. In this note, we propose a new form of differential privacy called tangent differential privacy. Compared with the usual differential privacy that is defined uniformly across data distributions, tangent differential privacy is tailored towards a specific data distribution of interest. It also allows for general distribution distances such as total variation distance and Wasserstein distance. In the case of risk minimization, we show that entropic regularization guarantees tangent differential privacy under rather general conditions on the risk function.
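As background for the entropic-regularization claim, the following standard Gibbs variational identity (our notation, illustrative only; the paper's precise conditions on the risk function are not reproduced) shows how entropic regularization turns risk minimization over distributions into an exponentially damped posterior:

    % Entropically regularized risk minimization over distributions \rho on parameter space,
    % with reference measure \pi, regularization strength \lambda, and data distribution \mu:
    \min_{\rho}\;\mathbb{E}_{\theta\sim\rho}\bigl[R(\theta;\mu)\bigr] + \lambda\,\mathrm{KL}(\rho\,\|\,\pi)
    \qquad\Longrightarrow\qquad
    \rho^{\star}(\theta)\;\propto\;\pi(\theta)\,\exp\!\bigl(-R(\theta;\mu)/\lambda\bigr).

The output distribution depends on the data only through $R(\cdot;\mu)$ inside an exponential, so small perturbations of $\mu$ (in total variation or Wasserstein distance) change $\rho^{\star}$ in a controlled multiplicative way; this is the intuition for why entropic regularization can yield a privacy guarantee tailored to the distribution of interest.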
Contrastive explainable clustering with differential privacy
Abstract
This paper presents a novel approach in Explainable AI (XAI), integrating contrastive explanations with differential privacy in clustering methods. For several basic clustering problems, including $k$-median and $k$-means, we give efficient differentially private contrastive explanations that are essentially as informative as those obtainable without privacy. We define a contrastive explanation as the difference between the original clustering utility and the utility of a clustering with a specifically fixed centroid. In each contrastive scenario, we designate a specific data point as the fixed centroid position, enabling us to measure the impact of this constraint on clustering utility under differential privacy. Extensive experiments across various datasets show our method's effectiveness in providing meaningful explanations without significantly compromising data privacy or clustering utility. This underscores our contribution to privacy-aware machine learning, demonstrating that a balance between privacy and utility is achievable when explaining clustering tasks.
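To make the quantity concrete, here is a minimal non-private sketch (hypothetical names, no DP noise) of the contrastive explanation for $k$-means: the gap between the unconstrained clustering cost and the cost when one centroid is pinned to a designated point.

    import numpy as np

    def kmeans_cost(X, centers):
        # sum of squared distances of each point to its nearest center (the clustering "utility")
        return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1).sum()

    def lloyd(X, k, fixed=None, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            if fixed is not None:
                centers[0] = fixed            # pin one centroid to the designated data point
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(0)
        if fixed is not None:
            centers[0] = fixed
        return centers

    X = np.random.default_rng(1).normal(size=(500, 2))
    x_fix = X[0]                               # the data point designated as the fixed centroid
    gap = kmeans_cost(X, lloyd(X, 3, fixed=x_fix)) - kmeans_cost(X, lloyd(X, 3))
    print("contrastive explanation (utility difference):", gap)

The paper's method releases such differences under differential privacy; the private mechanism itself is not sketched here.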
LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model
Abstract
Large language models (LLMs), including both proprietary and open-source models, have showcased remarkable capabilities in addressing a wide range of downstream tasks. Nonetheless, when it comes to practical Chinese legal tasks, these models fail to meet the actual requirements. Proprietary models do not ensure data privacy for sensitive legal cases, while open-source models demonstrate unsatisfactory performance due to their lack of legal knowledge. To address this problem, we introduce LawGPT, the first open-source model specifically designed for Chinese legal applications. LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning. Specifically, we employ large-scale Chinese legal documents for legal-oriented pre-training to incorporate legal domain knowledge. To further improve the model's performance on downstream legal tasks, we create a knowledge-driven instruction dataset for legal supervised fine-tuning. Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model. Our code and resources are publicly available at this https URL and have received 5.7K stars on GitHub.
Marking the Pace: A Blockchain-Enhanced Privacy-Traceable Strategy for Federated Recommender Systems
Abstract
Federated recommender systems have been substantially enhanced through data sharing and continuous model updates, enabled by the pervasive connectivity and distributed computing capabilities of Internet of Things (IoT) devices. Given the sensitivity of IoT data, transparent data processing in data sharing and model updates is paramount. However, existing methods fall short in tracing the flow of shared data and the evolution of model updates. Consequently, data sharing is vulnerable to exploitation by malicious entities, raising significant data privacy concerns, while forgoing data sharing results in sub-optimal recommendations. To mitigate these concerns, we present LIBERATE, a privacy-traceable federated recommender system. We design a blockchain-based traceability mechanism, ensuring data privacy during data sharing and model updates. We further enhance privacy protection by incorporating local differential privacy in user-server communication. Extensive evaluations on a real-world dataset corroborate LIBERATE's ability to ensure data privacy during data sharing and model updates while maintaining efficiency and performance. Results underscore the blockchain-based traceability mechanism as a promising solution for privacy preservation in federated recommender systems.
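The abstract does not spell out LIBERATE's local-DP mechanism, so the sketch below only illustrates the generic pattern of clipping a client-side update and adding Laplace noise before it is sent to the server (the clipping bound and epsilon are placeholder values):

    import numpy as np

    def ldp_perturb(update, clip=1.0, epsilon=1.0, rng=np.random.default_rng()):
        # bound the L1 norm of the update so its sensitivity is at most 2 * clip,
        # then add per-coordinate Laplace noise calibrated to that sensitivity
        clipped = update * min(1.0, clip / max(np.linalg.norm(update, ord=1), 1e-12))
        noise = rng.laplace(scale=2.0 * clip / epsilon, size=update.shape)
        return clipped + noise

    client_update = np.random.default_rng(0).normal(size=16)   # e.g., an item-embedding gradient
    print(ldp_perturb(client_update))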
When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain
Authors: Lei Xu, Yulong Chen, Yuntian Chen, Longfeng Nie, Xuetao Wei, Liang Xue, Dongxiao Zhang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Applications (stat.AP)
Abstract
Machine learning models offer the capability to forecast future energy production or consumption and infer essential unknown variables from existing data. However, legal and policy constraints within specific energy sectors render the data sensitive, presenting technical hurdles in utilizing data from diverse sources. Therefore, we propose adopting a Swarm Learning (SL) scheme, which replaces the centralized server with a blockchain-based distributed network to address the security and privacy issues inherent in Federated Learning (FL)'s centralized architecture. Within this distributed Collaborative Learning framework, each participating organization governs nodes for inter-organizational communication. Devices from various organizations utilize smart contracts for parameter uploading and retrieval. The consensus mechanism ensures distributed consistency throughout the learning process and guarantees the transparency, trustworthiness, and immutability of on-chain parameters. The efficacy of the proposed framework is substantiated across three real-world energy series modeling scenarios, with superior performance compared to Local Learning approaches and enhanced data security and privacy over Centralized Learning and FL methods. Notably, as the data volume and the number of local epochs increase within a threshold, model performance improves and the variance of performance errors decreases, leading to more stable and reliable model outcomes.
Approximated Coded Computing: Towards Fast, Private and Secure Distributed Machine Learning
Authors: Houming Qiu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
In a large-scale distributed machine learning system, coded computing has attracted widespread attention since it can effectively alleviate the impact of stragglers. However, several emerging problems greatly limit the performance of coded distributed systems. First, the existence of colluding workers who share results with each other leads to serious privacy leakage. Second, few existing works consider security issues in the data transmission of distributed computing systems. Third, the number of results that must be waited for increases with the degree of the decoding functions. In this paper, we design a secure and private approximated coded distributed computing (SPACDC) scheme that deals with the above-mentioned problems simultaneously. Our SPACDC scheme guarantees data security during the transmission process using a new encryption algorithm based on elliptic curve cryptography. In particular, the SPACDC scheme does not impose strict constraints on the minimum number of results required to be waited for. An extensive performance analysis is conducted to demonstrate the effectiveness of our SPACDC scheme. Furthermore, we present a secure and private distributed learning algorithm based on the SPACDC scheme, which can provide information-theoretic privacy protection for training data. Our experiments show that the SPACDC-based deep learning algorithm achieves a significant speedup over the baseline approaches.
Black Box Differential Privacy Auditing Using Total Variation Distance
Abstract
We present a practical method to audit the differential privacy (DP) guarantees of a machine learning model using a small hold-out dataset that is not exposed to the model during training. Given a score function, such as the loss function employed during training, our method estimates the total variation (TV) distance between scores obtained on a subset of the training data and on the hold-out dataset. With some meta information about the underlying DP training algorithm, these TV distance values can be converted to $(\varepsilon,\delta)$-guarantees for any $\delta$. We show that these score distributions asymptotically give lower bounds for the DP guarantees of the underlying training algorithm; however, for practical reasons we perform a one-shot estimation. We specify conditions that lead to lower bounds for the DP guarantees with high probability. To estimate the TV distance between the score distributions, we use a simple density estimation method based on histograms. We show that the TV distance yields a nearly optimally robust estimator with an error rate of $\mathcal{O}(k^{-1/3})$, where $k$ is the total number of samples. Numerical experiments on benchmark datasets illustrate the effectiveness of our approach and show improvements over baseline methods for black-box auditing.
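A minimal sketch of the histogram-based TV-distance estimate between member and hold-out score samples is shown below (the bin count and score arrays are illustrative; converting the TV value to $(\varepsilon,\delta)$ lower bounds uses the paper's meta information about the training algorithm and is not reproduced here):

    import numpy as np

    def tv_histogram(scores_train, scores_holdout, bins=30):
        lo = min(scores_train.min(), scores_holdout.min())
        hi = max(scores_train.max(), scores_holdout.max())
        p, edges = np.histogram(scores_train, bins=bins, range=(lo, hi))
        q, _ = np.histogram(scores_holdout, bins=edges)
        p = p / p.sum()
        q = q / q.sum()
        return 0.5 * np.abs(p - q).sum()        # TV distance between the two binned densities

    rng = np.random.default_rng(0)
    print(tv_histogram(rng.normal(0.0, 1.0, 5000), rng.normal(0.3, 1.0, 5000)))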
FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models
Authors: Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Abstract
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has invested massive effort in diverse aspects, including frameworks, performance, and privacy. However, there are currently no realistic datasets or benchmarks for FedLLM; previous works all rely on artificially constructed datasets that fail to capture properties of real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., a user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., a user-annotated preference dataset) for federated preference alignment, with client counts ranging from 38 to 747. Our datasets incorporate several representative forms of diversity: language, quality, quantity, instruction, length, embedding, and preference, capturing properties of real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., on multilingual collaboration). We believe that FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at this https URL.
Perturb-and-Project: Differentially Private Similarities and Marginals
Abstract
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A\in \mathcal{S}$ and the result is then projected back to the space of admissible datasets $\mathcal{S}$. Through this framework, we first design novel efficient algorithms to privately release pair-wise cosine similarities. Second, we derive a novel algorithm to compute $k$-way marginal queries over $n$ features. Prior work could achieve comparable guarantees only for even $k$. Furthermore, we extend our results to $t$-sparse datasets, where our efficient algorithms yield novel, stronger guarantees whenever $t\le n^{5/6}/\log n\,.$ Finally, we provide a theoretical perspective on why \textit{fast} input perturbation algorithms work well in practice. The key technical ingredients behind our results are tight sum-of-squares certificates upper bounding the Gaussian complexity of sets of solutions.
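One concrete instance of the perturb-and-project template for cosine similarities, sketched under our own assumptions (the Gaussian noise scale and the choice of admissible set, here symmetric PSD matrices with unit diagonal, are illustrative rather than the paper's calibration):

    import numpy as np

    def project_admissible(S, iters=50):
        # alternating projections: PSD cone (eigenvalue clipping) and unit self-similarity
        for _ in range(iters):
            S = (S + S.T) / 2
            w, V = np.linalg.eigh(S)
            S = (V * np.clip(w, 0, None)) @ V.T
            np.fill_diagonal(S, 1.0)
        return S

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T                                  # exact pairwise cosine similarities
    S_noisy = S + rng.normal(scale=0.2, size=S.shape)
    S_private = project_admissible(S_noisy)      # perturb, then project back to admissible matrices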
Concept Drift Detection using Ensemble of Integrally Private Models
Authors: Ayush K. Varshney, Vicenc Torra
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
Deep neural networks (DNNs) are among the most widely used machine learning algorithms. DNNs require the training data to be available beforehand with true labels. This is not feasible for many real-world problems where data arrives in streaming form and the acquisition of true labels is scarce and expensive. In the literature, little attention has been given to the privacy aspects of streaming data, whose distribution may change frequently. These concept drifts must be detected privately in order to avoid any disclosure risk from DNNs. Existing privacy models use concept drift detection schemes such as ADWIN and KSWIN to detect the drifts. In this paper, we focus on the notion of integrally private DNNs to detect concept drift. Integrally private DNNs are models that recur frequently across different datasets. Based on this, we introduce an ensemble methodology, which we call the 'Integrally Private Drift Detection' (IPDD) method, to detect concept drift from private models. Our IPDD method does not require labels to detect drift but assumes true labels are available once drift has been detected. We have experimented with binary and multi-class synthetic and real-world data. Our experimental results show that our methodology can privately detect concept drift, achieves utility comparable to (and in some cases better than) ADWIN, and outperforms differentially private models at various privacy levels. The source code for the paper is available \hyperlink{this https URL}{here}.
Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation
Abstract
Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback to achieve human tutors' quality. While quality is an important performance criterion, it is not the only criterion to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference that allow running these models directly in the browser, thereby providing direct benefits across cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4 generated synthetic data. We showcase the efficacy of fine-tuned Llama3-8B and Phi3-3.8B 4-bit quantized models using WebLLM's in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.
Keyword: machine learning
Naming the Pain in Machine Learning-Enabled Systems Engineering
Authors: Marcos Kalinowski, Daniel Mendez, Görkem Giray, Antonio Pedro Santos Alves, Kelly Azevedo, Tatiana Escovedo, Hugo Villamizar, Helio Lopes, Teresa Baldassarre, Stefan Wagner, Stefan Biffl, Jürgen Musil, Michael Felderer, Niklas Lavesson, Tony Gorschek
Abstract
Context: Machine learning (ML)-enabled systems are being increasingly adopted by companies aiming to enhance their products and operational processes. Objective: This paper aims to deliver a comprehensive overview of the current status quo of engineering ML-enabled systems and lay the foundation to steer practically relevant and problem-driven academic research. Method: We conducted an international survey to collect insights from practitioners on the current practices and problems in engineering ML-enabled systems. We received 188 complete responses from 25 countries. We conducted quantitative statistical analyses on contemporary practices using bootstrapping with confidence intervals and qualitative analyses on the reported problems using open and axial coding procedures. Results: Our survey results reinforce and extend existing empirical evidence on engineering ML-enabled systems, providing additional insights into typical ML-enabled systems project contexts, the perceived relevance and complexity of ML life cycle phases, and current practices related to problem understanding, model deployment, and model monitoring. Furthermore, the qualitative analysis provides a detailed map of the problems practitioners face within each ML life cycle phase and the problems causing overall project failure. Conclusions: The results contribute to a better understanding of the status quo and problems in practical environments. We advocate for the further adaptation and dissemination of software engineering practices to enhance the engineering of ML-enabled systems.
Dynamic Online Recommendation for Two-Sided Market with Bayesian Incentive Compatibility
Authors: Yuantong Li, Guang Cheng, Xiaowu Dai
Subjects: Information Retrieval (cs.IR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract
Recommender systems play a crucial role in internet economies by connecting users with relevant products or services. However, designing effective recommender systems faces two key challenges: (1) the exploration-exploitation tradeoff in balancing new product exploration against exploiting known preferences, and (2) dynamic incentive compatibility in accounting for users' self-interested behaviors and heterogeneous preferences. This paper formalizes these challenges into a Dynamic Bayesian Incentive-Compatible Recommendation Protocol (DBICRP). To address the DBICRP, we propose a two-stage algorithm (RCB) that integrates incentivized exploration with an efficient offline learning component for exploitation. In the first stage, our algorithm explores available products while maintaining dynamic incentive compatibility to determine sufficient sample sizes. The second stage employs inverse proportional gap sampling integrated with an arbitrary machine learning method to ensure sublinear regret. Theoretically, we prove that RCB achieves $O(\sqrt{KdT})$ regret and satisfies Bayesian incentive compatibility (BIC) under a Gaussian prior assumption. Empirically, we validate RCB's strong incentive gain, sublinear regret, and robustness through simulations and a real-world application on personalized warfarin dosing. Our work provides a principled approach for incentive-aware recommendation in online preference learning settings.
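For the exploitation stage, the sketch below illustrates inverse-proportional-gap sampling in the style of inverse gap weighting from contextual bandits; the reward estimates and the exploration parameter gamma are placeholders, not the paper's constants:

    import numpy as np

    def inverse_gap_sampling(reward_estimates, gamma=10.0, rng=np.random.default_rng()):
        r = np.asarray(reward_estimates, dtype=float)
        k, best = len(r), int(r.argmax())
        probs = 1.0 / (k + gamma * (r[best] - r))   # a small gap to the best gets high probability
        probs[best] = 0.0
        probs[best] = 1.0 - probs.sum()             # remaining mass goes to the empirical best
        return rng.choice(k, p=probs)

    print(inverse_gap_sampling([0.42, 0.55, 0.31, 0.50]))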
On Regularization via Early Stopping for Least Squares Regression
Abstract
A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$, and a finite time horizon $T$, the early stopped solution $\beta_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
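For orientation, the standard spectral calculation below (our notation, starting from $\beta_0 = 0$; not the paper's exact statement) shows why the early-stopped iterate behaves like a generalized ridge solution:

    % Full-batch gradient descent on the least-squares objective:
    \beta_{k+1} \;=\; \beta_k - \eta_k\,\tfrac{1}{n}X^{\top}(X\beta_k - y).
    % In the eigenbasis of X^{\top}X/n with eigenvalues s_i, each component of the least-squares
    % solution is scaled after T steps by the filter factor
    f_i(T) \;=\; 1 - \prod_{k=1}^{T}\bigl(1 - \eta_k\,s_i\bigr),
    % which plays the same role as the ridge shrinkage s_i/(s_i + \lambda_i); matching the two per
    % component gives a generalized, spectrum-dependent ridge problem.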
Optimizing Autonomous Driving for Safety: A Human-Centric Approach with LLM-Enhanced RLHF
Authors: Yuan Sun, Navid Salami Pargoo, Peter J. Jin, Jorge Ortiz
Abstract
Reinforcement Learning from Human Feedback (RLHF) is popular in large language models (LLMs), whereas traditional Reinforcement Learning (RL) often falls short. Current autonomous driving methods typically utilize either human feedback in machine learning, including RL, or LLMs. Most feedback guides the car agent's learning process (e.g., controlling the car). RLHF is usually applied in the fine-tuning step, requiring direct human "preferences," which are not commonly used in optimizing autonomous driving models. In this research, we innovatively combine RLHF and LLMs to enhance autonomous driving safety. Training a model with human guidance from scratch is inefficient. Our framework starts with a pre-trained autonomous car agent model and implements multiple human-controlled agents, such as cars and pedestrians, to simulate real-life road environments. The autonomous car model is not directly controlled by humans. We integrate both physical and physiological feedback to fine-tune the model, optimizing this process using LLMs. This multi-agent interactive environment ensures safe, realistic interactions before real-world application. Finally, we will validate our model using data gathered from real-life testbeds located in New Jersey and New York City.
OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference
Abstract
Image classification is a fundamental building block for a majority of computer vision applications. With the growing popularity and capacity of machine learning models, people can easily access trained image classifiers as a service online or offline. However, model use comes with a cost and classifiers of higher capacity usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over image classification queries (termed as the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves 40% cost reduction with little to no accuracy drop.
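The classifier-assignment problem described above can be written as a small integer linear program; the sketch below uses SciPy's MILP interface with made-up accuracy estimates, costs, and budget (OCCAM's actual accuracy estimator is not shown):

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    rng = np.random.default_rng(0)
    acc = rng.uniform(0.6, 0.95, size=(30, 3))       # acc[i, j]: estimated accuracy of model j on query i
    costs = np.array([1.0, 3.0, 8.0])                # per-query inference cost of each model
    budget = 60.0

    Q, M = acc.shape
    c = -acc.ravel()                                 # milp minimizes, so negate accuracy
    one_model_per_query = LinearConstraint(np.kron(np.eye(Q), np.ones(M)), lb=1, ub=1)
    under_budget = LinearConstraint(np.tile(costs, Q)[None, :], lb=0, ub=budget)
    res = milp(c=c, constraints=[one_model_per_query, under_budget],
               integrality=np.ones(Q * M), bounds=Bounds(0, 1))
    assignment = res.x.reshape(Q, M).argmax(axis=1)  # chosen model for each query
    print(assignment, "expected accuracy:", acc[np.arange(Q), assignment].mean())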
Rare Class Prediction Model for Smart Industry in Semiconductor Manufacturing
Authors: Abdelrahman Farrag, Mohammed-Khalil Ghali, Yu Jin
Abstract
The evolution of industry has enabled the integration of physical and digital systems, facilitating the collection of extensive data on manufacturing processes. This integration provides a reliable solution for improving process quality and managing equipment health. However, data collected from real manufacturing processes often exhibit challenging properties, such as severe class imbalance, high rates of missing values, and noisy features, which hinder effective machine learning implementation. In this study, a rare class prediction approach is developed for in situ data collected from a smart semiconductor manufacturing process. The primary objective is to build a model that addresses issues of noise and class imbalance, enhancing class separation. The developed approach demonstrated promising results compared to existing literature, enabling the prediction of new observations that can inform future maintenance plans and production quality. The model was evaluated using various performance metrics, with ROC curves showing an AUC of 0.95, a precision of 0.66, and a recall of 0.96.
GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks
Abstract
Graph Neural Networks (GNNs) have proven highly effective in various machine learning (ML) tasks involving graphs, such as node/graph classification and link prediction. However, explaining the decisions made by GNNs poses challenges because of the aggregated relational information based on graph structure, leading to complex data transformations. Existing methods for explaining GNNs often face limitations in systematically exploring diverse substructures and evaluating results in the absence of ground truths. To address this gap, we introduce GNNAnatomy, a model- and dataset-agnostic visual analytics system designed to facilitate the generation and evaluation of multi-level explanations for GNNs. In GNNAnatomy, we employ graphlets to elucidate GNN behavior in graph-level classification tasks. By analyzing the associations between GNN classifications and graphlet frequencies, we formulate hypothesized factual and counterfactual explanations. To validate a hypothesized graphlet explanation, we introduce two metrics: (1) the correlation between its frequency and the classification confidence, and (2) the change in classification confidence after removing this substructure from the original graph. To demonstrate the effectiveness of GNNAnatomy, we conduct case studies on both real-world and synthetic graph datasets from various domains. Additionally, we qualitatively compare GNNAnatomy with a state-of-the-art GNN explainer, demonstrating the utility and versatility of our design.
On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization
Authors: Motahareh Sohrabi, Juan Ramirez, Tianyue H. Zhang, Simon Lacoste-Julien, Jose Gallego-Posada
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract
Constrained optimization offers a powerful framework to prescribe desired behaviors in neural network models. Typically, constrained problems are solved via their min-max Lagrangian formulations, which exhibit unstable oscillatory dynamics when optimized using gradient descent-ascent. The adoption of constrained optimization techniques in the machine learning community is currently limited by the lack of reliable, general-purpose update schemes for the Lagrange multipliers. This paper proposes the $\nu$PI algorithm and contributes an optimization perspective on Lagrange multiplier updates based on PI controllers, extending the work of Stooke, Achiam and Abbeel (2020). We provide theoretical and empirical insights explaining the inability of momentum methods to address the shortcomings of gradient descent-ascent, and contrast this with the empirical success of our proposed $\nu$PI controller. Moreover, we prove that $\nu$PI generalizes popular momentum methods for single-objective minimization. Our experiments demonstrate that $\nu$PI reliably stabilizes the multiplier dynamics and its hyperparameters enjoy robust and predictable behavior.
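As a minimal illustration of the underlying idea, the sketch below applies a plain PI controller to a Lagrange multiplier on a toy constrained problem (minimize $x^2$ subject to $x \ge 1$); the gains and step size are arbitrary, and this is not the $\nu$PI algorithm itself:

    x, lam, integral = 0.0, 0.0, 0.0
    eta_x, kp, ki = 0.1, 1.0, 0.5
    for _ in range(200):
        x -= eta_x * (2.0 * x - lam)                 # primal step on L(x, lam) = x**2 + lam * (1 - x)
        err = 1.0 - x                                # constraint violation g(x) = 1 - x <= 0
        integral += err
        lam = max(0.0, kp * err + ki * integral)     # PI update of the multiplier, clipped at zero
    print(x, lam)                                    # x approaches 1.0 and lam approaches 2.0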
A Unified View of Group Fairness Tradeoffs Using Partial Information Decomposition
Authors: Faisal Hamman, Sanghamitra Dutta
Subjects: Information Theory (cs.IT); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract
This paper introduces a novel information-theoretic perspective on the relationship between prominent group fairness notions in machine learning, namely statistical parity, equalized odds, and predictive parity. It is well known that simultaneous satisfiability of these three fairness notions is usually impossible, motivating practitioners to resort to approximate fairness solutions rather than stringent satisfiability of these definitions. However, a comprehensive analysis of their interrelations, particularly when they are not exactly satisfied, remains largely unexplored. Our main contribution lies in elucidating an exact relationship between these three measures of (un)fairness by leveraging a body of work in information theory called partial information decomposition (PID). In this work, we leverage PID to identify the granular regions where these three measures of (un)fairness overlap and where they disagree with each other leading to potential tradeoffs. We also include numerical simulations to complement our results.
Contrastive explainable clustering with differential privacy
Abstract
This paper presents a novel approach in Explainable AI (XAI), integrating contrastive explanations with differential privacy in clustering methods. For several basic clustering problems, including $k$-median and $k$-means, we give efficient differentially private contrastive explanations that are essentially as informative as those obtainable without privacy. We define a contrastive explanation as the difference between the original clustering utility and the utility of a clustering with a specifically fixed centroid. In each contrastive scenario, we designate a specific data point as the fixed centroid position, enabling us to measure the impact of this constraint on clustering utility under differential privacy. Extensive experiments across various datasets show our method's effectiveness in providing meaningful explanations without significantly compromising data privacy or clustering utility. This underscores our contribution to privacy-aware machine learning, demonstrating that a balance between privacy and utility is achievable when explaining clustering tasks.
CTSyn: A Foundational Model for Cross Tabular Data Generation
Authors: Xiaofeng Lin, Chenheng Xu, Matthew Yang, Guang Cheng
Abstract
Generative Foundation Models (GFMs) have produced synthetic data with remarkable quality in modalities such as images and text. However, applying GFMs to tabular data poses significant challenges due to the inherent heterogeneity of table features. Existing cross-table learning frameworks are hindered by the absence of both a generative model backbone and a decoding mechanism for heterogeneous feature values. To overcome these limitations, we introduce the Cross-Table Synthesizer (CTSyn), a diffusion-based foundational model tailored for tabular data generation. CTSyn introduces three major components: an aggregator that consolidates heterogeneous tables into a unified latent space; a conditional latent diffusion model for sampling from this space; and type-specific decoders that reconstruct values of varied data types from sampled latent vectors. Extensive testing on real-world datasets reveals that CTSyn not only significantly outperforms existing table synthesizers in utility and diversity, but also uniquely enhances performances of downstream machine learning beyond what is achievable with real data, thus establishing a new paradigm for synthetic data generation.
Advanced Payment Security System:XGBoost, CatBoost and SMOTE Integrated
Authors: Qi Zheng, Chang Yu, Jin Cao, Yongshun Xu, Qianwen Xing, Yinxin Jin
Abstract
With the rise of various online and mobile payment systems, transaction fraud has become a significant threat to financial security. This study explores the application of advanced machine learning models, specifically XGBoost and LightGBM, for developing a more accurate and robust payment security protection system. To enhance data reliability, we meticulously processed the data sources and used SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve data representation. By selecting highly correlated features, we aimed to strengthen the training process and boost model performance. We conducted thorough performance evaluations of our proposed models, comparing them against traditional methods including Random Forest, Neural Network, and Logistic Regression. Key metrics such as Precision, Recall, and F1 Score were used to rigorously assess their effectiveness. Our detailed analyses and comparisons reveal that the combination of SMOTE with XGBoost and LightGBM offers a highly efficient and powerful mechanism for payment security protection. The results show that these models not only outperform traditional approaches but also hold significant promise for advancing the field of transaction fraud prevention.
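A compact sketch of the SMOTE-plus-gradient-boosting pipeline described above, using scikit-learn, imbalanced-learn, and XGBoost (the synthetic dataset and hyperparameters are placeholders, not the study's actual data or tuning):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, recall_score, f1_score
    from imblearn.over_sampling import SMOTE
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=20000, n_features=20, weights=[0.98, 0.02], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # oversample the rare (fraud) class
    model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss")
    model.fit(X_bal, y_bal)

    pred = model.predict(X_te)
    print(precision_score(y_te, pred), recall_score(y_te, pred), f1_score(y_te, pred))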
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
Abstract
Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will draw attention to this pragmatic, yet relatively under-explored research area.
ConDiff: A Challenging Dataset for Neural Solvers of Partial Differential Equations
Authors: Vladislav Trifonov, Alexander Rudikov, Oleg Iliev, Ivan Oseledets, Ekaterina Muravleva
Abstract
We present ConDiff, a novel dataset for scientific machine learning. ConDiff focuses on the diffusion equation with varying coefficients, a fundamental problem in many applications of parametric partial differential equations (PDEs). The main novelty of the proposed dataset is that we consider discontinuous coefficients with high contrast. These coefficient functions are sampled from a selected set of distributions. This class of problems is not only of great academic interest, but is also the basis for describing various environmental and industrial problems. In this way, ConDiff shortens the gap with real-world problems while remaining fully synthetic and easy to use. ConDiff consists of a diverse set of diffusion equations with coefficients covering a wide range of contrast levels and heterogeneity with a measurable complexity metric for clearer comparison between different coefficient functions. We baseline ConDiff on standard deep learning models in the field of scientific machine learning. By providing a large number of problem instances, each with its own coefficient function and right-hand side, we hope to encourage the development of novel physics-based deep learning approaches, such as neural operators and physics-informed neural networks, ultimately driving progress towards more accurate and efficient solutions of complex PDE problems.
Unsupervised representation learning with Hebbian synaptic and structural plasticity in brain-like feedforward neural networks
Authors: Naresh Ravichandran, Anders Lansner, Pawel Herman
Subjects: Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Abstract
Neural networks that can capture key principles underlying brain computation offer exciting new opportunities for developing artificial intelligence and brain-like computing algorithms. Such networks remain biologically plausible while leveraging localized forms of synaptic learning rules and modular network architecture found in the neocortex. Compared to backprop-driven deep learning approaches, they provide more suitable models for deploying on neuromorphic hardware and have greater potential for scalability on large-scale computing clusters. The development of such brain-like neural networks depends on having a learning procedure that can build effective internal representations from data. In this work, we introduce and evaluate a brain-like neural network model capable of unsupervised representation learning. It builds on the Bayesian Confidence Propagation Neural Network (BCPNN), which has earlier been implemented as abstract as well as biophysically detailed recurrent attractor neural networks explaining various cortical associative memory phenomena. Here we developed a feedforward BCPNN model to perform representation learning by incorporating a range of brain-like attributes derived from neocortical circuits such as cortical columns, divisive normalization, Hebbian synaptic plasticity, structural plasticity, sparse activity, and sparse patchy connectivity. The model was tested on a diverse set of popular machine learning benchmarks: grayscale images (MNIST, Fashion-MNIST), RGB natural images (SVHN, CIFAR-10), QSAR (MUV, HIV), and malware detection (EMBER). The performance of the model when using a linear classifier to predict the class labels fared competitively with conventional multi-layer perceptrons and other state-of-the-art brain-like neural networks.
When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain
Authors: Lei Xu, Yulong Chen, Yuntian Chen, Longfeng Nie, Xuetao Wei, Liang Xue, Dongxiao Zhang
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Applications (stat.AP)
Abstract
Machine learning models offer the capability to forecast future energy production or consumption and infer essential unknown variables from existing data. However, legal and policy constraints within specific energy sectors render the data sensitive, presenting technical hurdles in utilizing data from diverse sources. Therefore, we propose adopting a Swarm Learning (SL) scheme, which replaces the centralized server with a blockchain-based distributed network to address the security and privacy issues inherent in Federated Learning (FL)'s centralized architecture. Within this distributed Collaborative Learning framework, each participating organization governs nodes for inter-organizational communication. Devices from various organizations utilize smart contracts for parameter uploading and retrieval. The consensus mechanism ensures distributed consistency throughout the learning process and guarantees the transparency, trustworthiness, and immutability of on-chain parameters. The efficacy of the proposed framework is substantiated across three real-world energy series modeling scenarios, with superior performance compared to Local Learning approaches and enhanced data security and privacy over Centralized Learning and FL methods. Notably, as the data volume and the number of local epochs increase within a threshold, model performance improves and the variance of performance errors decreases, leading to more stable and reliable model outcomes.
Approximated Coded Computing: Towards Fast, Private and Secure Distributed Machine Learning
Authors: Houming Qiu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
In a large-scale distributed machine learning system, coded computing has attracted widespread attention since it can effectively alleviate the impact of stragglers. However, several emerging problems greatly limit the performance of coded distributed systems. First, the existence of colluding workers who share results with each other leads to serious privacy leakage. Second, few existing works consider security issues in the data transmission of distributed computing systems. Third, the number of results that must be waited for increases with the degree of the decoding functions. In this paper, we design a secure and private approximated coded distributed computing (SPACDC) scheme that deals with the above-mentioned problems simultaneously. Our SPACDC scheme guarantees data security during the transmission process using a new encryption algorithm based on elliptic curve cryptography. In particular, the SPACDC scheme does not impose strict constraints on the minimum number of results required to be waited for. An extensive performance analysis is conducted to demonstrate the effectiveness of our SPACDC scheme. Furthermore, we present a secure and private distributed learning algorithm based on the SPACDC scheme, which can provide information-theoretic privacy protection for training data. Our experiments show that the SPACDC-based deep learning algorithm achieves a significant speedup over the baseline approaches.
Probabilistic Weather Forecasting with Hierarchical Graph Neural Networks
Authors: Joel Oskarsson, Tomas Landelius, Marc Peter Deisenroth, Fredrik Lindsten
Abstract
In recent years, machine learning has established itself as a powerful tool for high-resolution weather forecasting. While most current machine learning models focus on deterministic forecasts, accurately capturing the uncertainty in the chaotic weather system calls for probabilistic modeling. We propose a probabilistic weather forecasting model called Graph-EFM, combining a flexible latent-variable formulation with the successful graph-based forecasting framework. The use of a hierarchical graph construction allows for efficient sampling of spatially coherent forecasts. Requiring only a single forward pass per time step, Graph-EFM allows for fast generation of arbitrarily large ensembles. We experiment with the model on both global and limited area forecasting. Ensemble forecasts from Graph-EFM achieve equivalent or lower errors than comparable deterministic models, with the added benefit of accurately capturing forecast uncertainty.
GENIE: Watermarking Graph Neural Networks for Link Prediction
Abstract
Graph Neural Networks (GNNs) have advanced the field of machine learning by utilizing graph-structured data, which is ubiquitous in the real world. GNNs have applications in various fields, ranging from social network analysis to drug discovery. GNN training is strenuous, requiring significant computational resources and human expertise. This makes a trained GNN an indispensable Intellectual Property (IP) for its owner. Recent studies have shown GNNs to be vulnerable to model-stealing attacks, which raises concerns over IP rights protection. Watermarking has been shown to be effective at protecting the IP of a GNN model. Existing efforts to develop a watermarking scheme for GNNs have only focused on the node classification and the graph classification tasks. In this work, we introduce what is, to the best of our knowledge, the first watermarking scheme for GNNs tailored to the Link Prediction (LP) task. We call our proposed watermarking scheme GENIE (watermarking Graph nEural Networks for lInk prEdiction). We design GENIE using a novel backdoor attack to create a trigger set for two key methods of LP: (1) node representation-based and (2) subgraph-based. In GENIE, the watermark is embedded into the GNN model by training it on both the trigger set and a modified training set, resulting in a watermarked GNN model. To assess a suspect model, we verify the watermark against the trigger set. We extensively evaluate GENIE across 3 model architectures (i.e., SEAL, GCN, and GraphSAGE) and 7 real-world datasets. Furthermore, we validate the robustness of GENIE against 11 state-of-the-art watermark removal techniques and 3 model extraction attacks. We also demonstrate that GENIE is robust against ownership piracy attacks. Our ownership demonstration scheme statistically guarantees both False Positive Rate (FPR) and False Negative Rate (FNR) to be less than $10^{-6}$.
Black Box Differential Privacy Auditing Using Total Variation Distance
Abstract
We present a practical method to audit the differential privacy (DP) guarantees of a machine learning model using a small hold-out dataset that is not exposed to the model during training. Given a score function, such as the loss function employed during training, our method estimates the total variation (TV) distance between scores obtained on a subset of the training data and on the hold-out dataset. With some meta information about the underlying DP training algorithm, these TV distance values can be converted to $(\varepsilon,\delta)$-guarantees for any $\delta$. We show that these score distributions asymptotically give lower bounds for the DP guarantees of the underlying training algorithm; however, for practical reasons we perform a one-shot estimation. We specify conditions that lead to lower bounds for the DP guarantees with high probability. To estimate the TV distance between the score distributions, we use a simple density estimation method based on histograms. We show that the TV distance yields a nearly optimally robust estimator with an error rate of $\mathcal{O}(k^{-1/3})$, where $k$ is the total number of samples. Numerical experiments on benchmark datasets illustrate the effectiveness of our approach and show improvements over baseline methods for black-box auditing.
Diversified Batch Selection for Training Acceleration
Authors: Feng Hong, Yueming Lyu, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang
Abstract
The remarkable success of modern machine learning models on large datasets often demands extensive training time and resource consumption. To save cost, a prevalent research line, known as online batch selection, explores selecting informative subsets during the training process. Although recent efforts achieve advancements by measuring the impact of each sample on generalization, their reliance on additional reference models inherently limits their practical application when no such ideal models are available. On the other hand, vanilla reference-model-free methods score and select data independently in a sample-wise manner, which sacrifices diversity and induces redundancy. To tackle this dilemma, we propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples. Specifically, we define a novel selection objective that measures the group-wise orthogonalized representativeness to combat the redundancy issue of previous sample-wise criteria, and provide a principled selection-efficient realization. Extensive experiments across various tasks demonstrate the significant superiority of DivBS in the performance-speedup trade-off. The code is publicly available.
Beyond Data, Towards Sustainability: A Sydney Case Study on Urban Digital Twins
Authors: Ammar Sohail, Bojie Shen, Muhammad Aamir Cheema, Mohammed Eunus Ali, Anwaar Ulhaq, Muhammad Ali Babar, Asama Qureshi
Abstract
As urban areas grapple with unprecedented challenges stemming from population growth and climate change, the emergence of urban digital twins offers a promising solution. This paper presents a case study focusing on Sydney's urban digital twin, a virtual replica integrating diverse real-time and historical data, including weather, crime, emissions, and traffic. Through advanced visualization and data analysis techniques, the study explores some applications of this digital twin in urban sustainability, such as spatial ranking of suburbs and automatic identification of correlations between variables. Additionally, the research delves into predictive modeling, employing machine learning to forecast traffic crash risks using environmental data, showcasing the potential for proactive interventions. The contributions of this work lie in the comprehensive exploration of a city-scale digital twin for sustainable urban planning, offering a multifaceted approach to data-driven decision-making.
Concept Drift Detection using Ensemble of Integrally Private Models
Authors: Ayush K. Varshney, Vicenc Torra
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
Deep neural networks (DNNs) are among the most widely used machine learning algorithms. DNNs require the training data to be available beforehand with true labels. This is not feasible for many real-world problems where data arrives in streaming form and the acquisition of true labels is scarce and expensive. In the literature, little attention has been given to the privacy aspects of streaming data, whose distribution may change frequently. These concept drifts must be detected privately in order to avoid any disclosure risk from DNNs. Existing privacy models use concept drift detection schemes such as ADWIN and KSWIN to detect the drifts. In this paper, we focus on the notion of integrally private DNNs to detect concept drift. Integrally private DNNs are models that recur frequently across different datasets. Based on this, we introduce an ensemble methodology, which we call the 'Integrally Private Drift Detection' (IPDD) method, to detect concept drift from private models. Our IPDD method does not require labels to detect drift but assumes true labels are available once drift has been detected. We have experimented with binary and multi-class synthetic and real-world data. Our experimental results show that our methodology can privately detect concept drift, achieves utility comparable to (and in some cases better than) ADWIN, and outperforms differentially private models at various privacy levels. The source code for the paper is available \hyperlink{this https URL}{here}.
AGBD: A Global-scale Biomass Dataset
Authors: Ghjulia Sialelli, Torben Peters, Jan D. Wegner, Konrad Schindler
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
Accurate estimates of Above Ground Biomass (AGB) are essential in addressing two of humanity's biggest challenges, climate change and biodiversity loss. Existing datasets for AGB estimation from satellite imagery are limited. Either they focus on specific, local regions at high resolution, or they offer global coverage at low resolution. There is a need for a machine learning-ready, globally representative, high-resolution benchmark. Our findings indicate significant variability in biomass estimates across different vegetation types, emphasizing the necessity for a dataset that accurately captures global diversity. To address these gaps, we introduce a comprehensive new dataset that is globally distributed, covers a range of vegetation types, and spans several years. This dataset combines AGB reference data from the GEDI mission with data from Sentinel-2 and PALSAR-2 imagery. Additionally, it includes pre-processed high-level features such as a dense canopy height map, an elevation map, and a land-cover classification map. We also produce a dense, high-resolution (10m) map of AGB predictions for the entire area covered by the dataset. Rigorously tested, our dataset is accompanied by several benchmark models and is publicly available. It can be easily accessed using a single line of code, offering a solid basis for efforts towards global AGB estimation. The GitHub repository this http URL serves as a one-stop shop for all code and data.
CarbonSense: A Multimodal Dataset and Baseline for Carbon Flux Modelling
Authors: Matthew Fortier, Mats L. Richter, Oliver Sonnentag, Chris Pal
Abstract
Terrestrial carbon fluxes provide vital information about our biosphere's health and its capacity to absorb anthropogenic CO$_2$ emissions. The importance of predicting carbon fluxes has led to the emerging field of data-driven carbon flux modelling (DDCFM), which uses statistical techniques to predict carbon fluxes from biophysical data. However, the field lacks a standardized dataset to promote comparisons between models. To address this gap, we present CarbonSense, the first machine learning-ready dataset for DDCFM. CarbonSense integrates measured carbon fluxes, meteorological predictors, and satellite imagery from 385 locations across the globe, offering comprehensive coverage and facilitating robust model training. Additionally, we provide a baseline model using a current state-of-the-art DDCFM approach and a novel transformer-based model. Our experiments illustrate the potential gains that multimodal deep learning techniques can bring to this domain. By providing these resources, we aim to lower the barrier to entry for other deep learning researchers to develop new models and drive new advances in carbon flux modelling.
ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks
Authors: Feiyang Wang, Xingquan Zuo, Hai Huang, Gang Chen
Abstract
Many machine learning models are susceptible to adversarial attacks, with decision-based black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision boundaries identified through query-intensive exact search, which significantly limits the attack success rate. This paper introduces a novel approach using the Approximation Decision Boundary (ADB) to efficiently and accurately compare perturbation directions without precisely determining decision boundaries. The effectiveness of our ADB approach (ADBA) hinges on promptly identifying a suitable ADB, ensuring reliable differentiation of all perturbation directions. For this purpose, we analyze the probability distribution of decision boundaries, confirming that using the distribution's median value as the ADB can effectively distinguish different perturbation directions, giving rise to the ADBA-md algorithm. ADBA-md requires only four queries on average to differentiate any pair of perturbation directions, which is highly query-efficient. Extensive experiments on six well-known image classifiers clearly demonstrate the superiority of ADBA and ADBA-md over multiple state-of-the-art black-box attacks.
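As a rough illustration of comparing directions at an approximate decision boundary rather than binary-searching the exact boundary along each direction, the sketch below issues a single hard-label query per direction at a candidate radius r. The toy classifier, the radius, and the tie handling are assumptions for illustration, not the ADBA-md procedure itself.

# Illustrative sketch: compare two perturbation directions by querying the
# hard-label model once per direction at an approximate boundary radius r.
import numpy as np

def hard_label(x):
    # Toy hard-label "classifier": class 1 inside the ellipse x0^2/4 + x1^2 < 1.
    return int(x[0] ** 2 / 4 + x[1] ** 2 < 1.0)

def is_adversarial_at(x, direction, r, source_label):
    d = direction / np.linalg.norm(direction)
    return hard_label(x + r * d) != source_label   # a single query

def better_direction(x, d1, d2, r, source_label):
    # A direction that already flips the label at radius r has a closer
    # boundary and is preferred; ADBA-md would choose r from the median of
    # an estimated boundary-distance distribution and refine it on ties.
    a1 = is_adversarial_at(x, d1, r, source_label)
    a2 = is_adversarial_at(x, d2, r, source_label)
    if a1 == a2:
        return None  # tie at this radius; a refined radius would be needed
    return d1 if a1 else d2

x = np.array([0.0, 0.0])     # source point, label 1
d1 = np.array([1.0, 0.0])    # boundary along d1 is at distance 2.0
d2 = np.array([3.0, 4.0])    # boundary along d2 is at distance ~1.17
print(better_direction(x, d1, d2, r=1.5, source_label=hard_label(x)))  # -> d2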
Designs for Enabling Collaboration in Human-Machine Teaming via Interactive and Explainable Systems
Authors: Rohan Paleja, Michael Munje, Kimberlee Chang, Reed Jensen, Matthew Gombolay
Abstract
Collaborative robots and machine learning-based virtual agents are increasingly entering the human workspace with the aim of increasing productivity and enhancing safety. Despite this, we show in a ubiquitous experimental domain, Overcooked-AI, that state-of-the-art techniques for human-machine teaming (HMT), which rely on imitation or reinforcement learning, are brittle and result in a machine agent that attempts to decouple its actions from the human's and act independently rather than synergistically. To remedy this deficiency, we develop HMT approaches that enable iterative, mixed-initiative team development, allowing end-users to interactively reprogram interpretable AI teammates. Our 50-subject study provides several findings that we summarize into guidelines. While all approaches underperform a simple collaborative heuristic (a critical, negative result for learning-based methods), we find that white-box approaches supported by interactive modification can lead to significant team development, outperforming white-box approaches alone; black-box approaches, meanwhile, are easier to train and result in better HMT performance, highlighting a tradeoff between explainability and interactivity on the one hand and ease of training on the other. Together, these findings suggest three important directions: 1) improving the ability to generate collaborative agents with white-box models, 2) better learning methods that facilitate collaboration rather than individualized coordination, and 3) mixed-initiative interfaces that enable users, who may vary in ability, to improve collaboration.
Scaling up Probabilistic PDE Simulators with Structured Volumetric Information
Authors: Tim Weiland, Marvin Pförtner, Philipp Hennig
Abstract
Modeling real-world problems with partial differential equations (PDEs) is a prominent topic in scientific machine learning. Classic solvers for this task continue to play a central role, e.g. to generate training data for deep learning analogues. Any such numerical solution is subject to multiple sources of uncertainty, both from limited computational resources and limited data (including unknown parameters). Gaussian process analogues to classic PDE simulation methods have recently emerged as a framework to construct fully probabilistic estimates of all these types of uncertainty. So far, much of this work focused on theoretical foundations, and as such is not particularly data efficient or scalable. Here we propose a framework combining a discretization scheme based on the popular Finite Volume Method with complementary numerical linear algebra techniques. Practical experiments, including a spatiotemporal tsunami simulation, demonstrate substantially improved scaling behavior of this approach over previous collocation-based techniques.
GANetic Loss for Generative Adversarial Networks with a Focus on Medical Applications
Authors: Shakhnaz Akhmedova, Nils Körber
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Generative adversarial networks (GANs) are machine learning models used to estimate the underlying statistical structure of a given dataset and can consequently be used for a variety of tasks such as image generation or anomaly detection. Despite their apparent simplicity, designing an effective loss function for training GANs remains challenging, and various loss functions have been proposed to improve the performance and stability of generative models. In this study, loss function design for GANs is framed as an optimization problem solved using a genetic programming (GP) approach. Initial experiments were carried out using a small Deep Convolutional GAN (DCGAN) model and the MNIST dataset in order to search experimentally for an improved loss function. The functions found were evaluated on CIFAR10, with the best function, named GANetic loss, showing substantially better performance and stability than the losses commonly used for GAN training. To further evaluate its general applicability on more challenging problems, GANetic loss was applied to two medical applications: image generation and anomaly detection. Experiments were performed with histopathological, gastrointestinal, and glaucoma images to evaluate GANetic loss in medical image generation, resulting in improved image quality compared to the baseline models. Applied to polyp and glaucoma images, GANetic loss showed a strong improvement in the detection of anomalies. In summary, the GANetic loss function was evaluated on multiple datasets and applications, where it consistently outperforms alternative loss functions. Moreover, GANetic loss leads to stable training and reproducible results, a known weak spot of GANs.
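The GANetic formula itself is not given in the abstract, so the sketch below only illustrates the genetic-programming machinery: candidate generator losses are represented as small expression trees over the discriminator score and evolved by selection and mutation. The primitive set and the placeholder fitness (which stands in for a short GAN training run and image-quality evaluation) are assumptions.

# Illustrative GP-style search over generator-loss expressions (a sketch,
# not the paper's setup). Terminal "d" stands for the discriminator score
# D(G(z)) on fake samples.
import random
import numpy as np

PRIMITIVES = {
    "neg": lambda a: -a,
    "log": lambda a: np.log(np.clip(a, 1e-6, None)),
    "sq":  lambda a: a * a,
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}
ARITY = {"neg": 1, "log": 1, "sq": 1, "add": 2, "mul": 2}

def random_tree(depth=2):
    if depth == 0 or random.random() < 0.3:
        return "d"
    op = random.choice(list(PRIMITIVES))
    return (op, *[random_tree(depth - 1) for _ in range(ARITY[op])])

def evaluate(tree, d):
    if tree == "d":
        return d
    op, *args = tree
    return PRIMITIVES[op](*[evaluate(a, d) for a in args])

def fitness(tree):
    # Placeholder fitness: prefer losses that decrease as fake samples fool
    # the discriminator (d -> 1). The paper would instead train a small
    # DCGAN with the candidate loss and score the generated images.
    d = np.linspace(0.05, 0.95, 50)
    loss = evaluate(tree, d)
    if np.std(loss) < 1e-12:
        return -np.inf
    return float(-np.corrcoef(d, loss)[0, 1])

def mutate(tree):
    # Crude mutation: with probability 0.5 replace the tree entirely.
    return random_tree() if random.random() < 0.5 else tree

population = [random_tree() for _ in range(20)]
for generation in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(random.choice(parents)) for _ in range(10)]
print("best loss expression:", max(population, key=fitness))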
Optimizing Automatic Differentiation with Deep Reinforcement Learning
Abstract
Computing Jacobians with automatic differentiation is ubiquitous in many scientific domains such as machine learning, computational fluid dynamics, robotics and finance. Even small savings in the number of computations or in memory usage during Jacobian computations can translate into massive savings in energy consumption and runtime. While many methods exist that allow for such savings, they generally trade computational efficiency for approximations of the exact Jacobian. In this paper, we present a novel method to optimize the number of multiplications necessary for Jacobian computation by leveraging deep reinforcement learning (RL) and a concept called cross-country elimination, while still computing the exact Jacobian. Cross-country elimination is a framework for automatic differentiation that phrases Jacobian accumulation as the ordered elimination of all vertices on the computational graph, where every elimination incurs a certain computational cost. We formulate the search for the optimal elimination order that minimizes the number of necessary multiplications as a single-player game played by an RL agent. We demonstrate that this method achieves up to 33% improvements over state-of-the-art methods on several relevant tasks taken from diverse domains. Furthermore, we show that these theoretical gains translate into actual runtime improvements by providing a cross-country elimination interpreter in JAX that can efficiently execute the obtained elimination orders.
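A minimal sketch of the underlying cost model: on a small linearized computational graph, eliminating an intermediate vertex fuses each in-edge with each out-edge at the cost of one multiplication per pair, so the total multiplication count depends on the elimination order, which is the quantity the RL agent optimizes. The toy graph below is an assumption chosen only to show that different orders have different costs.

# Vertex (cross-country) elimination on a toy linearised computational graph.
from itertools import permutations

# edges[(i, j)] holds the local partial derivative dj/di; only the sparsity
# pattern matters for counting multiplications, so the values are arbitrary.
EDGES = {
    (0, 2): 1.0, (1, 2): 1.0,
    (2, 3): 1.0, (1, 3): 1.0,
    (3, 4): 1.0, (2, 4): 1.0,
    (4, 5): 1.0, (3, 5): 1.0,
}
INTERMEDIATES = (2, 3, 4)   # vertices 0, 1 are inputs; 5 is the output

def eliminate(edges, v):
    # Fuse every in-edge (p, v) with every out-edge (v, s): one multiplication
    # per pair, accumulated into the edge (p, s).
    preds = [p for (p, w) in edges if w == v]
    succs = [s for (w, s) in edges if w == v]
    cost = 0
    for p in preds:
        for s in succs:
            edges[(p, s)] = edges.get((p, s), 0.0) + edges[(p, v)] * edges[(v, s)]
            cost += 1
    for p in preds:
        del edges[(p, v)]
    for s in succs:
        del edges[(v, s)]
    return cost

def total_cost(order):
    edges = dict(EDGES)
    return sum(eliminate(edges, v) for v in order)

for order in permutations(INTERMEDIATES):
    print(order, "->", total_cost(order), "multiplications")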
Provably Better Explanations with Optimized Aggregation of Feature Attributions
Authors: Thomas Decker, Ananta R. Bhattarai, Jindong Gu, Volker Tresp, Florian Buettner
Abstract
Using feature attributions for post-hoc explanations is a common practice to understand and verify the predictions of opaque machine learning models. Despite the numerous techniques available, individual methods often produce inconsistent and unstable results, calling their overall reliability into question. In this work, we aim to systematically improve the quality of feature attributions by combining multiple explanations across distinct methods or their variations. For this purpose, we propose a novel approach to derive optimal convex combinations of feature attributions that yield provable improvements of desired quality criteria such as robustness or faithfulness to the model behavior. Through extensive experiments involving various model architectures and popular feature attribution techniques, we demonstrate that our combination strategy consistently outperforms individual methods and existing baselines.
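As a sketch of the general idea (not the paper's provable optimization), the code below grid-searches convex weights over three toy attribution vectors and keeps the combination that maximizes a simple faithfulness proxy, namely the correlation between attribution magnitudes and the prediction drop when each feature is masked to its mean. The model, attribution methods, and proxy are illustrative assumptions.

# Convex combination of attribution vectors chosen by a faithfulness proxy.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=500)
model = LinearRegression().fit(X, y)
x = X[0]

# Three toy attribution methods for the single instance x.
attr_grad = model.coef_.copy()                          # "gradient"
attr_gxi = model.coef_ * (x - X.mean(axis=0))           # "gradient x input" vs. mean baseline
attr_noisy = attr_gxi + rng.normal(scale=0.5, size=6)   # a noisier method
candidates = np.stack([attr_grad, attr_gxi, attr_noisy])

# Prediction drop when each feature is masked to its mean (faithfulness target).
base = model.predict([x])[0]
drops = []
for j in range(len(x)):
    x_masked = x.copy()
    x_masked[j] = X[:, j].mean()
    drops.append(abs(base - model.predict([x_masked])[0]))

def faithfulness(attr):
    return np.corrcoef(np.abs(attr), drops)[0, 1]

# Grid search over the probability simplex for the best convex weights.
best = None
for w in itertools.product(np.linspace(0, 1, 11), repeat=3):
    if abs(sum(w) - 1.0) > 1e-9:
        continue
    score = faithfulness(np.tensordot(w, candidates, axes=1))
    if best is None or score > best[0]:
        best = (score, w)
print("best weights:", best[1], "faithfulness:", round(best[0], 3))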
Keyword: differential privacy
Tangent differential privacy
Contrastive explainable clustering with differential privacy
Marking the Pace: A Blockchain-Enhanced Privacy-Traceable Strategy for Federated Recommender Systems
Black Box Differential Privacy Auditing Using Total Variation Distance
Perturb-and-Project: Differentially Private Similarities and Marginals
Keyword: privacy
Tangent differential privacy
Contrastive explainable clustering with differential privacy
LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model
Marking the Pace: A Blockchain-Enhanced Privacy-Traceable Strategy for Federated Recommender Systems
When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain
Approximated Coded Computing: Towards Fast, Private and Secure Distributed Machine Learning
Black Box Differential Privacy Auditing Using Total Variation Distance
FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models
Perturb-and-Project: Differentially Private Similarities and Marginals
Concept Drift Detection using Ensemble of Integrally Private Models
Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation
Keyword: machine learning
Naming the Pain in Machine Learning-Enabled Systems Engineering
Dynamic Online Recommendation for Two-Sided Market with Bayesian Incentive Compatibility
On Regularization via Early Stopping for Least Squares Regression
Optimizing Autonomous Driving for Safety: A Human-Centric Approach with LLM-Enhanced RLHF
OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference
Rare Class Prediction Model for Smart Industry in Semiconductor Manufacturing
GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks
On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization
A Unified View of Group Fairness Tradeoffs Using Partial Information Decomposition
Contrastive explainable clustering with differential privacy
CTSyn: A Foundational Model for Cross Tabular Data Generation
Advanced Payment Security System: XGBoost, CatBoost and SMOTE Integrated
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
ConDiff: A Challenging Dataset for Neural Solvers of Partial Differential Equations
Unsupervised representation learning with Hebbian synaptic and structural plasticity in brain-like feedforward neural networks
When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain
Approximated Coded Computing: Towards Fast, Private and Secure Distributed Machine Learning
Probabilistic Weather Forecasting with Hierarchical Graph Neural Networks
GENIE: Watermarking Graph Neural Networks for Link Prediction
Black Box Differential Privacy Auditing Using Total Variation Distance
Diversified Batch Selection for Training Acceleration
Beyond Data, Towards Sustainability: A Sydney Case Study on Urban Digital Twins
Concept Drift Detection using Ensemble of Integrally Private Models
AGBD: A Global-scale Biomass Dataset
CarbonSense: A Multimodal Dataset and Baseline for Carbon Flux Modelling
ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks
Designs for Enabling Collaboration in Human-Machine Teaming via Interactive and Explainable Systems
Scaling up Probabilistic PDE Simulators with Structured Volumetric Information
GANetic Loss for Generative Adversarial Networks with a Focus on Medical Applications
Optimizing Automatic Differentiation with Deep Reinforcement Learning
Provably Better Explanations with Optimized Aggregation of Feature Attributions