New submissions for Monday, 17 June 2024 (showing 382 of 382 entries )

Keyword: differential privacy

Is Diffusion Model Safe? Severe Data Leakage via Gradient-Guided Diffusion Model

Authors: Jiayang Meng, Tao Huang, Hong Chen, Cuiping Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.09484
Pdf link: https://arxiv.org/pdf/2406.09484
Abstract Gradient leakage has been identified as a potential source of privacy breaches in modern image processing systems, where the adversary can completely reconstruct the training images from leaked gradients. However, existing methods are restricted to reconstructing low-resolution images where data leakage risks of image processing systems are not sufficiently explored. In this paper, by exploiting diffusion models, we propose an innovative gradient-guided fine-tuning method and introduce a new reconstruction attack that is capable of stealing private, high-resolution images from image processing systems through leaked gradients where severe data leakage encounters. Our attack method is easy to implement and requires little prior knowledge. The experimental results indicate that current reconstruction attacks can steal images only up to a resolution of $128 \times 128$ pixels, while our attack method can successfully recover and steal images with resolutions up to $512 \times 512$ pixels. Our attack method significantly outperforms the SOTA attack baselines in terms of both pixel-wise accuracy and time efficiency of image reconstruction. Furthermore, our attack can render differential privacy ineffective to some extent.
Keyword: privacy

Is Diffusion Model Safe? Severe Data Leakage via Gradient-Guided Diffusion Model
Authors: Jiayang Meng, Tao Huang, Hong Chen, Cuiping Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.09484
Pdf link: https://arxiv.org/pdf/2406.09484
Abstract Gradient leakage has been identified as a potential source of privacy breaches in modern image processing systems, where the adversary can completely reconstruct the training images from leaked gradients. However, existing methods are restricted to reconstructing low-resolution images where data leakage risks of image processing systems are not sufficiently explored. In this paper, by exploiting diffusion models, we propose an innovative gradient-guided fine-tuning method and introduce a new reconstruction attack that is capable of stealing private, high-resolution images from image processing systems through leaked gradients where severe data leakage encounters. Our attack method is easy to implement and requires little prior knowledge. The experimental results indicate that current reconstruction attacks can steal images only up to a resolution of $128 \times 128$ pixels, while our attack method can successfully recover and steal images with resolutions up to $512 \times 512$ pixels. Our attack method significantly outperforms the SOTA attack baselines in terms of both pixel-wise accuracy and time efficiency of image reconstruction. Furthermore, our attack can render differential privacy ineffective to some extent.
FLea: Addressing Data Scarcity and Label Skew in Federated Learning via Privacy-preserving Feature Augmentation
Authors: Tong Xia, Abhirup Ghosh, Xinchi Qiu, Cecilia Mascolo
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2406.09547
Pdf link: https://arxiv.org/pdf/2406.09547
Abstract Federated Learning (FL) enables model development by leveraging data distributed across numerous edge devices without transferring local data to a central server. However, existing FL methods still face challenges when dealing with scarce and label-skewed data across devices, resulting in local model overfitting and drift, consequently hindering the performance of the global model. In response to these challenges, we propose a pioneering framework called FLea, incorporating the following key components: i) A global feature buffer that stores activation-target pairs shared from multiple clients to support local training. This design mitigates local model drift caused by the absence of certain classes; ii) A feature augmentation approach based on local and global activation mix-ups for local training. This strategy enlarges the training samples, thereby reducing the risk of local overfitting; iii) An obfuscation method to minimize the correlation between intermediate activations and the source data, enhancing the privacy of shared features. To verify the superiority of FLea, we conduct extensive experiments using a wide range of data modalities, simulating different levels of local data scarcity and label skew. The results demonstrate that FLea consistently outperforms state-of-the-art FL counterparts (among 13 of the experimented 18 settings, the improvement is over 5% while concurrently mitigating the privacy vulnerabilities associated with shared features. Code is available at this https URL.
My Body My Choice: Human-Centric Full-Body Anonymization
Authors: Umur Aybars Ciftci, Ali Kemal Tanriverdi, Ilke Demir
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.09553
Pdf link: https://arxiv.org/pdf/2406.09553
Abstract In an era of increasing privacy concerns for our online presence, we propose that the decision to appear in a piece of content should only belong to the owner of the body. Although some automatic approaches for full-body anonymization have been proposed, human-guided anonymization can adapt to various contexts, such as cultural norms, personal relations, esthetic concerns, and security issues. ''My Body My Choice'' (MBMC) enables physical and adversarial anonymization by removal and swapping approaches aimed for four tasks, designed by single or multi, ControlNet or GAN modules, combining several diffusion models. We evaluate anonymization on seven datasets; compare with SOTA inpainting and anonymization methods; evaluate by image, adversarial, and generative metrics; and conduct reidentification experiments.
Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos
Authors: Qingyuan Liu, Pengyuan Shi, Yun-Yun Tsai, Chengzhi Mao, Junfeng Yang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.09601
Pdf link: https://arxiv.org/pdf/2406.09601
Abstract The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent works to combat Deepfakes videos have developed detectors that are highly accurate at identifying GAN-generated samples. However, the robustness of these detectors on diffusion-generated videos generated from video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and Pika, etc.) is still unexplored. In this paper, we propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models, such as Stable Video Diffusion. We find that the SOTA methods for detecting diffusion-generated images lack robustness in identifying diffusion-generated videos. Our analysis reveals that the effectiveness of these detectors diminishes when applied to out-of-domain videos, primarily because they struggle to track the temporal features and dynamic variations between frames. To address the above-mentioned challenge, we collect a new benchmark video dataset for diffusion-generated videos using SOTA video creation tools. We extract representation within explicit knowledge from the diffusion model for video frames and train our detector with a CNN + LSTM architecture. The evaluation shows that our framework can well capture the temporal features between frames, achieves 93.7% detection accuracy for in-domain videos, and improves the accuracy of out-domain videos by up to 16 points.
Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks
Authors: Yingchao Yu, Yuping Yan, Jisong Cai, Yaochu Jin
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2406.09680
Pdf link: https://arxiv.org/pdf/2406.09680
Abstract Federated learning (FL) has emerged as a promising paradigm for training models on decentralized data while safeguarding data privacy. Most existing FL systems, however, assume that all machine learning models are of the same type, although it becomes more likely that different edge devices adopt different types of AI models, including both conventional analogue artificial neural networks (ANNs) and biologically more plausible spiking neural networks (SNNs). This diversity empowers the efficient handling of specific tasks and requirements, showcasing the adaptability and versatility of edge computing platforms. One main challenge of such heterogeneous FL system lies in effectively aggregating models from the local devices in a privacy-preserving manner. To address the above issue, this work benchmarks FL systems containing both convoluntional neural networks (CNNs) and SNNs by comparing various aggregation approaches, including federated CNNs, federated SNNs, federated CNNs for SNNs, federated SNNs for CNNs, and federated CNNs with SNN fusion. Experimental results demonstrate that the CNN-SNN fusion framework exhibits the best performance among the above settings on the MNIST dataset. Additionally, intriguing phenomena of competitive suppression are noted during the convergence process of multi-model FL.
Privacy-preserving Quantification of Non-IID Degree in Federated Learning
Authors: Yuping Yan, Yizhi Wang, Yingchao Yu, Yaochu Jin
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.09682
Pdf link: https://arxiv.org/pdf/2406.09682
Abstract Federated learning (FL) offers a privacy-preserving approach to machine learning for multiple collaborators without sharing raw data. However, the existence of non-independent and non-identically distributed (non-IID) datasets across different clients presents a significant challenge to FL, leading to a sharp drop in accuracy, reduced efficiency, and hindered implementation. To address the non-IID problem, various methods have been proposed, including clustering and personalized FL frameworks. Nevertheless, to date, a formal quantitative definition of the non-IID degree between different clients' datasets is still missing, hindering the clients from comparing and obtaining an overview of their data distributions with other clients. For the first time, this paper proposes a quantitative definition of the non-IID degree in the federated environment by employing the cumulative distribution function (CDF), called Fully Homomorphic Encryption-based Federated Cumulative Distribution Function (FHE-FCDF). This method utilizes cryptographic primitive fully homomorphic encryption to enable clients to estimate the non-IID degree while ensuring privacy preservation. The experiments conducted on the CIFAR-100 non-IID dataset validate the effectiveness of our proposed method.
Speed-up of Data Analysis with Kernel Trick in Encrypted Domain
Authors: Joon Soo Yoo, Baek Kyung Song, Tae Min Ahn, Ji Won Heo, Ji Won Yoon
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09716
Pdf link: https://arxiv.org/pdf/2406.09716
Abstract Homomorphic encryption (HE) is pivotal for secure computation on encrypted data, crucial in privacy-preserving data analysis. However, efficiently processing high-dimensional data in HE, especially for machine learning and statistical (ML/STAT) algorithms, poses a challenge. In this paper, we present an effective acceleration method using the kernel method for HE schemes, enhancing time performance in ML/STAT algorithms within encrypted domains. This technique, independent of underlying HE mechanisms and complementing existing optimizations, notably reduces costly HE multiplications, offering near constant time complexity relative to data dimension. Aimed at accessibility, this method is tailored for data scientists and developers with limited cryptography background, facilitating advanced data analysis in secure environments.
Faster Convergence on Heterogeneous Federated Edge Learning: An Adaptive Sidelink-Assisted Data Multicasting Approach
Authors: Gang Hu, Yinglei Teng, Nan Wang, Zhu Han
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09776
Pdf link: https://arxiv.org/pdf/2406.09776
Abstract Federated Edge Learning (FEEL) emerges as a pioneering distributed machine learning paradigm for the 6G Hyper-Connectivity, harnessing data from the Internet of Things (IoT) devices while upholding data privacy. However, current FEEL algorithms struggle with non-independent and non-identically distributed (non-IID) data, leading to elevated communication costs and compromised model accuracy. To address these statistical imbalances within FEEL, we introduce a clustered data sharing framework, mitigating data heterogeneity by selectively sharing partial data from cluster heads to trusted associates through sidelink-aided multicasting. The collective communication pattern is integral to FEEL training, where both cluster formation and the efficiency of communication and computation impact training latency and accuracy simultaneously. To tackle the strictly coupled data sharing and resource optimization, we decompose the overall optimization problem into the clients clustering and effective data sharing subproblems. Specifically, a distribution-based adaptive clustering algorithm (DACA) is devised basing on three deductive cluster forming conditions, which ensures the maximum sharing yield. Meanwhile, we design a stochastic optimization based joint computed frequency and shared data volume optimization (JFVO) algorithm, determining the optimal resource allocation with an uncertain objective function. The experiments show that the proposed framework facilitates FEEL on non-IID datasets with faster convergence rate and higher model accuracy in a limited communication environment.
Same App, Different Behaviors: Uncovering Device-specific Behaviors in Android Apps
Authors: Zikan Dong, Yanjie Zhao, Tianming Liu, Chao Wang, Guosheng Xu, Guoai Xu, Haoyu Wang
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2406.09807
Pdf link: https://arxiv.org/pdf/2406.09807
Abstract The Android ecosystem faces a notable challenge known as fragmentation, which denotes the extensive diversity within the system. This issue is mainly related to differences in system versions, device hardware specifications, and customizations introduced by manufacturers. The growing divergence among devices leads to marked variations in how a given app behaves across diverse devices. This is referred to as device-specific behaviors. In this work, we present the first large-scale empirical study of device-specific behaviors in real-world Android apps. We have designed a three-phase static analysis framework to accurately detect and understand the device-specific behaviors. Upon employing our tool on a dataset comprising more than 20,000 apps, we detected device-specific behaviors in 2,357 of them. By examining the distribution of device-specific behaviors, our analysis revealed that apps within the Chinese third-party app market exhibit more relevant behaviors compared to their counterparts in Google Play. Additionally, these behaviors are more likely to feature dominant brands that hold larger market shares. Reflecting this, we have classified these device-specific behaviors into 29 categories based on implemented functionalities, providing structured insight into these behaviors. Beyond common behaviors like issue fixes and feature adaptations, we observed 33 aggressive apps, including popular ones with millions of downloads, abusing system properties of customized ROMs to obtain user-unresettable identifiers without requiring permission, substantially impacting user privacy. Finally, we investigated the origins of device-specific behaviors, revealing significant challenges developers face in implementing them comprehensively. Our research sheds light on the promising but less touched research direction of device-specific behaviors, benefiting community stakeholders.
Federated Learning driven Large Language Models for Swarm Intelligence: A Survey
Authors: Youyang Qu
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2406.09831
Pdf link: https://arxiv.org/pdf/2406.09831
Abstract Federated learning (FL) offers a compelling framework for training large language models (LLMs) while addressing data privacy and decentralization challenges. This paper surveys recent advancements in the federated learning of large language models, with a particular focus on machine unlearning, a crucial aspect for complying with privacy regulations like the Right to be Forgotten. Machine unlearning in the context of federated LLMs involves systematically and securely removing individual data contributions from the learned model without retraining from scratch. We explore various strategies that enable effective unlearning, such as perturbation techniques, model decomposition, and incremental learning, highlighting their implications for maintaining model performance and data privacy. Furthermore, we examine case studies and experimental results from recent literature to assess the effectiveness and efficiency of these approaches in real-world scenarios. Our survey reveals a growing interest in developing more robust and scalable federated unlearning methods, suggesting a vital area for future research in the intersection of AI ethics and distributed machine learning technologies.
Dataset Condensation with Latent Quantile Matching
Authors: Wei Wei, Tom De Schepper, Kevin Mets
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.09860
Pdf link: https://arxiv.org/pdf/2406.09860
Abstract Dataset condensation (DC) methods aim to learn a smaller synthesized dataset with informative data records to accelerate the training of machine learning models. Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real dataset. However two distributions with the same mean can still be vastly different. In this work we demonstrate the shortcomings of using Maximum Mean Discrepancy to match latent distributions i.e. the weak matching power and lack of outlier regularization. To alleviate these shortcomings we propose our new method: Latent Quantile Matching (LQM) which matches the quantiles of the latent embeddings to minimize the goodness of fit test statistic between two distributions. Empirical experiments on both image and graph-structured datasets show that LQM matches or outperforms previous state of the art in distribution matching based DC. Moreover we show that LQM improves the performance in continual graph learning (CGL) setting where memory efficiency and privacy can be important. Our work sheds light on the application of DM based DC for CGL.
Consistent Update Synthesis via Privatized Beliefs
Authors: Thomas Schlögl, Roman Kuznets, Giorgio Cignarale
Subjects: Subjects: Logic in Computer Science (cs.LO); Logic (math.LO)
Arxiv link: https://arxiv.org/abs/2406.10010
Pdf link: https://arxiv.org/pdf/2406.10010
Abstract Kripke models are an effective and widely used tool for representing epistemic attitudes of agents in multi-agent systems, including distributed systems. Dynamic Epistemic Logic (DEL) adds communication in the form of model transforming updates. Private communication is key in distributed systems as processes exchanging (potentially corrupted) information about their private local state should not be detectable by any other processes. This focus on privacy clashes with the standard DEL assumption for which updates are applied to the whole Kripke model, which is usually commonly known by all agents, potentially leading to information leakage. In addition, a commonly known model cannot minimize the corruption of agents' local states due to fault information dissemination. The contribution of this paper is twofold: (I) To represent leak-free agent-to-agent communication, we introduce a way to synthesize an action model which stratifies a pointed Kripke model into private agent-clusters, each representing the local knowledge of the processes: Given a goal formula $\varphi$ representing the effect of private communication, we provide a procedure to construct an action model that (a) makes the goal formula true, (b) maintain consistency of agents' beliefs, if possible, without causing "unrelated" beliefs (minimal change) thus minimizing the corruption of local states in case of inconsistent information. (II) We introduce a new operation between pointed Kripke models and pointed action models called pointed updates which, unlike the product update operation of DEL, maintain only the subset of the world-event pairs that are reachable from the point, without unnecessarily blowing up the model size.
Unobtrusive Monitoring of Physical Weakness: A Simulated Approach
Authors: Chen Long-fei, Muhammad Ahmed Raza, Craig Innes, Subramanian Ramamoorthy, Robert B. Fisher
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.10045
Pdf link: https://arxiv.org/pdf/2406.10045
Abstract Aging and chronic conditions affect older adults' daily lives, making early detection of developing health issues crucial. Weakness, common in many conditions, alters physical movements and daily activities subtly. However, detecting such changes can be challenging due to their subtle and gradual nature. To address this, we employ a non-intrusive camera sensor to monitor individuals' daily sitting and relaxing activities for signs of weakness. We simulate weakness in healthy subjects by having them perform physical exercise and observing the behavioral changes in their daily activities before and after workouts. The proposed system captures fine-grained features related to body motion, inactivity, and environmental context in real-time while prioritizing privacy. A Bayesian Network is used to model the relationships between features, activities, and health conditions. We aim to identify specific features and activities that indicate such changes and determine the most suitable time scale for observing the change. Results show 0.97 accuracy in distinguishing simulated weakness at the daily level. Fine-grained behavioral features, including non-dominant upper body motion speed and scale, and inactivity distribution, along with a 300-second window, are found most effective. However, individual-specific models are recommended as no universal set of optimal features and activities was identified across all participants.
Data Ethics in the Era of Healthcare Artificial Intelligence in Africa: An Ubuntu Philosophy Perspective
Authors: Abdoul Jalil Djiberou Mahamadou, Aloysius Ochasi, Russ B. Altman
Subjects: Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.10121
Pdf link: https://arxiv.org/pdf/2406.10121
Abstract Data are essential in developing healthcare artificial intelligence (AI) systems. However, patient data collection, access, and use raise ethical concerns, including informed consent, data bias, data protection and privacy, data ownership, and benefit sharing. Various ethical frameworks have been proposed to ensure the ethical use of healthcare data and AI, however, these frameworks often align with Western cultural values, social norms, and institutional contexts emphasizing individual autonomy and well-being. Ethical guidelines must reflect political and cultural settings to account for cultural diversity, inclusivity, and historical factors such as colonialism. Thus, this paper discusses healthcare data ethics in the AI era in Africa from the Ubuntu philosophy perspective. It focuses on the contrast between individualistic and communitarian approaches to data ethics. The proposed framework could inform stakeholders, including AI developers, healthcare providers, the public, and policy-makers about healthcare data ethical usage in AI in Africa.
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Authors: Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2406.10209
Pdf link: https://arxiv.org/pdf/2406.10209
Abstract Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
Keyword: machine learning

Simulating Realistic Post-Stroke Reaching Kinematics with Generative Adversarial Networks
Authors: Aaron J. Hadley, Christopher L. Pulliam
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09451
Pdf link: https://arxiv.org/pdf/2406.09451
Abstract The generalizability of machine learning (ML) models for wearable monitoring in stroke rehabilitation is often constrained by the limited scale and heterogeneity of available data. Data augmentation addresses this challenge by adding computationally derived data to real data to enrich the variability represented in the training set. Traditional augmentation methods, such as rotation, permutation, and time-warping, have shown some benefits in improving classifier performance, but often fail to produce realistic training examples. This study employs Conditional Generative Adversarial Networks (cGANs) to create synthetic kinematic data from a publicly available dataset, closely mimicking the experimentally measured reaching movements of stroke survivors. This approach not only captures the complex temporal dynamics and common movement patterns after stroke, but also significantly enhances the training dataset. By training deep learning models on both synthetic and experimental data, we achieved a substantial enhancement in task classification accuracy: models incorporating synthetic data attained an overall accuracy of 80.2%, significantly higher than the 63.1% seen in models trained solely with real data. These improvements allow for more precise task classification, offering clinicians the potential to monitor patient progress more accurately and tailor rehabilitation interventions more effectively.
FeatNavigator: Automatic Feature Augmentation on Tabular Data
Authors: Jiaming Liang, Chuan Lei, Xiao Qin, Jiani Zhang, Asterios Katsifodimos, Christos Faloutsos, Huzefa Rangwala
Subjects: Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09534
Pdf link: https://arxiv.org/pdf/2406.09534
Abstract Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.
Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale
Authors: A. Feder Cooper
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2406.09548
Pdf link: https://arxiv.org/pdf/2406.09548
Abstract To develop rigorous knowledge about ML models -- and the systems in which they are embedded -- we need reliable measurements. But reliable measurement is fundamentally challenging, and touches on issues of reproducibility, scalability, uncertainty quantification, epistemology, and more. This dissertation addresses criteria needed to take reliability seriously: both criteria for designing meaningful metrics, and for methodologies that ensure that we can dependably and efficiently measure these metrics at scale and in practice. In doing so, this dissertation articulates a research vision for a new field of scholarship at the intersection of machine learning, law, and policy. Within this frame, we cover topics that fit under three different themes: (1) quantifying and mitigating sources of arbitrariness in ML, (2) taming randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability, and (3) providing methods for evaluating generative-AI systems, with specific focuses on quantifying memorization in language models and training latent diffusion models on open-licensed data. By making contributions in these three themes, this dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately and inescapably bound up with research in law and policy. These different disciplines pose similar research questions about reliable measurement in machine learning. They are, in fact, two complementary sides of the same research vision, which, broadly construed, aims to construct machine-learning systems that cohere with broader societal values.
Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis
Authors: Zongyue Qin, Yunsheng Bai, Atefeh Sograbizadeh, Zijian Ding, Ziniu Hu, Yizhou Sun, Jason Cong
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2406.09606
Pdf link: https://arxiv.org/pdf/2406.09606
Abstract In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as \textit{pragmas}. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler's data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to $22\%$, and identifies designs with an average of $1.10\times$ and $1.26\times$ (up to $8.17\times$ and $13.31\times$) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.
Multi-Modal Retrieval For Large Language Model Based Speech Recognition
Authors: Jari Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ariya Rastrow, Grant Strimel, Ivan Bulyko
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2406.09618
Pdf link: https://arxiv.org/pdf/2406.09618
Abstract Retrieval is a widely adopted approach for improving language models leveraging external information. As the field moves towards multi-modal large language models, it is important to extend the pure text based methods to incorporate other modalities in retrieval as well for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieval with two approaches: kNN-LM and cross-attention techniques. We demonstrate the effectiveness of our retrieval approaches empirically by applying them to automatic speech recognition tasks with access to external information. Under this setting, we show that speech-based multi-modal retrieval outperforms text based retrieval, and yields up to 50 % improvement in word error rate over the multi-modal language model baseline. Furthermore, we achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks
Authors: Mohamed Elrefaie, Florin Morar, Angela Dai, Faez Ahmed
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2406.09624
Pdf link: https://arxiv.org/pdf/2406.09624
Abstract We present DrivAerNet++, the largest and most comprehensive multimodal dataset for aerodynamic car design. DrivAerNet++ comprises 8,000 diverse car designs modeled with high-fidelity computational fluid dynamics (CFD) simulations. The dataset includes diverse car configurations such as fastback, notchback, and estateback, with different underbody and wheel designs to represent both internal combustion engines and electric vehicles. Each entry in the dataset features detailed 3D meshes, parametric models, aerodynamic coefficients, and extensive flow and surface field data, along with segmented parts for car classification and point cloud data. This dataset supports a wide array of machine learning applications including data-driven design optimization, generative modeling, surrogate model training, CFD simulation acceleration, and geometric classification. With more than 39 TB of publicly available engineering data, DrivAerNet++ fills a significant gap in available resources, providing high-quality, diverse data to enhance model training, promote generalization, and accelerate automotive design processes. Along with rigorous dataset validation, we also provide ML benchmarking results on the task of aerodynamic drag prediction, showcasing the breadth of applications supported by our dataset. This dataset is set to significantly impact automotive design and broader engineering disciplines by fostering innovation and improving the fidelity of aerodynamic evaluations.
Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition
Authors: Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09630
Pdf link: https://arxiv.org/pdf/2406.09630
Abstract We present the Manuscripts of Handwritten Arabic~(Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.
Learning Language Structures through Grounding
Authors: Freda Shi
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.09662
Pdf link: https://arxiv.org/pdf/2406.09662
Abstract Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this dissertation, we consider a family of machine learning tasks that aim to learn language structures through grounding. We seek distant supervision from other data sources (i.e., grounds), including but not limited to other modalities (e.g., vision), execution results of programs, and other languages. We demonstrate the potential of this task formulation and advocate for its adoption through three schemes. In Part I, we consider learning syntactic parses through visual grounding. We propose the task of visually grounded grammar induction, present the first models to induce syntactic structures from visually grounded text and speech, and find that the visual grounding signals can help improve the parsing quality over language-only models. As a side contribution, we propose a novel evaluation metric that enables the evaluation of speech parsing without text or automatic speech recognition systems involved. In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures (i.e., programs), significantly improving compositional generalization and few-shot program synthesis. In Part III, we propose methods that learn language structures from annotations in other languages. Specifically, we propose a method that sets a new state of the art on cross-lingual word alignment. We then leverage the learned word alignments to improve the performance of zero-shot cross-lingual dependency parsing, by proposing a novel substructure-based projection method that preserves structural knowledge learned from the source language.
Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks
Authors: Yingchao Yu, Yuping Yan, Jisong Cai, Yaochu Jin
Subjects: Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2406.09680
Pdf link: https://arxiv.org/pdf/2406.09680
Abstract Federated learning (FL) has emerged as a promising paradigm for training models on decentralized data while safeguarding data privacy. Most existing FL systems, however, assume that all machine learning models are of the same type, although it becomes more likely that different edge devices adopt different types of AI models, including both conventional analogue artificial neural networks (ANNs) and biologically more plausible spiking neural networks (SNNs). This diversity empowers the efficient handling of specific tasks and requirements, showcasing the adaptability and versatility of edge computing platforms. One main challenge of such heterogeneous FL system lies in effectively aggregating models from the local devices in a privacy-preserving manner. To address the above issue, this work benchmarks FL systems containing both convoluntional neural networks (CNNs) and SNNs by comparing various aggregation approaches, including federated CNNs, federated SNNs, federated CNNs for SNNs, federated SNNs for CNNs, and federated CNNs with SNN fusion. Experimental results demonstrate that the CNN-SNN fusion framework exhibits the best performance among the above settings on the MNIST dataset. Additionally, intriguing phenomena of competitive suppression are noted during the convergence process of multi-model FL.
Privacy-preserving Quantification of Non-IID Degree in Federated Learning
Authors: Yuping Yan, Yizhi Wang, Yingchao Yu, Yaochu Jin
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.09682
Pdf link: https://arxiv.org/pdf/2406.09682
Abstract Federated learning (FL) offers a privacy-preserving approach to machine learning for multiple collaborators without sharing raw data. However, the existence of non-independent and non-identically distributed (non-IID) datasets across different clients presents a significant challenge to FL, leading to a sharp drop in accuracy, reduced efficiency, and hindered implementation. To address the non-IID problem, various methods have been proposed, including clustering and personalized FL frameworks. Nevertheless, to date, a formal quantitative definition of the non-IID degree between different clients' datasets is still missing, hindering the clients from comparing and obtaining an overview of their data distributions with other clients. For the first time, this paper proposes a quantitative definition of the non-IID degree in the federated environment by employing the cumulative distribution function (CDF), called Fully Homomorphic Encryption-based Federated Cumulative Distribution Function (FHE-FCDF). This method utilizes cryptographic primitive fully homomorphic encryption to enable clients to estimate the non-IID degree while ensuring privacy preservation. The experiments conducted on the CIFAR-100 non-IID dataset validate the effectiveness of our proposed method.
Explainable AI for Comparative Analysis of Intrusion Detection Models
Authors: Pap M. Corea, Yongxin Liu, Jian Wang, Shuteng Niu, Houbing Song
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2406.09684
Pdf link: https://arxiv.org/pdf/2406.09684
Abstract Explainable Artificial Intelligence (XAI) has become a widely discussed topic, the related technologies facilitate better understanding of conventional black-box models like Random Forest, Neural Networks and etc. However, domain-specific applications of XAI are still insufficient. To fill this gap, this research analyzes various machine learning models to the tasks of binary and multi-class classification for intrusion detection from network traffic on the same dataset using occlusion sensitivity. The models evaluated include Linear Regression, Logistic Regression, Linear Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest, Decision Trees, and Multi-Layer Perceptrons (MLP). We trained all models to the accuracy of 90\% on the UNSW-NB15 Dataset. We found that most classifiers leverage only less than three critical features to achieve such accuracies, indicating that effective feature engineering could actually be far more important for intrusion detection than applying complicated models. We also discover that Random Forest provides the best performance in terms of accuracy, time efficiency and robustness. Data and code available at this https URL
FreeCtrl: Constructing Control Centers with Feedforward Layers for Learning-Free Controllable Text Generation
Authors: Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Kezhi Mao
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2406.09688
Pdf link: https://arxiv.org/pdf/2406.09688
Abstract Controllable text generation (CTG) seeks to craft texts adhering to specific attributes, traditionally employing learning-based techniques such as training, fine-tuning, or prefix-tuning with attribute-specific datasets. These approaches, while effective, demand extensive computational and data resources. In contrast, some proposed learning-free alternatives circumvent learning but often yield inferior results, exemplifying the fundamental machine learning trade-off between computational expense and model efficacy. To overcome these limitations, we propose FreeCtrl, a learning-free approach that dynamically adjusts the weights of selected feedforward neural network (FFN) vectors to steer the outputs of large language models (LLMs). FreeCtrl hinges on the principle that the weights of different FFN vectors influence the likelihood of different tokens appearing in the output. By identifying and adaptively adjusting the weights of attribute-related FFN vectors, FreeCtrl can control the output likelihood of attribute keywords in the generated content. Extensive experiments on single- and multi-attribute control reveal that the learning-free FreeCtrl outperforms other learning-free and learning-based methods, successfully resolving the dilemma between learning costs and model performance.
Differentiable Programming for Differential Equations: A Review
Authors: Facundo Sapienza, Jordi Bolibar, Frank Schäfer, Brian Groenke, Avik Pal, Victor Boussange, Patrick Heimbach, Giles Hooker, Fernando Pérez, Per-Olof Persson, Christopher Rackauckas
Subjects: Subjects: Numerical Analysis (math.NA); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2406.09699
Pdf link: https://arxiv.org/pdf/2406.09699
Abstract The differentiable programming paradigm is a cornerstone of modern scientific computing. It refers to numerical methods for computing the gradient of a numerical model's output. Many scientific models are based on differential equations, where differentiable programming plays a crucial role in calculating model sensitivities, inverting model parameters, and training hybrid models that combine differential equations with data-driven approaches. Furthermore, recognizing the strong synergies between inverse methods and machine learning offers the opportunity to establish a coherent framework applicable to both fields. Differentiating functions based on the numerical solution of differential equations is non-trivial. Numerous methods based on a wide variety of paradigms have been proposed in the literature, each with pros and cons specific to the type of problem investigated. Here, we provide a comprehensive review of existing techniques to compute derivatives of numerical solutions of differential equations. We first discuss the importance of gradients of solutions of differential equations in a variety of scientific domains. Second, we lay out the mathematical foundations of the various approaches and compare them with each other. Third, we cover the computational considerations and explore the solutions available in modern scientific software. Last but not least, we provide best-practices and recommendations for practitioners. We hope that this work accelerates the fusion of scientific models and data, and fosters a modern approach to scientific modelling.
Speed-up of Data Analysis with Kernel Trick in Encrypted Domain
Authors: Joon Soo Yoo, Baek Kyung Song, Tae Min Ahn, Ji Won Heo, Ji Won Yoon
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09716
Pdf link: https://arxiv.org/pdf/2406.09716
Abstract Homomorphic encryption (HE) is pivotal for secure computation on encrypted data, crucial in privacy-preserving data analysis. However, efficiently processing high-dimensional data in HE, especially for machine learning and statistical (ML/STAT) algorithms, poses a challenge. In this paper, we present an effective acceleration method using the kernel method for HE schemes, enhancing time performance in ML/STAT algorithms within encrypted domains. This technique, independent of underlying HE mechanisms and complementing existing optimizations, notably reduces costly HE multiplications, offering near constant time complexity relative to data dimension. Aimed at accessibility, this method is tailored for data scientists and developers with limited cryptography background, facilitating advanced data analysis in secure environments.
Cross-view geo-localization: a survey
Authors: Abhilash Durgam, Sidike Paheding, Vikas Dhiman, Vijay Devabhaktuni
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09722
Pdf link: https://arxiv.org/pdf/2406.09722
Abstract Cross-view geo-localization has garnered notable attention in the realm of computer vision, spurred by the widespread availability of copious geotagged datasets and the advancements in machine learning techniques. This paper provides a thorough survey of cutting-edge methodologies, techniques, and associated challenges that are integral to this domain, with a focus on feature-based and deep learning strategies. Feature-based methods capitalize on unique features to establish correspondences across disparate viewpoints, whereas deep learning-based methodologies deploy convolutional neural networks to embed view-invariant attributes. This work also delineates the multifaceted challenges encountered in cross-view geo-localization, such as variations in viewpoints and illumination, the occurrence of occlusions, and it elucidates innovative solutions that have been formulated to tackle these issues. Furthermore, we delineate benchmark datasets and relevant evaluation metrics, and also perform a comparative analysis of state-of-the-art techniques. Finally, we conclude the paper with a discussion on prospective avenues for future research and the burgeoning applications of cross-view geo-localization in an intricately interconnected global landscape.
A Multivocal Review of MLOps Practices, Challenges and Open Issues
Authors: Beyza Eken, Samodha Pallewatta, Nguyen Khoi Tran, Ayse Tosun, Muhammad Ali Babar
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2406.09737
Pdf link: https://arxiv.org/pdf/2406.09737
Abstract With the increasing trend of Machine Learning (ML) enabled software applications, the paradigm of ML Operations (MLOps) has gained tremendous attention of researchers and practitioners. MLOps encompasses the practices and technologies for streamlining the resources and monitoring needs of operationalizing ML models. Software development practitioners need access to the detailed and easily understandable knowledge of MLOps workflows, practices, challenges and solutions to effectively and efficiently support the adoption of MLOps. Whilst the academic and industry literature on the MLOps has been growing rapidly, there have been relatively a few attempts at systematically synthesizing and analyzing the vast amount of existing literature of MLOps for improving ease of access and understanding. We conducted a Multivocal Literature Review (MLR) of 150 relevant academic studies and 48 gray literature to provide a comprehensive body of knowledge on MLOps. Through this MLR, we identified the emerging MLOps practices, adoption challenges and solutions related to various areas, including development and operation of complex pipelines, managing production at scale, managing artifacts, and ensuring quality, security, governance, and ethical aspects. We also report the socio-technical aspect of MLOps relating to diverse roles involved and collaboration practices across them through the MLOps lifecycle. We assert that this MLR provides valuable insights to researchers and practitioners seeking to navigate the rapidly evolving landscape of MLOps. We also identify the open issues that need to be addressed in order to advance the current state-of-the-art of MLOps.
Faster Convergence on Heterogeneous Federated Edge Learning: An Adaptive Sidelink-Assisted Data Multicasting Approach
Authors: Gang Hu, Yinglei Teng, Nan Wang, Zhu Han
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09776
Pdf link: https://arxiv.org/pdf/2406.09776
Abstract Federated Edge Learning (FEEL) emerges as a pioneering distributed machine learning paradigm for the 6G Hyper-Connectivity, harnessing data from the Internet of Things (IoT) devices while upholding data privacy. However, current FEEL algorithms struggle with non-independent and non-identically distributed (non-IID) data, leading to elevated communication costs and compromised model accuracy. To address these statistical imbalances within FEEL, we introduce a clustered data sharing framework, mitigating data heterogeneity by selectively sharing partial data from cluster heads to trusted associates through sidelink-aided multicasting. The collective communication pattern is integral to FEEL training, where both cluster formation and the efficiency of communication and computation impact training latency and accuracy simultaneously. To tackle the strictly coupled data sharing and resource optimization, we decompose the overall optimization problem into the clients clustering and effective data sharing subproblems. Specifically, a distribution-based adaptive clustering algorithm (DACA) is devised basing on three deductive cluster forming conditions, which ensures the maximum sharing yield. Meanwhile, we design a stochastic optimization based joint computed frequency and shared data volume optimization (JFVO) algorithm, determining the optimal resource allocation with an uncertain objective function. The experiments show that the proposed framework facilitates FEEL on non-IID datasets with faster convergence rate and higher model accuracy in a limited communication environment.
Implementing a Machine Learning Deformer for CG Crowds: Our Journey
Authors: Bastien Arcelin, Sebastien Maraux, Nicolas Chaverou
Subjects: Subjects: Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2406.09783
Pdf link: https://arxiv.org/pdf/2406.09783
Abstract CG crowds have become increasingly popular this last decade in the VFX and animation industry: formerly reserved to only a few high end studios and blockbusters, they are now widely used in TV shows or commercials. Yet, there is still one major limitation: in order to be ingested properly in crowd software, studio rigs have to comply with specific prerequisites, especially in terms of deformations. Usually only skinning, blend shapes and geometry caches are supported preventing close-up shots with facial performances on crowd characters. We envisioned two approaches to tackle this: either reverse engineer the hundreds of deformer nodes available in the major DCCs/plugins and incorporate them in our crowd package, or surf the machine learning wave to compress the deformations of a rig using a neural network architecture. Considering we could not commit 5+ man/years of development into this problem, and that we were excited to dip our toes in the machine learning pool, we went for the latter. From our first tests to a minimum viable product, we went through hopes and disappointments: we hit multiple pitfalls, took false shortcuts and dead ends before reaching our destination. With this paper, we hope to provide a valuable feedback by sharing the lessons we learnt from this experience.
Soil nitrogen forecasting from environmental variables provided by multisensor remote sensing images
Authors: Weiying Zhao, Ganzorig Chuluunbat, Aleksei Unagaev, Natalia Efremova
Subjects: Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2406.09812
Pdf link: https://arxiv.org/pdf/2406.09812
Abstract This study introduces a framework for forecasting soil nitrogen content, leveraging multi-modal data, including multi-sensor remote sensing images and advanced machine learning methods. We integrate the Land Use/Land Cover Area Frame Survey (LUCAS) database, which covers European and UK territory, with environmental variables from satellite sensors to create a dataset of novel features. We further test a broad range of machine learning algorithms, focusing on tree-based models such as CatBoost, LightGBM, and XGBoost. We test the proposed methods with a variety of land cover classes, including croplands and grasslands to ensure the robustness of this approach. Our results demonstrate that the CatBoost model surpasses other methods in accuracy. This research advances the field of agricultural management and environmental monitoring and demonstrates the significant potential of integrating multisensor remote sensing data with machine learning for environmental analysis.
Federated Learning driven Large Language Models for Swarm Intelligence: A Survey
Authors: Youyang Qu
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2406.09831
Pdf link: https://arxiv.org/pdf/2406.09831
Abstract Federated learning (FL) offers a compelling framework for training large language models (LLMs) while addressing data privacy and decentralization challenges. This paper surveys recent advancements in the federated learning of large language models, with a particular focus on machine unlearning, a crucial aspect for complying with privacy regulations like the Right to be Forgotten. Machine unlearning in the context of federated LLMs involves systematically and securely removing individual data contributions from the learned model without retraining from scratch. We explore various strategies that enable effective unlearning, such as perturbation techniques, model decomposition, and incremental learning, highlighting their implications for maintaining model performance and data privacy. Furthermore, we examine case studies and experimental results from recent literature to assess the effectiveness and efficiency of these approaches in real-world scenarios. Our survey reveals a growing interest in developing more robust and scalable federated unlearning methods, suggesting a vital area for future research in the intersection of AI ethics and distributed machine learning technologies.
Dataset Condensation with Latent Quantile Matching
Authors: Wei Wei, Tom De Schepper, Kevin Mets
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.09860
Pdf link: https://arxiv.org/pdf/2406.09860
Abstract Dataset condensation (DC) methods aim to learn a smaller synthesized dataset with informative data records to accelerate the training of machine learning models. Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real dataset. However two distributions with the same mean can still be vastly different. In this work we demonstrate the shortcomings of using Maximum Mean Discrepancy to match latent distributions i.e. the weak matching power and lack of outlier regularization. To alleviate these shortcomings we propose our new method: Latent Quantile Matching (LQM) which matches the quantiles of the latent embeddings to minimize the goodness of fit test statistic between two distributions. Empirical experiments on both image and graph-structured datasets show that LQM matches or outperforms previous state of the art in distribution matching based DC. Moreover we show that LQM improves the performance in continual graph learning (CGL) setting where memory efficiency and privacy can be important. Our work sheds light on the application of DM based DC for CGL.
Positive-Unlabelled Learning for Identifying New Candidate Dietary Restriction-related Genes among Ageing-related Genes
Authors: Jorge Paz-Ruza, Alex A. Freitas, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.09898
Pdf link: https://arxiv.org/pdf/2406.09898
Abstract Dietary Restriction (DR) is one of the most popular anti-ageing interventions, prompting exhaustive research into genes associated with its mechanisms. Recently, Machine Learning (ML) has been explored to identify potential DR-related genes among ageing-related genes, aiming to minimize costly wet lab experiments needed to expand our knowledge on DR. However, to train a model from positive (DR-related) and negative (non-DR-related) examples, existing ML methods naively label genes without known DR relation as negative examples, assuming that lack of DR-related annotation for a gene represents evidence of absence of DR-relatedness, rather than absence of evidence; this hinders the reliability of the negative examples (non-DR-related genes) and the method's ability to identify novel DR-related genes. This work introduces a novel gene prioritization method based on the two-step Positive-Unlabelled (PU) Learning paradigm: using a similarity-based, KNN-inspired approach, our method first selects reliable negative examples among the genes without known DR associations. Then, these reliable negatives and all known positives are used to train a classifier that effectively differentiates DR-related and non-DR-related genes, which is finally employed to generate a more reliable ranking of promising genes for novel DR-relatedness. Our method significantly outperforms the existing state-of-the-art non-PU approach for DR-relatedness prediction in three relevant performance metrics. In addition, curation of existing literature finds support for the top-ranked candidate DR-related genes identified by our model.
Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem
Authors: Zhentao Tan, Yadong Mu
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2406.09899
Pdf link: https://arxiv.org/pdf/2406.09899
Abstract Recently various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAPs), which stands as a formidable challenge in combinatorial optimization. While many instances of simpler problems admit fully polynomial-time approximate solution (FPTAS), QAP is shown to be strongly NP-hard. Even finding a FPTAS for QAP is difficult, in the sense that the existence of a FPTAS implies $P = NP$. Current research on QAPs suffer from limited scale and computational inefficiency. To attack the aforementioned issues, we here propose the first solution of its kind for QAP in the learn-to-improve category. This work encodes facility and location nodes separately, instead of forming computationally intensive association graphs prevalent in current approaches. This design choice enables scalability to larger problem sizes. Furthermore, a \textbf{S}olution \textbf{AW}are \textbf{T}ransformer (SAWT) architecture integrates the incumbent solution matrix with the attention score to effectively capture higher-order information of the QAPs. Our model's effectiveness is validated through extensive experiments on self-generated QAP instances of varying sizes and the QAPLIB benchmark.
Impact of Speech Mode in Automatic Pathological Speech Detection
Authors: Shakeel A. Sheikh, Ina Kodrasi
Subjects: Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2406.09968
Pdf link: https://arxiv.org/pdf/2406.09968
Abstract Automatic pathological speech detection approaches yield promising results in identifying various pathologies. These approaches are typically designed and evaluated for phonetically-controlled speech scenarios, where speakers are prompted to articulate identical phonetic content. While gathering controlled speech recordings can be laborious, spontaneous speech can be conveniently acquired as potential patients navigate their daily routines. Further, spontaneous speech can be valuable in detecting subtle and abstract cues of pathological speech. Nonetheless, the efficacy of automatic pathological speech detection for spontaneous speech remains unexplored. This paper analyzes the influence of speech mode on pathological speech detection approaches, examining two distinct categories of approaches, i.e., classical machine learning and deep learning. Results indicate that classical approaches may struggle to capture pathology-discriminant cues in spontaneous speech. In contrast, deep learning approaches demonstrate superior performance, managing to extract additional cues that were previously inaccessible in non-spontaneous speech
Challenges in explaining deep learning models for data with biological variation
Authors: Lenka Tětková, Erik Schou Dreier, Robin Malm, Lars Kai Hansen
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2406.09981
Pdf link: https://arxiv.org/pdf/2406.09981
Abstract Much machine learning research progress is based on developing models and evaluating them on a benchmark dataset (e.g., ImageNet for images). However, applying such benchmark-successful methods to real-world data often does not work as expected. This is particularly the case for biological data where we expect variability at multiple time and spatial scales. In this work, we are using grain data and the goal is to detect diseases and damages. Pink fusarium, skinned grains, and other diseases and damages are key factors in setting the price of grains or excluding dangerous grains from food production. Apart from challenges stemming from differences of the data from the standard toy datasets, we also present challenges that need to be overcome when explaining deep learning models. For example, explainability methods have many hyperparameters that can give different results, and the ones published in the papers do not work on dissimilar images. Other challenges are more general: problems with visualization of the explanations and their comparison since the magnitudes of their values differ from method to method. An open fundamental question also is: How to evaluate explanations? It is a non-trivial task because the "ground truth" is usually missing or ill-defined. Also, human annotators may create what they think is an explanation of the task at hand, yet the machine learning model might solve it in a different and perhaps counter-intuitive way. We discuss several of these challenges and evaluate various post-hoc explainability methods on grain data. We focus on robustness, quality of explanations, and similarity to particular "ground truth" annotations made by experts. The goal is to find the methods that overall perform well and could be used in this challenging task. We hope the proposed pipeline will be used as a framework for evaluating explainability methods in specific use cases.
Global Crop-Specific Fertilization Dataset from 1961-2019
Authors: Fernando Coello, Thomas Decorte, Iris Janssens, Steven Mortier, Jordi Sardans, Josep Peñuelas, Tim Verdonck
Subjects: Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2406.10001
Pdf link: https://arxiv.org/pdf/2406.10001
Abstract As global fertilizer application rates increase, high-quality datasets are paramount for comprehensive analyses to support informed decision-making and policy formulation in crucial areas such as food security or climate change. This study aims to fill existing data gaps by employing two machine learning models, eXtreme Gradient Boosting and HistGradientBoosting algorithms to produce precise country-level predictions of nitrogen ($N$), phosphorus pentoxide ($P_2O_5$), and potassium oxide ($K_2O$) application rates. Subsequently, we created a comprehensive dataset of 5-arcmin resolution maps depicting the application rates of each fertilizer for 13 major crop groups from 1961 to 2019. The predictions were validated by both comparing with existing databases and by assessing the drivers of fertilizer application rates using the model's SHapley Additive exPlanations. This extensive dataset is poised to be a valuable resource for assessing fertilization trends, identifying the socioeconomic, agricultural, and environmental drivers of fertilizer application rates, and serving as an input for various applications, including environmental modeling, causal analysis, fertilizer price predictions, and forecasting.
Off-Policy Evaluation from Logged Human Feedback
Authors: Aniruddha Bhargava, Lalit Jain, Branislav Kveton, Ge Liu, Subhojyoti Mukherjee
Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2406.10030
Pdf link: https://arxiv.org/pdf/2406.10030
Abstract Learning from human feedback has been central to recent advances in artificial intelligence and machine learning. Since the collection of human feedback is costly, a natural question to ask is if the new feedback always needs to collected. Or could we evaluate a new model with the human feedback on responses of another model? This motivates us to study off-policy evaluation from logged human feedback. We formalize the problem, propose both model-based and model-free estimators for policy values, and show how to optimize them. We analyze unbiasedness of our estimators and evaluate them empirically. Our estimators can predict the absolute values of evaluated policies, rank them, and be optimized.
Information Compression in the AI Era: Recent Advances and Future Challenges
Authors: Jun Chen, Yong Fang, Ashish Khisti, Ayfer Ozgur, Nir Shlezinger, Chao Tian
Subjects: Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2406.10036
Pdf link: https://arxiv.org/pdf/2406.10036
Abstract This survey articles focuses on emerging connections between the fields of machine learning and data compression. While fundamental limits of classical (lossy) data compression are established using rate-distortion theory, the connections to machine learning have resulted in new theoretical analysis and application areas. We survey recent works on task-based and goal-oriented compression, the rate-distortion-perception theory and compression for estimation and inference. Deep learning based approaches also provide natural data-driven algorithmic approaches to compression. We survey recent works on applying deep learning techniques to task-based or goal-oriented compression, as well as image and video compression. We also discuss the potential use of large language models for text compression. We finally provide some directions for future research in this promising field.
Comparison of fine-tuning strategies for transfer learning in medical image classification
Authors: Ana Davila, Jacinto Colan, Yasuhisa Hasegawa
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.10050
Pdf link: https://arxiv.org/pdf/2406.10050
Abstract In the context of medical imaging and machine learning, one of the most pressing challenges is the effective adaptation of pre-trained models to specialized medical contexts. Despite the availability of advanced pre-trained models, their direct application to the highly specialized and diverse field of medical imaging often falls short due to the unique characteristics of medical data. This study provides a comprehensive analysis on the performance of various fine-tuning methods applied to pre-trained models across a spectrum of medical imaging domains, including X-ray, MRI, Histology, Dermoscopy, and Endoscopic surgery. We evaluated eight fine-tuning strategies, including standard techniques such as fine-tuning all layers or fine-tuning only the classifier layers, alongside methods such as gradually unfreezing layers, regularization based fine-tuning and adaptive learning rates. We selected three well-established CNN architectures (ResNet-50, DenseNet-121, and VGG-19) to cover a range of learning and feature extraction scenarios. Although our results indicate that the efficacy of these fine-tuning methods significantly varies depending on both the architecture and the medical imaging type, strategies such as combining Linear Probing with Full Fine-tuning resulted in notable improvements in over 50% of the evaluated cases, demonstrating general effectiveness across medical domains. Moreover, Auto-RGN, which dynamically adjusts learning rates, led to performance enhancements of up to 11% for specific modalities. Additionally, the DenseNet architecture showed more pronounced benefits from alternative fine-tuning approaches compared to traditional full fine-tuning. This work not only provides valuable insights for optimizing pre-trained models in medical image analysis but also suggests the potential for future research into more advanced architectures and fine-tuning methods.
TACCO: Task-guided Co-clustering of Clinical Concepts and Patient Visits for Disease Subtyping based on EHR Data
Authors: Ziyang Zhang, Hejie Cui, Ran Xu, Yuzhang Xie, Joyce C. Ho, Carl Yang
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.10061
Pdf link: https://arxiv.org/pdf/2406.10061
Abstract The growing availability of well-organized Electronic Health Records (EHR) data has enabled the development of various machine learning models towards disease risk prediction. However, existing risk prediction methods overlook the heterogeneity of complex diseases, failing to model the potential disease subtypes regarding their corresponding patient visits and clinical concept subgroups. In this work, we introduce TACCO, a novel framework that jointly discovers clusters of clinical concepts and patient visits based on a hypergraph modeling of EHR data. Specifically, we develop a novel self-supervised co-clustering framework that can be guided by the risk prediction task of specific diseases. Furthermore, we enhance the hypergraph model of EHR data with textual embeddings and enforce the alignment between the clusters of clinical concepts and patient visits through a contrastive objective. Comprehensive experiments conducted on the public MIMIC-III dataset and Emory internal CRADLE dataset over the downstream clinical tasks of phenotype classification and cardiovascular risk prediction demonstrate an average 31.25% performance improvement compared to traditional ML baselines and a 5.26% improvement on top of the vanilla hypergraph model without our co-clustering mechanism. In-depth model analysis, clustering results analysis, and clinical case studies further validate the improved utilities and insightful interpretations delivered by TACCO. Code is available at this https URL.
Biomarker based Cancer Classification using an Ensemble with Pre-trained Models
Authors: Chongmin Lee, Jihie Kim
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2406.10087
Pdf link: https://arxiv.org/pdf/2406.10087
Abstract Certain cancer types, namely pancreatic cancer is difficult to detect at an early stage; sparking the importance of discovering the causal relationship between biomarkers and cancer to identify cancer efficiently. By allowing for the detection and monitoring of specific biomarkers through a non-invasive method, liquid biopsies enhance the precision and efficacy of medical interventions, advocating the move towards personalized healthcare. Several machine learning algorithms such as Random Forest, SVM are utilized for classification, yet causing inefficiency due to the need for conducting hyperparameter tuning. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness especially on highly imbalanced datasets compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while merely using 500 PCA features; distinguishable from previous studies where they used more than 2,000 features for similar results.
Trustworthy Artificial Intelligence in the Context of Metrology
Authors: Tameem Adel, Sam Bilson, Mark Levene, Andrew Thompson
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.10117
Pdf link: https://arxiv.org/pdf/2406.10117
Abstract We review research at the National Physical Laboratory (NPL) in the area of trustworthy artificial intelligence (TAI), and more specifically trustworthy machine learning (TML), in the context of metrology, the science of measurement. We describe three broad themes of TAI: technical, socio-technical and social, which play key roles in ensuring that the developed models are trustworthy and can be relied upon to make responsible decisions. From a metrology perspective we emphasise uncertainty quantification (UQ), and its importance within the framework of TAI to enhance transparency and trust in the outputs of AI systems. We then discuss three research areas within TAI that we are working on at NPL, and examine the certification of AI systems in terms of adherence to the characteristics of TAI.
Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication
Authors: Sanjali Yadav, Bahar Asgari
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.10166
Pdf link: https://arxiv.org/pdf/2406.10166
Abstract Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning. These applications exploit the sparsity of matrices to reduce storage and computational demands. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Traditional hardware accelerators are tailored for specific sparsity patterns with fixed dataflow schemes - inner, outer, and row-wise but often perform suboptimally when the actual sparsity deviates from these predetermined patterns. As the use of SpGEMM expands across various domains, each with distinct sparsity characteristics, the demand for hardware accelerators that can efficiently handle a range of sparsity patterns is increasing. This paper presents a machine learning based approach for adaptively selecting the most appropriate dataflow scheme for SpGEMM tasks with diverse sparsity patterns. By employing decision trees and deep reinforcement learning, we explore the potential of these techniques to surpass heuristic-based methods in identifying optimal dataflow schemes. We evaluate our models by comparing their performance with that of a heuristic, highlighting the strengths and weaknesses of each approach. Our findings suggest that using machine learning for dynamic dataflow selection in hardware accelerators can provide upto 28 times gains.
Selecting Interpretability Techniques for Healthcare Machine Learning models
Authors: Daniel Sierra-Botero, Ana Molina-Taborda, Mario S. Valdés-Tresanco, Alejandro Hernández-Arango, Leonardo Espinosa-Leal, Alexander Karpenko, Olga Lopez-Acevedo
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2406.10213
Pdf link: https://arxiv.org/pdf/2406.10213
Abstract In healthcare there is a pursuit for employing interpretable algorithms to assist healthcare professionals in several decision scenarios. Following the Predictive, Descriptive and Relevant (PDR) framework, the definition of interpretable machine learning as a machine-learning model that explicitly and in a simple frame determines relationships either contained in data or learned by the model that are relevant for its functioning and the categorization of models by post-hoc, acquiring interpretability after training, or model-based, being intrinsically embedded in the algorithm design. We overview a selection of eight algorithms, both post-hoc and model-based, that can be used for such purposes.

qiaoyuet / arxiv_daily

New submissions for Monday, 17 June 2024 (showing 382 of 382 entries ) #122

Keyword: differential privacy

Is Diffusion Model Safe? Severe Data Leakage via Gradient-Guided Diffusion Model

Keyword: privacy

Is Diffusion Model Safe? Severe Data Leakage via Gradient-Guided Diffusion Model

FLea: Addressing Data Scarcity and Label Skew in Federated Learning via Privacy-preserving Feature Augmentation

My Body My Choice: Human-Centric Full-Body Anonymization

Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks

Privacy-preserving Quantification of Non-IID Degree in Federated Learning

Speed-up of Data Analysis with Kernel Trick in Encrypted Domain

Faster Convergence on Heterogeneous Federated Edge Learning: An Adaptive Sidelink-Assisted Data Multicasting Approach

Same App, Different Behaviors: Uncovering Device-specific Behaviors in Android Apps

Federated Learning driven Large Language Models for Swarm Intelligence: A Survey

Dataset Condensation with Latent Quantile Matching

Consistent Update Synthesis via Privatized Beliefs

Unobtrusive Monitoring of Physical Weakness: A Simulated Approach

Data Ethics in the Era of Healthcare Artificial Intelligence in Africa: An Ubuntu Philosophy Perspective

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Keyword: machine learning

Simulating Realistic Post-Stroke Reaching Kinematics with Generative Adversarial Networks

FeatNavigator: Automatic Feature Augmentation on Tabular Data

Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale

Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis

Multi-Modal Retrieval For Large Language Model Based Speech Recognition

DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks

Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition

Learning Language Structures through Grounding

Heterogeneous Federated Learning with Convolutional and Spiking Neural Networks

Privacy-preserving Quantification of Non-IID Degree in Federated Learning

Explainable AI for Comparative Analysis of Intrusion Detection Models

FreeCtrl: Constructing Control Centers with Feedforward Layers for Learning-Free Controllable Text Generation

Differentiable Programming for Differential Equations: A Review

Speed-up of Data Analysis with Kernel Trick in Encrypted Domain

Cross-view geo-localization: a survey

A Multivocal Review of MLOps Practices, Challenges and Open Issues

Faster Convergence on Heterogeneous Federated Edge Learning: An Adaptive Sidelink-Assisted Data Multicasting Approach

Implementing a Machine Learning Deformer for CG Crowds: Our Journey

Soil nitrogen forecasting from environmental variables provided by multisensor remote sensing images

Federated Learning driven Large Language Models for Swarm Intelligence: A Survey

Dataset Condensation with Latent Quantile Matching

Positive-Unlabelled Learning for Identifying New Candidate Dietary Restriction-related Genes among Ageing-related Genes

Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem

Impact of Speech Mode in Automatic Pathological Speech Detection

Challenges in explaining deep learning models for data with biological variation

Global Crop-Specific Fertilization Dataset from 1961-2019

Off-Policy Evaluation from Logged Human Feedback

Information Compression in the AI Era: Recent Advances and Future Challenges

Comparison of fine-tuning strategies for transfer learning in medical image classification

TACCO: Task-guided Co-clustering of Clinical Concepts and Patient Visits for Disease Subtyping based on EHR Data

Biomarker based Cancer Classification using an Ensemble with Pre-trained Models

Trustworthy Artificial Intelligence in the Context of Metrology

Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication

Selecting Interpretability Techniques for Healthcare Machine Learning models