masakhane-io / masakhane-reading-group

Agile reading group that works

Papers Voting #1

Open · jaderabbit opened this issue 4 years ago

jaderabbit commented 4 years ago

In this issue you can either propose a paper for the group to read (title, link, and a short description) or vote for papers that have already been proposed.

Example: https://github.com/hadyelsahar/awesome-reading-group/issues/1

elyesmanai commented 4 years ago

Generalization Through Memorization: Nearest Neighbor Language Models

https://openreview.net/pdf?id=HklBjCEKvH

Short Description:

The authors introduce kNN-LMs, which can significantly outperform standard language models by directly querying training examples at test time. The approach can be applied to any neural language model. The success of this method suggests that learning similarity functions between contexts may be an easier problem than predicting the next word from a given context.
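For intuition, here is a minimal sketch of the interpolation at the heart of kNN-LM, with a toy random datastore and a made-up λ; the actual paper retrieves from billions of stored context vectors with FAISS:

```python
import numpy as np

# Toy datastore of (context vector, next token) pairs; the paper builds this by
# running the trained LM over its training set.
datastore_keys = np.random.randn(1000, 64)             # hypothetical context representations
datastore_vals = np.random.randint(0, 50, size=1000)   # token id that followed each context
vocab_size = 50
lam = 0.25  # interpolation weight (a tuned hyperparameter in the paper)

def knn_lm_probs(query_vec, p_lm, k=8):
    """Interpolate the base LM distribution with a kNN distribution over the datastore."""
    dists = np.linalg.norm(datastore_keys - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])   # softmax over negative distances of the neighbours
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, idx in zip(weights, nearest):
        p_knn[datastore_vals[idx]] += w
    return lam * p_knn + (1 - lam) * p_lm

probs = knn_lm_probs(np.random.randn(64), np.full(vocab_size, 1.0 / vocab_size))
```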

dadelani commented 4 years ago

Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi

https://arxiv.org/abs/1912.02481

Short description: In this paper, we focus on two African languages, Yorùbá and Twi, and compare the word embeddings obtained from crawled data on the web, with word embeddings obtained from curated corpora and language-dependent processing. We analyse the noise in the publicly available corpora, collect high quality and noisy data for the two languages and quantify the improvements that depend not only on the amount of data but on the quality too. We also use different architectures that learn word representations both from surface forms and characters to further exploit all the available information which showed to be important for these languages. For the evaluation, we manually translate the wordsim-353 word pairs dataset from English into Yorùbá and Twi. We extend the analysis to contextual word embeddings and evaluate multilingual BERT on a named entity recognition task. For this, we annotate with named entities the Global Voices corpus for Yorùbá.
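As a side note, the wordsim-style evaluation mentioned above boils down to correlating embedding cosine similarities with human ratings; a rough sketch, with hypothetical `pairs` and `embeddings` inputs rather than the authors' evaluation code:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_wordsim(pairs, embeddings):
    """pairs: iterable of (word1, word2, human_score); embeddings: dict word -> np.ndarray."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:   # out-of-vocabulary pairs are skipped
            v1, v2 = embeddings[w1], embeddings[w2]
            cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            model_scores.append(cosine)
            human_scores.append(gold)
    return spearmanr(model_scores, human_scores).correlation
```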

dadelani commented 4 years ago

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

https://arxiv.org/pdf/2003.11080.pdf

Short description: Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages (including Swahili & Yoruba) and 9 tasks.

dadelani commented 4 years ago

On the Cross-lingual Transferability of Monolingual Representations

https://arxiv.org/abs/1910.11856

Short description: State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages giving rise to deep multilingual abstractions. We evaluate this hypothesis by designing an alternative approach that transfers a monolingual model to new languages at the lexical level. More concretely, we first train a transformer-based masked language model on one language, and transfer it to a new language by learning a new embedding matrix with the same masked language modeling objective, freezing parameters of all other layers. This approach does not rely on a shared vocabulary or joint training. However, we show that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD). Our results contradict common beliefs of the basis of the generalization ability of multilingual models and suggest that deep monolingual models learn some abstractions that generalize across languages. We also release XQuAD as a more comprehensive cross-lingual benchmark, which comprises 240 paragraphs and 1190 question-answer pairs from SQuAD v1.1 translated into ten languages by professional translators.
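A hedged sketch of the transfer step as described (new embedding matrix, everything else frozen), using Hugging Face transformers as a stand-in for the authors' own code; the checkpoint name and vocabulary size are placeholders:

```python
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-cased")  # placeholder monolingual MLM

# Freeze every pretrained parameter ...
for param in model.parameters():
    param.requires_grad = False

# ... then swap in a fresh embedding matrix sized for the new language's vocabulary
# (resize_token_embeddings also keeps the tied MLM output layer consistent).
new_vocab_size = 30000  # assumption: size of the new language's subword vocabulary
model.resize_token_embeddings(new_vocab_size)

embeddings = model.get_input_embeddings()
torch.nn.init.normal_(embeddings.weight, mean=0.0, std=model.config.initializer_range)
embeddings.weight.requires_grad = True
# Only these (tied) embedding weights now receive gradients; they are trained
# with the same masked-LM objective on text in the new language.
```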

hadyelsahar commented 4 years ago

A Controllable Model of Grounded Response Generation

https://arxiv.org/pdf/2005.00613.pdf

Summary: Attempts to boost informativeness alone come at the expense of factual accuracy, as attested by GPT-2’s propensity to “hallucinate” facts. While this may be mitigated by access to background knowledge, there is scant guarantee of relevance and informativeness in generated responses. We propose a framework that we call controllable grounded response generation (CGRG), in which lexical control phrases are either provided by a user or automatically extracted by a content planner from dialogue context and grounding knowledge.

jaderabbit commented 4 years ago

MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer

https://arxiv.org/abs/2005.00052

Abstract: The main goal behind state-of-the-art pretrained multilingual models such as multilingual BERT and XLM-R is enabling and bootstrapping NLP applications in low-resource languages through zero-shot or few-shot cross-lingual transfer. However, due to limited model capacity, their transfer performance is the weakest exactly on such low-resource languages and languages unseen during pretraining. We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations. In addition, we introduce a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language. MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and achieves competitive results on question answering.
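For intuition only, a generic bottleneck adapter of the kind MAD-X stacks per language and per task (PyTorch; the dimensions are assumptions, and the invertible adapters are not shown):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck trained while the pretrained model stays frozen.

    MAD-X learns one such adapter per language and one per task; at inference the
    language adapter can be swapped to transfer a task to an unseen language.
    """

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 48):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)   # down-projection
        self.up = nn.Linear(bottleneck_size, hidden_size)     # up-projection
        self.activation = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection around the bottleneck.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))
```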

keleog commented 4 years ago

mBART - Multilingual Denoising Pre-training for Neural Machine Translation https://arxiv.org/abs/2001.08210

Abstract: This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART -- a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
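To make "denoising" concrete: the seq2seq model is trained to reconstruct the original text from a corrupted version. Below is a toy corruption function; it is not mBART's exact noise (which masks Poisson-length spans and also permutes sentences), just the general shape of the idea:

```python
import random

def mask_spans(tokens, mask_token="<mask>", start_prob=0.1, mean_span=3):
    """Replace random spans of tokens with a single mask token (toy version)."""
    corrupted, i = [], 0
    while i < len(tokens):
        if random.random() < start_prob:
            span_len = max(1, round(random.expovariate(1.0 / mean_span)))
            corrupted.append(mask_token)   # the whole span collapses to one mask token
            i += span_len
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted

source = "we present mBART a sequence to sequence denoising auto encoder".split()
print(mask_spans(source))   # the model is trained to generate `source` from this corrupted input
```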

keleog commented 4 years ago

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework

https://openreview.net/pdf?id=S1l-C0NtwS

Abstract: Learning multilingual representations of text has proven a successful method for many cross-lingual transfer learning tasks. There are two main paradigms for learning such representations: (1) alignment, which maps different independently trained monolingual representations into a shared space, and (2) joint training, which directly learns unified multilingual representations using monolingual and cross-lingual objectives jointly. In this paper, we first conduct direct comparisons of representations learned using both of these methods across diverse cross-lingual tasks. Our empirical results reveal a set of pros and cons for both methods, and show that the relative performance of alignment versus joint training is task-dependent. Stemming from this analysis, we propose a simple and novel framework that combines these two previously mutually-exclusive approaches. Extensive experiments demonstrate that our proposed framework alleviates limitations of both approaches, and outperforms existing methods on the MUSE bilingual lexicon induction (BLI) benchmark. We further show that this framework can generalize to contextualized representations such as Multilingual BERT, and produces state-of-the-art results on the CoNLL cross-lingual NER benchmark.

Jamiil92 commented 4 years ago

Word Translation Without Parallel Data

https://arxiv.org/pdf/1710.04087.pdf

Abstract: State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, on which there only exists a limited amount of parallel data, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available.
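The refinement step in this family of methods is an orthogonal Procrustes problem; a minimal numpy sketch with made-up embeddings and a hypothetical seed dictionary (the paper's adversarial initialization and CSLS retrieval are not shown):

```python
import numpy as np

def procrustes(src_emb, tgt_emb, pairs):
    """Learn an orthogonal map W so that W @ src aligns with tgt on dictionary pairs."""
    X = src_emb[[i for i, _ in pairs]]   # source vectors of dictionary entries
    Y = tgt_emb[[j for _, j in pairs]]   # corresponding target vectors
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt                        # closed-form solution with W constrained to be orthogonal

src = np.random.randn(5000, 300)   # toy monolingual embeddings
tgt = np.random.randn(5000, 300)
seed_pairs = [(i, i) for i in range(1000)]   # hypothetical (induced or seed) dictionary
W = procrustes(src, tgt, seed_pairs)
mapped = src @ W.T   # mapped source embeddings now live in the target space
```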

hadyelsahar commented 4 years ago

GPT-3: Language Models are Few-Shot Learners https://arxiv.org/pdf/2005.14165.pdf

Are the computation costs worth it? I think this paper can raise interesting discussions beyond the hype.

Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation.
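"Specified purely via text interaction" means the task description and demonstrations are packed into the prompt itself; a schematic illustration (the `generate` call is a placeholder, not a real API):

```python
# Few-shot prompting: the "training examples" are simply part of the input text,
# and no gradient update ever happens.
prompt = (
    "Translate English to French:\n"
    "sea => mer\n"
    "cat => chat\n"
    "cheese =>"
)
# completion = language_model.generate(prompt)   # placeholder for an actual model call
print(prompt)
```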

keleog commented 4 years ago

Unsupervised Domain Adaptation for Neural Machine Translation with Iterative Back Translation Link - https://arxiv.org/abs/2001.08140

Why? - I feel like this represents an easy way to possibly generalize our niche religious MT models.

Abstract: State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on domains with little supervised data. As data collection is expensive and infeasible in many cases, unsupervised domain adaptation methods are needed. We apply an Iterative Back Translation (IBT) training scheme on in-domain monolingual data, which repeatedly uses a Transformer-based NMT model to create in-domain pseudo-parallel sentence pairs in one translation direction on the fly and then uses them to train the model in the other direction. Evaluated on three domains of German-to-English translation task with no supervised data, this simple technique alone (without any out-of-domain parallel data) can already surpass all previous domain adaptation methods—up to +9.48 BLEU over the strongest previous method, and up to +27.77 BLEU over the unadapted baseline. Moreover, given available supervised out-of-domain data on German-to-English and Romanian-to-English language pairs, we can further enhance the performance and obtain up to +19.31 BLEU improvement over the strongest baseline, and +47.69 BLEU increment against the unadapted model.
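A schematic of the IBT loop as described in the abstract; `translate` and `train_step` are placeholders for a real NMT toolkit's decode and update functions:

```python
def iterative_back_translation(model_fwd, model_bwd, mono_src, mono_tgt,
                               translate, train_step, rounds=3):
    """Schematic IBT loop: each model creates pseudo-parallel data for the other.

    `translate(model, sentences)` and `train_step(model, src, tgt)` stand in for
    whatever decoding and training functions the underlying toolkit provides.
    """
    for _ in range(rounds):
        for tgt_batch in mono_tgt:
            pseudo_src = translate(model_bwd, tgt_batch)   # back-translate target monolingual data on the fly
            train_step(model_fwd, pseudo_src, tgt_batch)   # train the forward model on pseudo-parallel pairs
        for src_batch in mono_src:
            pseudo_tgt = translate(model_fwd, src_batch)   # and vice versa for the backward model
            train_step(model_bwd, pseudo_tgt, src_batch)
    return model_fwd, model_bwd
```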

poppingtonic commented 4 years ago

Understanding Cross-Lingual Syntactic Transfer in Multilingual Recurrent Neural Networks

Link: https://arxiv.org/abs/2003.14056

Abstract: It is now established that modern neural language models can be successfully trained on multiple languages simultaneously without changes to the underlying architecture, providing an easy way to adapt a variety of NLP models to low-resource languages. But what kind of knowledge is really shared among languages within these models? Does multilingual training mostly lead to an alignment of the lexical representation spaces or does it also enable the sharing of purely grammatical knowledge? In this paper we dissect different forms of cross-lingual transfer and look for its most determining factors, using a variety of models and probing tasks. We find that exposing our language models to a related language does not always increase grammatical knowledge in the target language, and that optimal conditions for lexical-semantic transfer may not be optimal for syntactic transfer.

keleog commented 4 years ago

Enhancing Machine Translation with Dependency-Aware Self-Attention

Link - https://arxiv.org/abs/1909.03149

Abstract: Most neural machine translation models only rely on pairs of parallel sentences, assuming syntactic information is automatically learned by an attention mechanism. In this work, we investigate different approaches to incorporate syntactic knowledge in the Transformer model and also propose a novel, parameter-free, dependency-aware self-attention mechanism that improves its translation quality, especially for long sentences and in low-resource scenarios. We show the efficacy of each approach on WMT English↔German and English→Turkish, and WAT English→Japanese translation tasks.

bduvenhage commented 4 years ago

Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext

Link - https://www.aclweb.org/anthology/D17-1026.pdf github - https://github.com/jwieting/emnlp2017

Abstract: We consider the problem of learning general-purpose, paraphrastic sentence embeddings in the setting of Wieting et al. (2016b). We use neural machine translation to generate sentential paraphrases via back-translation of bilingual sentence pairs. We evaluate the paraphrase pairs by their ability to serve as training data for learning paraphrastic sentence embeddings. We find that the data quality is stronger than prior work based on bitext and on par with manually-written English paraphrase pairs, with the advantage that our approach can scale up to generate large training sets for many languages and domains. We experiment with several language pairs and data sources, and develop a variety of data filtering techniques. In the process, we explore how neural machine translation output differs from human-written sentences, finding clear differences in length, the amount of repetition, and the use of rare words.

jaderabbit commented 4 years ago

Balancing Training for Multilingual Neural Machine Translation

Abstract: When training multilingual machine translation (MT) models that can translate to/from multiple languages, we are faced with imbalanced training sets: some languages have much more training data than others. Standard practice is to up-sample less resourced languages to increase representation, and the degree of up-sampling has a large effect on the overall performance. In this paper, we propose a method that instead automatically learns how to weight training data through a data scorer that is optimized to maximize performance on all test languages. Experiments on two sets of languages under both one-to-many and many-to-one MT settings show our method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over the performance of which languages are optimized.

Value: Because multilingual methods are so sensitive to sampling, I think an approach like this would be amazing.

https://arxiv.org/abs/2004.06748
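For context, the heuristic that this method replaces is usually fixed temperature-based up-sampling of low-resource languages; a quick sketch of that baseline with toy corpus sizes (the paper's learned data scorer adjusts the weighting dynamically instead):

```python
import numpy as np

def temperature_sampling_probs(sizes, T=5.0):
    """Standard heuristic: sample language i proportionally to its corpus share raised to 1/T."""
    sizes = np.asarray(sizes, dtype=float)
    p = sizes / sizes.sum()
    p = p ** (1.0 / T)        # T > 1 flattens the distribution, up-sampling small corpora
    return p / p.sum()

# e.g. corpus sizes for three languages (toy numbers)
print(temperature_sampling_probs([10_000_000, 500_000, 50_000]))
```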

hadyelsahar commented 4 years ago

What Kind of Language Is Hard to Language-Model? ACL19 https://arxiv.org/pdf/1906.04726.pdf

Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus.

keleog commented 4 years ago

Transferring Inductive Biases through Knowledge Distillation

Abstract: Having the right inductive biases can be crucial in many tasks or scenarios where data or computing resources are a limiting factor, or where training data is not perfectly representative of the conditions at test time. However, defining, designing and efficiently adapting inductive biases is not necessarily straightforward. In this paper, we explore the power of knowledge distillation for transferring the effect of inductive biases from one model to another. We consider families of models with different inductive biases, LSTMs vs. Transformers and CNNs vs. MLPs, in the context of tasks and scenarios where having the right inductive biases is critical. We study how the effect of inductive biases is transferred through knowledge distillation, in terms of not only performance but also different aspects of converged solutions.
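For reference, the distillation mechanism itself is the standard soft-target objective; a generic PyTorch sketch, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable to ordinary cross-entropy.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)
```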

hadyelsahar commented 4 years ago

Biases in Pretrained Language Models

The Woman Worked as a Babysitter: On Biases in Language Generation EMNLP2019 https://www.aclweb.org/anthology/D19-1339.pdf

StereoSet: Measuring stereotypical bias in pretrained language models https://arxiv.org/pdf/2004.09456.pdf and a recent competition: https://stereoset.mit.edu/

dnzengou commented 4 years ago

On bias in ML models (sparked by a conversation between Timnit Gebru and Yann LeCun)

Disclosure: It's my first time sharing reading materials here. Apologies if this is not relevant to this thread!

Initially shared by @rajiinio (twitter.com/rajiinio), on representation in datasets (and its limits) 👇

"No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World" https://research.google/pubs/pub46553/

"ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases" https://arxiv.org/abs/1711.11443

Newer papers:

"Does Object Recognition Work for Everyone?" https://arxiv.org/abs/1906.02659

"Predictive Inequity in Object Detection" https://arxiv.org/abs/1902.11097

"Gender Shades" http://gendershades.org (+ "Actionable Auditing" https://dl.acm.org/doi/10.1145/3306618.3314244, "Saving Face" https://arxiv.org/abs/2001.00964)

Related:

"Machine Learning and Health Care Disparities in Dermatology" https://jamanetwork.com/journals/jamadermatology/article-abstract/2688587

"Garbage In, Garbage Out: Face Recognition on Flawed Data" https://law.georgetown.edu/privacy-technology-center/publications/garbage-in-garbage-out-face-recognition-on-flawed-data/

"Excavating AI: The Politics of Images in Machine Learning Training Sets"

+ sources about the limitations of fairness as well. "Where fairness fails: data, algorithms, and the limits of anti-discrimination discourse" https://tandfonline.com/doi/abs/10.1080/1369118X.2019.1573912

+ @timnitGebru & @cephaloponderer's tutorial on Fairness, Accountability, Transparency and Ethics in Computer Vision, presented at CVPR 2020 by Dr Timnit Gebru and Emily Denton (Google):

"Part 1 Computer vision in practice: who is benefiting and who is being harmed?" https://youtu.be/0sBE5OyD7fk Slides: https://doc-0k-b0-docs.googleusercontent.com/docs/securesc/6jfpn6rcfuvivo8iprn1bbqevoft12pi/9u1b6t998gecojq84ga33m2tl8t8mqna/1593070500000/09064895144371962914/03942165428508745360/1rcG8KVmjRUWWNSg-R6cTBlAScP9UkCJp

+ "Part 2 Data ethics" Slides: https://doc-0k-b0-docs.googleusercontent.com/docs/securesc/6jfpn6rcfuvivo8iprn1bbqevoft12pi/c0urbvmn857v1ite54ltv639obbkv93f/1593070950000/09064895144371962914/03942165428508745360/1IvUgCTUciIJQ-dIqQAYNO11X3guzqnYN

+ "Part 3 Towards more socially responsible and ethics-informed research practices" Slides: https://doc-0o-b0-docs.googleusercontent.com/docs/securesc/6jfpn6rcfuvivo8iprn1bbqevoft12pi/1dbch65q7ao55sndaesem0u13gsoj9pp/1593070875000/09064895144371962914/03942165428508745360/1vyXysJVGmn72AxOuEKPAa8moi1lBmzGc?e=download&authuser=0&nonce=a0tf868mmvnii&user=03942165428508745360&hash=psp4b8k1fafmoaf029grqmafg7i8mc4d


chrisemezue commented 4 years ago

Predicting Performance for Natural Language Processing Tasks

Link: https://www.aclweb.org/anthology/2020.acl-main.764.pdf

Abstract: Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.
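The core recipe is "featurize each experimental setting, then fit a regressor on past results"; a toy scikit-learn sketch with random stand-in features and scores (not the authors' feature set or model):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features of past experiments, e.g. log dataset size, a language-distance
# score, subword vocabulary overlap, plus the evaluation score each run obtained.
X_past = np.random.rand(200, 3)
y_past = np.random.rand(200)

predictor = GradientBoostingRegressor().fit(X_past, y_past)

# Predict how well a model would do in an unseen setting without training or testing it.
X_new = np.array([[0.4, 0.7, 0.1]])
print(predictor.predict(X_new))
```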

hadyelsahar commented 4 years ago

Decolonial AI: Decolonial Theory as Sociotechnical Foresight in Artificial Intelligence #11 https://arxiv.org/pdf/2007.04068.pdf proposed by @elevsev

Summary: This paper explores the important role of critical science, and in particular of post-colonial and decolonial theories, in understanding and shaping the ongoing advances in artificial intelligence. Artificial Intelligence (AI) is viewed as amongst the technological advances that will reshape modern societies and their relations. Whilst the design and deployment of systems that continually adapt holds the promise of far-reaching positive change, they simultaneously pose significant risks, especially to already vulnerable peoples. Values and power are central to this discussion. Decolonial theories use historical hindsight to explain patterns of power that shape our intellectual, political, economic, and social world. By embedding a decolonial critical approach within its technical practice, AI communities can develop foresight and tactics that can better align research and technology development with established ethical principles, centring vulnerable peoples who continue to bear the brunt of negative impacts of innovation and scientific progress. We highlight problematic applications that are instances of coloniality, and using a decolonial lens, submit three tactics that can form a decolonial field of artificial intelligence: creating a critical technical practice of AI, seeking reverse tutelage and reverse pedagogies, and the renewal of affective and political communities. The years ahead will usher in a wave of new scientific breakthroughs and technologies driven by AI research, making it incumbent upon AI communities to strengthen the social contract through ethical foresight and the multiplicity of intellectual perspectives available to us; ultimately supporting future technologies that enable greater well-being, with the goal of beneficence and justice for all.

orevaahia commented 4 years ago

Towards Ecologically Valid Research on Language User Interfaces

Link: https://arxiv.org/pdf/2007.14435.pdf

Abstract: Language User Interfaces (LUIs) could improve human-machine interaction for a wide variety of tasks, such as playing music, getting insights from databases, or instructing domestic robots. In contrast to traditional hand-crafted approaches, recent work attempts to build LUIs in a data-driven way using modern deep learning methods. To satisfy the data needs of such learning algorithms, researchers have constructed benchmarks that emphasize the quantity of collected data at the cost of its naturalness and relevance to real-world LUI use cases. As a consequence, research findings on such benchmarks might not be relevant for developing practical LUIs. The goal of this paper is to bootstrap the discussion around this issue, which we refer to as the benchmarks’ low ecological validity. To this end, we describe what we deem an ideal methodology for machine learning research on LUIs and categorize five common ways in which recent benchmarks deviate from it. We give concrete examples of the five kinds of deviations and their consequences. Lastly, we offer a number of recommendations as to how to increase the ecological validity of machine learning research on LUIs.