nasa-petal / data-collection-and-prep

Starting with a list of URLs of papers that can be used for crowdsourcing, create a CSV file with the URL, DOI of the paper, Title, Abstract, and whether the paper is open access.

Explore data augmentation techniques to expand training dataset. #98

Open · bruffridge opened this issue 3 years ago

bruffridge commented 3 years ago

https://neptune.ai/blog/data-augmentation-nlp

bruffridge commented 3 years ago

Additionally, we introduce the technique of data augmentation, which artificially grows the dataset by adding slightly perturbed copies of existing documents. We use an existing library, nlpaug, to perform data augmentation on our dataset, replacing random words in the titles and abstracts with synonyms chosen by WordNet graph distance. In Figure 10, we examine the relationship between the augmentation factor (i.e., the number of perturbed copies of each paper in the enlarged dataset, including the original paper) and the performance of MATCH, but in the limited data augmentation testing we performed, the performance of MATCH did not increase significantly on any metric. We also consider balance-aware data augmentation, which aims to address label imbalance in a hierarchical multilabel classification problem, further in the Future Directions section of this report.

[Figure 10: MATCH performance vs. augmentation factor]
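As a concrete illustration, a minimal nlpaug sketch along the lines described above might look like the following. The augmentation factor of 3, the `aug_p` value, and the `augment_paper` helper are illustrative assumptions, not taken from the actual pipeline:

```python
# Minimal sketch of WordNet-synonym augmentation with nlpaug.
# Assumes: pip install nlpaug nltk, plus the WordNet corpora
# (nltk.download('wordnet'), nltk.download('averaged_perceptron_tagger')).
import nlpaug.augmenter.word as naw

# Replace random words with WordNet synonyms; aug_p controls the
# fraction of words perturbed per copy (the value here is illustrative).
aug = naw.SynonymAug(aug_src='wordnet', aug_p=0.1)

def augment_paper(title, abstract, factor=3):
    """Return `factor` copies of (title, abstract), including the original.

    Note: in recent nlpaug versions, augment() returns a list of strings;
    older versions returned a single string.
    """
    copies = [(title, abstract)]
    for _ in range(factor - 1):
        copies.append((aug.augment(title)[0], aug.augment(abstract)[0]))
    return copies
```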

Balance-Aware Data Augmentation

One problem that plagues many classification systems is label imbalance: labels occur at different base rates in nature. For example, there may be more papers about the biomimicry function “protect from temperature” than about the biomimicry function “chemically break down inorganic compounds.” If the ratio between label frequencies is too extreme, and the loss function is binary cross-entropy, the classifier may be discouraged from ever predicting the rarer labels, because it can achieve high accuracy by predicting only the most common ones. One solution is to oversample the less common labels. The data augmentation scheme above merely perturbed copies of the entire dataset, preserving the distribution of labels. In contrast, we also tested a balance-aware data augmentation scheme that oversamples the less common labels. Because each paper has multiple labels, each at a different level of the PeTaL taxonomy, this problem is nontrivial. Our tentative solution is to compute a rareness score r(p) for each paper. After counting the occurrences of each label, we compute the rareness score as follows.

[equation image: rareness score r(p)]

In this equation, Lp is the set of labels in the paper p, count(l) is the number of occurrences of the label l within the entire PeTaL dataset, and α and β are tunable free parameters. We then augment each paper by a factor of r(p), rounded to the nearest integer; that is, we introduce that many perturbed copies of the paper into the dataset. Intuitively, although some papers bear both low-frequency and high-frequency labels, we attempt to boost the prominence of low-frequency labels while refraining from augmenting the higher-frequency labels too much. Although our initial experiments with balance-aware data augmentation did not improve the precision of MATCH over conventional data augmentation, we have yet to explore all parameter settings and consider this another direction for future research.
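Since the equation itself survives only as an image, here is a hedged sketch of how such a rareness score and the resulting per-paper augmentation counts could be computed. The specific functional form below, r(p) = α · (min over l in Lp of count(l))^(−β), which drives the score by the paper's rarest label, is an assumption for illustration; the report's exact formula may differ, and the α and β values are placeholders.

```python
from collections import Counter

def rareness_scores(papers, alpha=4.0, beta=0.5):
    """papers: list of label sets L_p, one per paper.

    Assumed form (not the report's confirmed formula):
    r(p) = alpha * (count of p's rarest label) ** (-beta).
    """
    counts = Counter(label for labels in papers for label in labels)
    return [alpha * min(counts[l] for l in labels) ** (-beta)
            for labels in papers]

def augmentation_factors(papers, alpha=4.0, beta=0.5):
    # Augment each paper by r(p) rounded to the nearest integer (at least 1).
    return [max(1, round(r)) for r in rareness_scores(papers, alpha, beta)]

# Toy example: the paper bearing the rare label gets the largest factor.
papers = [{"protect from temperature"}] * 9 + \
         [{"chemically break down inorganic compounds"}]
print(augmentation_factors(papers))  # -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 4]
```

With β controlling how sharply rarity is rewarded and α scaling the overall amount of augmentation, a paper whose rarest label appears often receives few perturbed copies, while a paper carrying a rare label is oversampled, which matches the intuition described in the paragraph above.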