howardyclo / Kaggle-Quora-Question-Pairs

This is our team's solution report, which achieves top 10% (305/3307) in this competition.

Kaggle Competition: Quora Question Pairs

Video Version

Abstract

In the Quora Question Pairs competition, we were challenged to tackle a natural language processing (NLP) problem: given pairs of questions, classify whether each pair is a duplicate or not. This report describes our team's solution, which achieves top 10% (305/3307) in this competition.

Disclaimer

Some parts of our solution are referenced from the kernels and discussions. We really appreciate those amazing people in the Kaggle community who shared their knowledge selflessly.

Team Members

Dataset and Evaluation Metric

The dataset is released by Quora, a well-known platform for gaining and sharing knowledge about anything. Let's have a glance at the dataset:

The evaluation metric in this competition is log loss (i.e., cross-entropy loss for binary classification):
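For reference, the binary log loss over $N$ question pairs with true labels $y_i \in \{0, 1\}$ and predicted duplicate probabilities $p_i$ is

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\big]$$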

Main Tools

We use several popular NLP tools to help us accomplish the necessary NLP tasks:

Method

Text Preprocessing

Here are the NLP preprocessing steps we performed in detail. We thought that appropriate preprocessing might help improve performance, but we also found that "over-cleaning" the data can lead to information loss:
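As a rough illustration only (the exact rules we used differ; the function and the contraction table below are a hypothetical sketch, not our code), a typical cleaning pass looks like the following. It is also easy to see how an overly aggressive version of it could throw away useful signal such as casing or punctuation.

```python
import re

# Illustrative contraction table; a real one would be much larger.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not",
                "'re": " are", "'s": " is", "'ve": " have", "'ll": " will"}

def clean_question(text: str) -> str:
    """Minimal sketch of text cleaning: lowercase, expand contractions,
    strip punctuation, and collapse whitespace."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_question("What's the best way to learn ML?"))
# -> "what is the best way to learn ml"
```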

Feature Engineering

Modeling

We tried dozens of model architectures and parameter tunings. Here we only report the models that worked for us.

Data Augmentation

Data augmentation is a simple and helpful approach, but it is risky in some cases. In this competition, for example, the proportion of positive examples in our training data is 37%. If we upsample the positive examples to 50%, we mismatch the actual class label distribution of the testing data (and, similarly, of the real world), which can introduce a positive bias into our model, especially a tree-based model like the XGBoost we used in this competition. So we mainly used the class label reweighting technique (described below) to address the imbalanced data problem; a sketch of the upsampling approach we decided against is shown below.
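For completeness, here is a minimal sketch of that upsampling approach (it assumes the competition's `train.csv` with its `is_duplicate` column; the code is illustrative, not something we shipped):

```python
import pandas as pd

# Duplicate positive pairs until they make up ~50% of the training data.
train = pd.read_csv("train.csv")                 # Quora train.csv layout
pos = train[train["is_duplicate"] == 1]
neg = train[train["is_duplicate"] == 0]

n_extra = len(neg) - len(pos)                    # positives needed to reach 50%
upsampled = pd.concat(
    [train, pos.sample(n_extra, replace=True, random_state=42)])
print(upsampled["is_duplicate"].mean())          # ~0.5, vs. ~0.37 originally
```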

Class Label Reweighting

Participants in this competition noticed that the proportion of positive examples differs between the training data (~37%) and the testing data (~17%). As a result, the evaluation score on our validation set and on the testing data can diverge. To address this problem, we assigned different class weights when training our models.

Note that we could not know the proportion of positive examples in the testing data in advance, but we can estimate it by submitting a constant prediction equal to the training-set mean of 0.369 for every test pair. After submitting to Kaggle, we got a score of 0.554. Using the log-loss formula, we can derive that the proportion of positive examples in the testing data is about 0.174. Finally, we reweight the positive class label by 0.174/0.369 and the negative class label by (1-0.174)/(1-0.369).
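As a sketch of this estimation (the numbers come from our submissions; the variable names are ours), solving the log-loss formula for the test-set positive rate and deriving the class weights looks like this:

```python
import numpy as np

p_train = 0.369          # constant prediction submitted for every test pair
observed_loss = 0.554    # public log loss returned by Kaggle

# For a constant prediction p, log loss = -(r*log(p) + (1 - r)*log(1 - p)),
# where r is the true positive rate of the test set. Solving for r:
r_test = (observed_loss + np.log(1 - p_train)) / (np.log(1 - p_train) - np.log(p_train))
print(f"estimated test positive rate: {r_test:.3f}")   # ~0.174

# Per-class weights that make the training distribution match the test set.
w_pos = r_test / p_train               # ~0.47
w_neg = (1 - r_test) / (1 - p_train)   # ~1.31
```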

Training and Ensembling

We split the ~400,000 training examples into a ~360,000-example training set and a ~40,000-example validation set.

We further split the original ~40,000-example validation set into a ~28,000-example training set and a ~12,000-example validation set for training our ensembling models, as sketched below.
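A minimal sketch of this nested split, assuming a feature matrix `X` and label vector `y` have already been built (the use of scikit-learn's `train_test_split` and the exact ratios are illustrative, not our exact code):

```python
from sklearn.model_selection import train_test_split

# First split: ~360k for the base models, ~40k held out.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)

# Second split of the held-out ~40k: ~28k to train the ensembler, ~12k to
# validate it, so the second-level model never sees base-model training data.
X_ens_train, X_ens_val, y_ens_train, y_ens_val = train_test_split(
    X_val, y_val, test_size=0.3, random_state=42, stratify=y_val)
```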

However, if we want to do ensembling, we should not train every base model on the same ~360,000-example training set; doing so makes the ensemble overfit to that training set very rapidly and lose generalizability.

Pros:

Cons:

We also tried traditional stacking with an LSTM directly, but we found that it overfit severely and did not generalize: the validation loss was only ~0.11, while the Kaggle private log loss was ~0.16.

Submission Score

We finally achieved top 10% (319/3394) in this competition.

(Note: The public log loss is calculated with approximately 35% of the testing data. The private log loss is based on the other 65%. The Kaggle competition's ranking score is based on the private log loss.)