In the Quora Question Pairs competition, we were challenged to tackle a natural language processing (NLP) problem: given a pair of questions, classify whether the two questions are duplicates of each other. This report describes our team's solution, which achieved top 10% (305/3307) in this competition.
Some parts of our solution are adapted from public kernels and discussion threads. We really appreciate the amazing people in the Kaggle community who shared their knowledge selflessly.
The dataset was released by Quora, a well-known platform for gaining and sharing knowledge about anything. Let's take a glance at the dataset:
The evaluation metric in this competition is log loss (or cross-entropy loss in binary classification):
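In our notation, with $N$ question pairs, ground-truth labels $y_i \in \{0, 1\}$ and predicted duplicate probabilities $\hat{y}_i$, the metric can be written as:

```latex
\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \Big]
```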
We use several popular NLP tools to help us accomplish the necessary NLP tasks:
Here are the NLP preprocessing steps we performed in detail. We expected that appropriate preprocessing would improve performance, but we also found that "over-cleaning" the data can lead to information loss:
Text cleaning: Tokenization, converting all tokens to lower case, removing punctuation, special tokens, etc.
Text normalization: Expanding contractions (e.g. "What's" to "What is", "We're" to "We are", etc.) using regular expressions, replacing number-like strings with "number", and replacing currency symbols with the "USD" abbreviation (a short sketch of these cleaning and normalization steps follows this list).
Name entity recognition (NER):
Automatic spelling error correction:
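As an illustration of the cleaning and normalization steps above, here is a minimal sketch. The contraction list is only a small sample of the patterns we handle, the function name and exact regular expressions are ours, and tokenization itself (handled by the NLP tools listed earlier) is omitted.

```python
import re

# Illustrative subset of the contraction patterns; the full list is much larger.
CONTRACTIONS = [
    (r"\bwhat's\b", "what is"),
    (r"\bwe're\b", "we are"),
    (r"\bcan't\b", "can not"),
    (r"n't\b", " not"),
]

def normalize(text: str) -> str:
    """Lower-case, expand contractions, and replace numbers/currency symbols."""
    text = text.lower()
    for pattern, repl in CONTRACTIONS:
        text = re.sub(pattern, repl, text)
    text = re.sub(r"[$€£¥]", " USD ", text)             # currency symbols -> "USD"
    text = re.sub(r"\b\d[\d,\.]*\b", " number ", text)  # number-like strings -> "number"
    text = re.sub(r"[^\w\s]", " ", text)                # strip remaining punctuation
    return re.sub(r"\s+", " ", text).strip()

print(normalize("What's the cheapest phone under $200?"))
# -> "what is the cheapest phone under USD number"
```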
Basic text features:
Syntactic features: We convert each sentence to a POS tag sequence and a dependency tag sequence. For example, "Can I recover my email if i forgot the password" becomes the POS tag sequence "VERB PRON VERB ADJ NOUN ADP PRON VERB DET NOUN" and the dependency tag sequence "aux nsubj ROOT poss dobj mark nsubj advcl det dobj". We then compute basic features on these sequences to generate a set of new features. The position of "ROOT" is also used as a feature.
Latent semantic embedding features: Character and word N-gram (N=1~3) count matrices, and TF-IDF weight matrices with L1 and L2 normalization. We also applied SVD to the TF-IDF weight matrices to perform dimensionality reduction (a combined sketch of these and the embedding features below appears after this feature list).
Neural embedding features: Word2Vec pretrained on Google News, and GloVe and FastText pretrained on Wikipedia. We sum all the word embeddings of a sentence and normalize the result to a unit vector to represent the sentence (a so-called "sentence embedding"). In our case, the DL models performed better with embeddings pretrained on Wikipedia than on Google News. We finally chose FastText embeddings as the input features for our DL models, since FastText was trained on a larger Wikipedia corpus than GloVe.
Embedding distance features: We also compute various distance metrics, such as cosine, cityblock, Canberra, Euclidean, Minkowski, and Bray-Curtis distances for both Word2Vec and FastText sentence embeddings, as well as the word mover's distance for Word2Vec only.
Embedding skewness and kurtosis: Skewness is a measure of the asymmetry of a distribution. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
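As a rough illustration of how these embedding features fit together, here is a minimal sketch combining the TF-IDF+SVD, sentence-embedding, distance, and skewness/kurtosis computations. The vector path ("wiki.en.vec") and the function names are hypothetical placeholders, the word mover's distance call assumes gensim with its optional pyemd/POT dependency installed, and computing skew/kurtosis over the components of each sentence embedding is our assumption about how those statistics were taken.

```python
import numpy as np
from scipy import stats
from scipy.spatial import distance
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import KeyedVectors

# --- Latent semantic features: n-gram TF-IDF followed by SVD ----------------
def tfidf_svd_features(questions, n_components=100):
    # analyzer="char_wb" would give the character n-gram variant instead.
    tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), norm="l2")
    weights = tfidf.fit_transform(questions)            # sparse TF-IDF matrix
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(weights)                   # dense, low-dimensional

# --- Neural sentence embedding: sum word vectors, normalize to unit length --
vectors = KeyedVectors.load_word2vec_format("wiki.en.vec")  # hypothetical path

def sentence_embedding(tokens):
    vecs = [vectors[t] for t in tokens if t in vectors]
    if not vecs:
        return np.zeros(vectors.vector_size)
    summed = np.sum(vecs, axis=0)
    return summed / (np.linalg.norm(summed) + 1e-12)    # unit vector

# --- Per-pair embedding features ---------------------------------------------
def pair_features(tokens1, tokens2):
    e1, e2 = sentence_embedding(tokens1), sentence_embedding(tokens2)
    return {
        "cosine": distance.cosine(e1, e2),
        "cityblock": distance.cityblock(e1, e2),
        "canberra": distance.canberra(e1, e2),
        "euclidean": distance.euclidean(e1, e2),
        "minkowski": distance.minkowski(e1, e2, 3),
        "braycurtis": distance.braycurtis(e1, e2),
        "wmd": vectors.wmdistance(tokens1, tokens2),     # word mover's distance
        "skew_q1": stats.skew(e1),
        "skew_q2": stats.skew(e2),
        "kurtosis_q1": stats.kurtosis(e1),
        "kurtosis_q2": stats.kurtosis(e2),
    }
```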
Other "magic" features (or "competition-oriented hacking features"):
Word correction count: This should be a strong feature, since some of the typos in the data are machine-generated and share patterns among them. But it took far too much time to compute, so we only tried it on the training data and ended up giving up because of the expensive computation required on the testing data.
Question ID (qid): Originally, qid is only an identifier for a specific question. But one participant found that qid is actually a time-dependent serial number (by searching on Quora directly). Furthermore, the duplicate label ratio becomes lower and lower as time passes. The hypothesis is that Quora itself was also trying to reduce duplicated questions during this period. So, qid becomes a strong feature.
Number of question occurrences: One participant suggested that duplicated questions tend to come from the same group of questions. In other words, the more times a question appears in the training set, the more probable it is that a pair containing that question is a duplicate, regardless of which question it is paired with (see the sketch after this list).
And other features shared by the Kaggle community.
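A minimal sketch of the question-frequency feature described above, assuming the data are loaded as pandas DataFrames with "question1"/"question2" columns; the column and function names here are ours.

```python
import pandas as pd

def add_frequency_features(train: pd.DataFrame, test: pd.DataFrame):
    """Count how often each question text appears across train + test."""
    all_questions = pd.concat([
        train["question1"], train["question2"],
        test["question1"], test["question2"],
    ])
    counts = all_questions.value_counts()

    for df in (train, test):
        df["q1_freq"] = df["question1"].map(counts)
        df["q2_freq"] = df["question2"].map(counts)
        df["freq_diff"] = (df["q1_freq"] - df["q2_freq"]).abs()
    return train, test
```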
It is worth mentioning that these magic features ended up helping a lot. Though we call them "magic", we think they basically capture some hidden phenomena behind the data, which makes them strong features for this task.
We tried dozens of model architectures and parameter settings. Here we only report the models that worked for us.
XGBoost with all features described in Feature Engineering. We excluded sentence embeddings from training simply because of the expensive computation caused by their large dimensionality. Moreover, XGBoost performs well enough with only the handcrafted features.
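Below is a minimal sketch of this setup with stand-in data; the hyperparameter values are illustrative, not our exact tuned settings.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Stand-in data: replace with the handcrafted feature matrix and labels.
X = np.random.rand(1000, 50)
y = np.random.randint(0, 2, size=1000)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=0)

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.05,
    "max_depth": 7,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

# Class-label reweighting (described later) can be applied via DMatrix's weight argument.
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

model = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, "train"), (dvalid, "valid")],
    early_stopping_rounds=50,
)
valid_pred = model.predict(dvalid)  # out-of-sample probabilities for validation
```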
Other models we've tried:
All of the models mentioned above result in a log loss of about 0.30~0.32 (without handcrafted features, using only neural word embeddings) and an accuracy of about 85%, which made us wonder whether the models had already used all the information we provided and were starting to overfit. So, we shifted our focus to feature engineering instead of trying fancier deep learning models.
A simple and helpful approach, but risky in some cases. For example, in this competition the proportion of positive examples is 37% in our training data. If we upsample the positive examples to 50%, it can be problematic because it mismatches the actual class label distribution in the testing data (or, similarly, in the real world), which may introduce a positive bias into our model, especially a tree-based model like the XGBoost we used in this competition. So we mainly use the class label reweighting technique (described below) to address the imbalanced data problem.
Can we simply upsample by pairing two identical questions together as a new training example?
NOT EXACTLY. No examples of this kind exist in the dataset. It might improve generalizability for a human, but not for this dataset, and hence not for this competition.
Do two duplicate pairs (A, B) and (B, C) imply that the pair (A, C) is also a duplicate?
NO. About 1% of the examples in the dataset are mislabeled, and both augmentation methods above disturb the original data distribution. There exist some (A, C) pairs that are labeled as non-duplicate.
Can we swap the order when concatenating question pair embeddings horizontally (e.g. from x1 = concate(q1, q2, axis=1) to x2 = concate(q2, q1, axis=1)) as a data augmentation technique when using neural networks?
YES. This works. Since x1 and x2 might yield different results when fed into the FC layer, we used this technique to balance our network training. Unfortunately, during the competition, we forgot to also swap the order of the external handcrafted features of the question pair (described in Feature Engineering) when feeding them into the FC layer along with the question pair embeddings, which might have hurt our model's performance.
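A minimal sketch of this augmentation, in our own notation; it also swaps the per-question handcrafted features consistently, which is the step we missed during the competition (pair-symmetric features such as distances do not need swapping).

```python
import numpy as np

def augment_by_swapping(q1_emb, q2_emb, q1_feats, q2_feats, y):
    """Double the training set by swapping the question order.

    q1_emb/q2_emb: sentence-embedding matrices, shape (n_samples, emb_dim).
    q1_feats/q2_feats: per-question handcrafted features, shape (n_samples, n_feats).
    """
    x_fwd = np.concatenate([q1_emb, q2_emb, q1_feats, q2_feats], axis=1)
    x_bwd = np.concatenate([q2_emb, q1_emb, q2_feats, q1_feats], axis=1)  # swap BOTH
    X = np.concatenate([x_fwd, x_bwd], axis=0)
    Y = np.concatenate([y, y], axis=0)  # the duplicate label is order-invariant
    return X, Y
```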
Participants in this competition noticed that the proportion of positive examples differs between the training data (~37%) and the testing data (~17%). That is, we would probably get different evaluation scores on our validation set and on the testing data. To address this problem, we assigned different class weights when training our models.
Note that we could not know the proportion of positive examples in the testing data in advance, but we can estimate it by submitting a constant prediction equal to the training mean, 0.369. After submitting to Kaggle, we got a score of 0.554. Plugging these into the log loss formula and solving for the positive ratio, we find that the proportion of positive examples in the testing data is about 0.174. Finally, we reweight the positive class label by 0.174/0.369 and the negative class label by (1-0.174)/(1-0.369).
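As a sanity check on these numbers, here is a small sketch of the estimate; the variable names are ours, and the derived weights can then be applied, for example, as per-sample weights during training.

```python
import numpy as np

train_pos_ratio = 0.369           # mean of the training labels
constant_submission_loss = 0.554  # Kaggle score for predicting 0.369 everywhere

# For a constant prediction c, log loss = -(p*log(c) + (1-p)*log(1-c)),
# where p is the (unknown) positive ratio of the testing data. Solve for p.
c = train_pos_ratio
p = (constant_submission_loss + np.log(1 - c)) / (np.log(1 - c) - np.log(c))
print(f"estimated test positive ratio: {p:.3f}")  # ~0.174

# Class weights that re-balance training toward the test distribution.
w_pos = p / train_pos_ratio              # 0.174 / 0.369      ~ 0.47
w_neg = (1 - p) / (1 - train_pos_ratio)  # (1-0.174)/(1-0.369) ~ 1.31
print(w_pos, w_neg)
```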
We split the ~400,000 training examples into a training set of ~360,000 examples and a validation set of ~40,000 examples.
We further split the original ~40,000 validation set into ~28,000 training set and ~12,000 validation set for training our ensembling models.
But when doing ensembling, we should not train the ensembling model on the same 360,000-example training set, since that makes it overfit to that training set very rapidly and lose generalizability (a sketch of this split follows).
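A minimal sketch of this nested split, assuming the training pairs are in a DataFrame with an "is_duplicate" column; stratifying by the label here is our assumption.

```python
from sklearn.model_selection import train_test_split

# df: the full ~400k-pair training DataFrame.
train_df, valid_df = train_test_split(
    df, test_size=0.1, random_state=0, stratify=df["is_duplicate"]
)  # ~360k for the base models, ~40k held out

# The ~40k hold-out set is split again so the ensembler is fit and validated
# only on data the base models never trained on.
ens_train_df, ens_valid_df = train_test_split(
    valid_df, test_size=0.3, random_state=0, stratify=valid_df["is_duplicate"]
)  # ~28k / ~12k
```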
Pros:
Cons:
We also tried traditional stacking with an LSTM directly, but we found that it overfits severely and does not generalize: the validation loss was only ~0.11, while the Kaggle private log loss was ~0.16.
We finally achieved top 10% (319/3394) in this competition.
(Note: The public log loss is calculated with approximately 35% of the testing data. The private log loss is based on the other 65%. The Kaggle competition's ranking score is based on the private log loss.)