google-research-datasets / paws

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

Regarding the BiLSTM baseline model stated in the PAWS paper #3

Open · AladarMiao opened this issue 5 years ago

AladarMiao commented 5 years ago

If I read the PAWS paper correctly, it states that BiLSTM+cosine similarity is one of the baseline models used to evaluate the PAWS dataset. I tried to replicate the experiment with a BiLSTM+cosine similarity model I designed, but its accuracy falls quite far short of the number reported in the paper. Is there somewhere I can see how you defined the BiLSTM+cosine similarity model? It would be really helpful for my current study on paraphrase identification. Thanks in advance!

yuanzh commented 5 years ago

Hi, sorry for the delay. Could you please specify which number in the paper you would like to compare to, and whether you got a lower or a higher accuracy number?

Regarding our model architecture, it's a standard BiLSTM with dropout = 0.2, hidden size = 256, ReLU activation, using the last/first state vectors of the forward/backward LSTM, and GloVe embeddings. What's your model configuration?
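For concreteness, a minimal PyTorch sketch of an encoder with that configuration might look like the following. This is illustrative only, not the exact code used for the paper; in particular, where the dropout is applied and whether the GloVe vectors are fine-tuned are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMEncoder(nn.Module):
    """Encodes a token sequence into a single 256-d vector.

    Hyperparameters follow the thread: hidden size 256, dropout 0.2,
    ReLU on the projection, embeddings initialized from GloVe.
    """
    def __init__(self, glove_weights, hidden_size=256, dropout=0.2):
        super().__init__()
        # glove_weights: FloatTensor of shape (vocab_size, embed_dim)
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.dropout = nn.Dropout(dropout)  # placement of dropout is an assumption
        self.lstm = nn.LSTM(glove_weights.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        # Project the concatenated forward/backward states back to hidden_size.
        self.proj = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, token_ids):
        emb = self.dropout(self.embedding(token_ids))
        _, (h_n, _) = self.lstm(emb)
        # h_n[0]: forward LSTM state at the last token,
        # h_n[1]: backward LSTM state at the first token.
        v = torch.cat([h_n[0], h_n[1]], dim=-1)
        return F.relu(self.proj(v))
```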

AladarMiao commented 5 years ago

I am currently using a self-trained embedding, a BiLSTM, the last state vectors, concatenation, and a dense layer at the end. If what you stated is the case, where does cosine similarity come in? I am comparing my model with what's stated on page 8 of the paper, where the BiLSTM achieved 86.3 accuracy and 91.6 AUC.

yuanzh commented 5 years ago

  1. Each input is first mapped to a vector by the BiLSTM. Let v_l and v_r be the vectors of the left/right inputs.
  2. The final score is sigmoid(a(cosine_similarity(v_l, v_r) + b)) where a and b are learned variables. I'm not sure if the affine transformation makes a big difference.

Just to be more precise, we take the state at the last token for the forward LSTM and the state at the first token for the backward LSTM, concatenate the two states, and add a dense layer to project them to the required dimension (256).
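Putting it together, the scoring step described in the list above could be sketched like this. Again, this is only an illustration under the assumptions stated here (scalar a and b, a shared encoder for both sentences); the usage lines are placeholders, not the code behind the paper's numbers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineScorer(nn.Module):
    """Computes sigmoid(a * cosine_similarity(v_l, v_r) + b) with learned scalars a, b."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))
        self.b = nn.Parameter(torch.tensor(0.0))

    def forward(self, v_l, v_r):
        cos = F.cosine_similarity(v_l, v_r, dim=-1)
        return torch.sigmoid(self.a * cos + self.b)

# Hypothetical usage with a shared encoder (e.g. the BiLSTM sketch above);
# left_ids and right_ids stand for batches of token ids, and the output
# probability can be trained with binary cross-entropy against the labels.
# encoder = BiLSTMEncoder(glove_weights)
# prob = CosineScorer()(encoder(left_ids), encoder(right_ids))
```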

AladarMiao commented 5 years ago

Thanks!