lanwuwei / SPM_toolkit

Neural network toolkit for sentence pair modeling.

Tokenization method for the Quora dataset #13

Closed mapingshuo closed 5 years ago

mapingshuo commented 5 years ago

Hi, I am currently reimplementing the SSE model, and I am confused about how you pre-process quora_duplicate_questions.tsv:

  1. How do you generate /pytorch/DeepPairWiseWord/data/quora/a.tok and b.tok? What tokenization method do you use?
  2. How do you split the train/dev/test sets from quora_duplicate_questions.tsv? Do you use the same split as "Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of IJCAI." and "Neural Paraphrase Identification of Questions with Noisy Pretraining"?

I would appreciate it if you could answer my questions. Thank you.
lanwuwei commented 5 years ago
  1. I followed this code to generate a.tok and b.tok;
  2. Yes, I used the same split.
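
For readers who cannot follow the linked code, below is a minimal sketch of how a.tok / b.tok style files could be produced from quora_duplicate_questions.tsv. It is not the repository's actual preprocessing script: the tokenizer choice (NLTK here), lower-casing, and the label file name ("sim.txt") are all assumptions and may differ from what was really used.

```python
# Minimal sketch (not the repository's actual preprocessing script) of how
# a.tok / b.tok style files could be produced from quora_duplicate_questions.tsv.
# Assumptions: NLTK word tokenization, lower-casing, and a hypothetical label
# file name ("sim.txt"); the script referenced above may differ on all of these.
from nltk.tokenize import word_tokenize  # pip install nltk; nltk.download('punkt')


def preprocess(tsv_path, a_out, b_out, label_out):
    """Write one tokenized question per line to a_out/b_out, plus one label per line."""
    with open(tsv_path, encoding="utf-8") as f, \
         open(a_out, "w", encoding="utf-8") as fa, \
         open(b_out, "w", encoding="utf-8") as fb, \
         open(label_out, "w", encoding="utf-8") as fl:
        next(f)  # skip header: id, qid1, qid2, question1, question2, is_duplicate
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 6:
                continue  # skip rows broken by embedded newlines/tabs
            _, _, _, q1, q2, label = parts
            fa.write(" ".join(word_tokenize(q1.lower())) + "\n")
            fb.write(" ".join(word_tokenize(q2.lower())) + "\n")
            fl.write(label + "\n")


if __name__ == "__main__":
    preprocess("quora_duplicate_questions.tsv", "a.tok", "b.tok", "sim.txt")
```

Note that the train/dev/test split referred to in the reply (Wang et al., 2017) is a fixed, pre-released partition of the data, so it should be obtained from the original authors rather than regenerated with a random split.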
mapingshuo commented 5 years ago

Thanks ~