lanwuwei / SPM_toolkit

Neural network toolkit for sentence pair modeling.

Tokenization method for the Quora dataset #13

Closed mapingshuo closed 5 years ago

mapingshuo commented 5 years ago

Hi, I am currently reimplementing the SSE model, and I am confused about how you pre-process quora_duplicate_questions.tsv:

  1. How do you generate /pytorch/DeepPairWiseWord/data/quora/a.tok and b.tok? What tokenization method do you use?
  2. How do you split the train/dev/test sets from quora_duplicate_questions.tsv? Do you use the same split as "Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of IJCAI." and "Neural Paraphrase Identification of Questions with Noisy Pretraining"?

I would appreciate it if you could answer my questions. Thank you.
lanwuwei commented 5 years ago
  1. I followed this code to generate a.tok and b.tok;
  2. Yes, I used the same split.
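
For readers who cannot follow the linked code, below is a minimal sketch of how a.tok / b.tok style files could be produced from quora_duplicate_questions.tsv. It is not the repository's actual preprocessing script: the tokenizer choice (NLTK here), lower-casing, and the label file name ("sim.txt") are all assumptions and may differ from what was really used.

```python
# Minimal sketch (not the repository's actual preprocessing script) of how
# a.tok / b.tok style files could be produced from quora_duplicate_questions.tsv.
# Assumptions: NLTK word tokenization, lower-casing, and a hypothetical label
# file name ("sim.txt"); the script referenced above may differ on all of these.
from nltk.tokenize import word_tokenize  # pip install nltk; nltk.download('punkt')


def preprocess(tsv_path, a_out, b_out, label_out):
    """Write one tokenized question per line to a_out/b_out, plus one label per line."""
    with open(tsv_path, encoding="utf-8") as f, \
         open(a_out, "w", encoding="utf-8") as fa, \
         open(b_out, "w", encoding="utf-8") as fb, \
         open(label_out, "w", encoding="utf-8") as fl:
        next(f)  # skip header: id, qid1, qid2, question1, question2, is_duplicate
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 6:
                continue  # skip rows broken by embedded newlines/tabs
            _, _, _, q1, q2, label = parts
            fa.write(" ".join(word_tokenize(q1.lower())) + "\n")
            fb.write(" ".join(word_tokenize(q2.lower())) + "\n")
            fl.write(label + "\n")


if __name__ == "__main__":
    preprocess("quora_duplicate_questions.tsv", "a.tok", "b.tok", "sim.txt")
```

Note that the train/dev/test split referred to in the reply (Wang et al., 2017) is a fixed, pre-released partition of the data, so it should be obtained from the original authors rather than regenerated with a random split.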
mapingshuo commented 5 years ago

Thanks ~