How should I preprocess the data?

miyyer / scpn

syntactically controlled paraphrase networks


How should I preprocess the data? #2

Open shuangqinbuaa opened 6 years ago

shuangqinbuaa commented 6 years ago

If I just want to train the SCPN model, I only need to preprocess the para-nmt dataset. But what if I want to use SCPN to generate syntactically adversarial examples for a downstream task? Should I preprocess (for example, tokenize and apply BPE to) the para-nmt dataset together with the downstream task's dataset? How did you preprocess the SST and SICK data? @miyyer @jwieting Thank you very much!

Henry-E commented 6 years ago

Did you ever figure this out? It looks like they use a regular parse tree. But obviously it would be best to parse using the same process they did.

To be specific, I'm asking what the expected method is for parsing the input sentences before paraphrasing, i.e. how to get a parse like the one below from a sentence:

a person in a black jacket is doing tricks on a motorbike
(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN in) (NP (DT a) (JJ black) (NN jacket)))) (VP (VBZ is) (VP (VBG doing) (NP (NNS tricks)) (PP (IN on) (NP (DT a) (NN motorbike))))) (. .)))

Henry-E commented 6 years ago

Also, I'm curious how to create templates for generation. They have 10 default templates in the demo script, but it would be useful to understand how they created these in order to create new ones.

Henry-E commented 6 years ago

The Stanford NLP constituency parser seems to work well, though I am still curious about how to use different templates.
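For anyone trying the same route, here is a minimal sketch of getting one-line bracketed parses from a locally running Stanford CoreNLP server. The server URL, the annotator list, and the helper names `corenlp_annotate` and `one_line_parses` are assumptions for illustration; the thread does not confirm that this matches the authors' exact preprocessing.

```python
import json
import urllib.parse
import urllib.request


def corenlp_annotate(text, url="http://localhost:9000"):
    """Send raw text to a CoreNLP server and return the decoded JSON annotation.

    Assumes a CoreNLP server was started separately and listens at `url`.
    """
    props = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    request = urllib.request.Request(url + "/?" + query, data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def one_line_parses(annotation):
    """Collapse each sentence's pretty-printed parse into a single line."""
    return [" ".join(sentence["parse"].split()) for sentence in annotation["sentences"]]


if __name__ == "__main__":
    # Requires a running server; prints one bracketed parse per sentence.
    result = corenlp_annotate("a person in a black jacket is doing tricks on a motorbike")
    for parse in one_line_parses(result):
        print(parse)
```

The whitespace collapsing is only there because the server returns each parse spread over several indented lines, while the example above is on one line.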

miyyer commented 6 years ago

Sorry for the enormously delayed response! We have added some functions that run on top of the CoreNLP output to make it easier to get your data into the right format (see extract_parses in read_paranmt_parses.py). @jwieting will soon add a file containing all of the templates in ParaNMT sorted by frequency so you can play around with more of them (in our paper, we use the 20 most frequently occurring templates).
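Until that file is available, a rough sketch of how templates could be extracted and ranked from your own parses is shown below. It assumes a template is the top few levels of the constituency parse with everything deeper dropped; the cut-off depth, the output string format, and the helper names `parse_brackets`, `template`, and `top_templates` are illustrative assumptions, not the repository's code, so compare the result against the templates in the demo script before relying on it.

```python
from collections import Counter


def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples; words stay str."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word under a POS tag
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()


def template(tree, depth=2):
    """Keep only the labels of the top `depth` levels below the root node."""
    label, children = tree
    if depth == 0:
        return "(" + label + ")"
    inner = " ".join(template(c, depth - 1) for c in children if not isinstance(c, str))
    return "(" + label + (" " + inner if inner else "") + ")"


def top_templates(parses, n=20):
    """Count templates over many parse strings and return the n most frequent."""
    return Counter(template(parse_brackets(p)) for p in parses).most_common(n)
```

Calling `top_templates` on a list of one-line parses returns `(template, count)` pairs, which is enough to inspect which syntactic shapes dominate a corpus.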

kj-lai commented 5 years ago

Hi, just a friendly reminder, any update on the templates?

zhengliz commented 5 years ago

Hi @miyyer @jwieting, just a friendly reminder, could you kindly share how the paranmt dataset is preprocessed (tokenizing, BPE, etc.)? Thanks

LeeShiyang commented 4 years ago

> Hi @miyyer @jwieting, just a friendly reminder, could you kindly share how the paranmt dataset is preprocessed (tokenizing, BPE, etc.)? Thanks

I also want to know the BPE and tokenizing part.

santimarro commented 4 years ago

I also want to know about the templates!