tagoyal / sow-reap-paraphrasing

Contains data/code for the paper "Neural Syntactic Preordering for Controlled Paraphrase Generation" (ACL 2020).

Waiting for get_ground_truth_alignments.py #7

Closed ramya-raghu25 closed 3 years ago

ramya-raghu25 commented 4 years ago

hi @tagoyal,

Really nice work! Trying to use your algorithm for a custom dataset. Would it be possible for you to release get_ground_truth_alignments.py for REAP?

-Ramya

tagoyal commented 4 years ago

The code resides in the processing folder. The readme there explains it.

You can run create_reap_data.sh to generate the ground truth reap data. More details are in the readme.

ramya-raghu25 commented 4 years ago

I wanted to know how you generate sample_test_sow_reap.txt and sample_test_gt_reap.txt, which are given as input to create_sow_data.sh and create_reap_data.sh. You mention using the Stanford NLP parser to generate this data, but this step is unclear to me:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -preserveLines -ssplit.eolonly true -outputFormat text -file sample_test_baseline.txt

I would also like to know where the Stanford CoreNLP parser folder and sample_test_baseline.tok are located.

tagoyal commented 4 years ago

sample_test_baseline.txt is the custom dataset that is used. It contains paraphrase pairs in the following format:

sentence1
paraphrase1
[blank line]
sentence2
paraphrase2
[blank line]
....
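For anyone preparing a custom dataset, the layout described above can be produced with a few lines of Python. This is only a sketch: it assumes each sentence and its paraphrase sit on their own lines with a blank line after every pair, and the names `format_pairs` and `write_pairs` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path
from typing import Iterable, Tuple


def format_pairs(pairs: Iterable[Tuple[str, str]]) -> str:
    """Lay out (sentence, paraphrase) pairs: one item per line, blank line after each pair."""
    blocks = []
    for sentence, paraphrase in pairs:
        # Collapse internal whitespace and newlines so each item stays on a single line
        # (assumed layout: sentence line, paraphrase line, blank line).
        sentence = " ".join(sentence.split())
        paraphrase = " ".join(paraphrase.split())
        blocks.append(f"{sentence}\n{paraphrase}\n\n")
    return "".join(blocks)


def write_pairs(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write the formatted pairs to `path` as UTF-8 text."""
    Path(path).write_text(format_pairs(pairs), encoding="utf-8")
```

Calling `write_pairs([("the cat sat .", "a cat was sitting .")], "sample_test_baseline.txt")` would then create an input file with a single pair.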

Please download the Stanford CoreNLP module from https://nlp.stanford.edu/software/. The java command above is run on this sample_test_baseline.txt file. Please follow their documentation to set up the parser. The command needs to be run from the parser root directory.
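If it is more convenient to launch the parser from a script, the same java command can be wrapped with Python's `subprocess` module. This is a rough sketch rather than anything shipped with the repository: `corenlp_command` and `run_corenlp` are hypothetical helper names, and `corenlp_root` stands for wherever the CoreNLP download was unpacked. The flags are copied unchanged from the command quoted earlier in this thread.

```python
import subprocess
from pathlib import Path
from typing import List


def corenlp_command(input_file: str) -> List[str]:
    """Build the CoreNLP invocation quoted in this thread as an argument list."""
    return [
        "java", "-mx4g", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,parse",
        "-preserveLines",
        "-ssplit.eolonly", "true",
        "-outputFormat", "text",
        "-file", input_file,
    ]


def run_corenlp(corenlp_root: str, input_file: str) -> None:
    """Run the parser from the CoreNLP root directory (hypothetical path argument)."""
    # The "*" classpath is expanded by java relative to the working directory,
    # which is why the command is run with cwd set to the parser root.
    absolute_input = str(Path(input_file).resolve())
    subprocess.run(corenlp_command(absolute_input), cwd=corenlp_root, check=True)
```

Passing the arguments as a list keeps the shell from expanding `*` itself, so java receives the wildcard classpath as written. The input path is made absolute because the working directory changes to the parser root.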

This will generate the sample_test_sow_reap.txt file that is required as input for the create_sow_data.sh and create_reap_data.sh.

The sample_test_gt_reap.txt file is one of the intermediate outputs of create_reap_data.sh. It will be stored in the intermediate folder that you specify.

Hope this helps!