thu-coai / NAST

Codes for "NAST: A Non-Autoregressive Generator with Word Alignment for Unsupervised Text Style Transfer" (ACL 2021 findings)
MIT License

Running with Custom Dataset #3

Open aflah02 opened 2 years ago

aflah02 commented 2 years ago

Hey! Great Paper!! Can you share some instructions for formatting a custom dataset as well?

hzhwcmhf commented 2 years ago

You can download the preprocessed Yelp dataset and format your custom dataset following the instructions in the READMEs.

Please feel free to ask if you have any further questions.

aflah02 commented 2 years ago

@hzhwcmhf Thanks for the instructions!

aflah02 commented 2 years ago

@hzhwcmhf Can the test files be run without multiple human references? The paper cites Luo et al. (2019) for the Yelp dataset because they provide multiple references, but there is no such mention for GYAFC. I don't have multiple human references, so I would like to know whether the code already handles single references automatically or whether I need to make the changes manually.

hzhwcmhf commented 2 years ago

Hi, @aflah02

First, we use multiple human references for GYAFC as well. You can find the references here. Multiple references are recommended when evaluating style transfer models because they cover more possible transferred phrases, leading to more reliable results.

Second, it should be OK if your test files contain only one reference per sample. For example, the test file can be:

ever since joes has changed hands it 's just gotten worse and worse .
ever since joes has changed hands it 's gotten better and better .

there is definitely not enough room in that part of the venue .
there is so much room in that part of the venue

...... (NOTE: THE BLANK LINE IS REQUIRED)

(If it does not work, please tell me. I will figure out the problem.)

Moreover, you can change the format of the input file here:

https://github.com/thu-coai/NAST/blob/ef765d412f6e9a2ebdcc7d62c99ec2e883d0e17a/styletransformer/main.py#L35-L42

where SentenceDefault indicates a single line, and SessionDefault indicates multiple lines terminated by an empty line.
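
For anyone preparing a custom test file, here is a minimal sketch of writing and re-reading the blank-line-terminated layout described above. It assumes each sample is the source sentence on one line, followed by one or more reference sentences on the next lines, and closed by an empty line. The helper names `write_sessions` and `read_sessions` are hypothetical and are not part of the NAST code; the sketch only illustrates the file layout, not how the repository actually loads it.

```python
import io
from typing import Iterable, List, Sequence, TextIO


def write_sessions(samples: Iterable[Sequence[str]], out: TextIO) -> None:
    """Write each sample as consecutive lines followed by one empty line.

    Assumption: the first line of a sample is the source sentence and the
    remaining lines are its references (one is enough for this sketch).
    """
    for lines in samples:
        for line in lines:
            out.write(line.strip() + "\n")
        out.write("\n")  # the blank line that closes the sample


def read_sessions(inp: TextIO) -> List[List[str]]:
    """Group non-empty lines into samples, splitting on empty lines."""
    sessions: List[List[str]] = []
    current: List[str] = []
    for raw in inp:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            sessions.append(current)
            current = []
    if current:
        # This sketch tolerates a missing final blank line; the real
        # loader may not, so always end the file with an empty line.
        sessions.append(current)
    return sessions


# Single-reference samples taken from the example in this thread.
samples = [
    [
        "ever since joes has changed hands it 's just gotten worse and worse .",
        "ever since joes has changed hands it 's gotten better and better .",
    ],
    [
        "there is definitely not enough room in that part of the venue .",
        "there is so much room in that part of the venue",
    ],
]

buffer = io.StringIO()
write_sessions(samples, buffer)
round_trip = read_sessions(io.StringIO(buffer.getvalue()))
```

To produce a real file, pass an opened file object instead of the `StringIO` buffer, for example `with open("test.txt", "w", encoding="utf-8") as f: write_sessions(samples, f)`, where the file name is only a placeholder.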