salesforce / GeDi

GeDi: Generative Discriminator Guided Sequence Generation
https://arxiv.org/abs/2009.06367
BSD 3-Clause "New" or "Revised" License
208 stars 46 forks

What to change in secondary_code argument for mode --sentiment if I want to generate negative sentiment text? #1

Closed yugaljain1999 closed 4 years ago

yugaljain1999 commented 4 years ago

@akhileshgotmare I was trying to generate negative sentiment text instead of the default one. How can I do that?

Thanks!

yugaljain1999 commented 4 years ago

One more thing I want to ask: what is the difference between gen_type --gpt2 / --cclm and gen_type --gedi? They look similar, since both are conditioned on secondary_code and a mode like sentiment, detoxify, or topic.

Thanks!

yugaljain1999 commented 4 years ago

My last question: if I want to train GeDi on my own data, do I have to train the whole network, or is training just the last layer enough to learn embeddings for the additional tokens? Thanks!

benkrause commented 4 years ago

Hi! To answer your questions:

  1. Sentiment doesn't use a secondary code; it only uses --code_desired and --code_undesired, but these are set automatically if you run the shell script.

To get negative sentiment, run run_generation.sh and set --mode sentiment. You'll be prompted with the option to switch to negative sentiment; when this happens, type n and press enter, and then you can give the model your prompt.

  2. --gpt2 and --cclm were baselines for --gedi. --gpt2 just generates from OpenAI's GPT-2 language model using greedy decoding and a repetition penalty (it will be the same regardless of attribute codes). --cclm generates from a class-conditional language model directly conditioned on an attribute variable. --gedi is the method described in our paper, where we guide generation from GPT-2 using a language model that conditions on an attribute variable. Both --gedi and --cclm can control generation to an extent, but --gedi tends to give much more interesting and diverse responses for different prompts.
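The contrast between --cclm and --gedi can be sketched with a toy next-token step. This is only a simplified illustration of the weighted-decoding idea from the paper (a Bayes-rule class posterior, computed from two class-conditional LMs, reweighting GPT-2's distribution); the logit values and the guidance weight `omega` below are made up for illustration and are not the repo's actual code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy next-token logits over a 5-token vocabulary (illustrative numbers).
gpt2_logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])  # base LM (what --gpt2 samples from)
pos_logits  = np.array([2.5, 0.2, 0.3, 0.1, -1.2])  # CC-LM conditioned on the desired code
neg_logits  = np.array([0.5, 2.2, 0.4, 0.1, -0.8])  # CC-LM conditioned on the undesired code

# --cclm: sample directly from the class-conditional LM itself.
p_cclm = softmax(pos_logits)

# --gedi: use Bayes rule to get, per candidate token, the probability that
# the continuation has the desired attribute, then reweight GPT-2's distribution.
log_pos = np.log(softmax(pos_logits))
log_neg = np.log(softmax(neg_logits))
p_class = 1.0 / (1.0 + np.exp(log_neg - log_pos))   # P(desired attribute | token)

omega = 5.0                                         # guidance strength (illustrative)
p_gedi = softmax(gpt2_logits) * p_class ** omega
p_gedi /= p_gedi.sum()                              # renormalize into a distribution
```

In this toy step, tokens the undesired-code CC-LM favors (here, token 1) get a low class posterior and are heavily downweighted in `p_gedi`, while the base GPT-2 distribution still shapes everything else.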

  3. If you want to train your own GeDi, it's advisable to train the whole network. Last-layer-only training would not work as well and would require some modification to the codebase.

yugaljain1999 commented 4 years ago

@benkrause Thanks for your valuable responses. One follow-up on my last question: how should I build my labelled dataset? In your default AG News dataset there are four topics, with one topic assigned per sentence. Do I also have to use four topics, or can I have more or fewer? And if I can change the number of topics, which Python file or script should I update?

Another thing I want to ask: what is the purpose of the second column in the AG News train and test files? Its entries are only 4 to 5 words long, and I don't understand why they are necessary.


One last question: how can I label each sentence with a specific topic, given that I only have a preprocessed text file of sentences? So far I have applied LDA to classify the sentences, but instead of broad topics like politics, crime, or sports, I get a set of topic words for each sentence.

Thanks!

benkrause commented 4 years ago

The second column of AG News is just the article titles; we don't actually use these. Our scripts only process the first and third columns: they assume the topic labels are in the first column (and start at 1), and the text is in the third column.

If you want to train on your own topic dataset with minimal changes, first set up new csv files in the same format as the AG News train and test csv files: topic label IDs in the first column, a second column that can be blank since we ignore it anyway, and the text in the third column.
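Following the three-column layout described above, writing such files could look like this (the labels and sentences here are made-up examples, and `train.csv` is a hypothetical output path):

```python
import csv

# Hypothetical 2-topic dataset in the AG News csv layout:
# column 1 = topic label ID (starting at 1), column 2 = unused, column 3 = text.
rows = [
    (1, "", "The team won the championship after a dramatic final."),
    (2, "", "Shares rallied after the company beat earnings estimates."),
]

# newline="" keeps the csv module in control of line endings across platforms.
with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```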

Once you have replaced the AG-news train and test csv files with your own, you can process them into a dataset suitable for GeDi with python proc_data.py. To change the topics, you will have to change the list on line 16 in proc_data.py, which currently specifies the topic names used for AG-news. Make sure the list corresponds to the topic labels you saved in the csv files, so the first topic in the list should correspond to the label "1", second topic should correspond to the label "2", etc. You can potentially have as many topics as you want, as long as you have data and numbered labels for these topics in your csv file.
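To catch label/topic mismatches before training, a quick sanity check like the following can help. The topic list below is a hypothetical replacement for the AG News list in proc_data.py, and `check_labels` is a helper written for this example, not part of the repo:

```python
import csv

# Hypothetical stand-in for the topic-name list on line 16 of proc_data.py;
# index 0 corresponds to label "1", index 1 to label "2", and so on.
topics = ["politics", "crime", "sports"]

def check_labels(csv_path, topics):
    """Verify every label ID in the csv has a corresponding topic name."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            label = int(row[0])
            assert 1 <= label <= len(topics), f"label {label} has no topic name"
    return True
```

Running this over your new train and test csv files before proc_data.py makes the "labels start at 1 and align with the list" convention explicit.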

As for your last question on how to use unlabeled data: that is something we haven't explored yet; all our experiments so far have used labeled datasets. I will mention that GeDi can often generate for topics it hasn't seen during training. For instance, if you run our topic GeDi trained on AG News (which was trained on "world", "sports", "business", and "science") and give it a secondary code of "crime", then depending on the prompt it should sometimes be able to generate text relating to crime.

Hope this helps!

yugaljain1999 commented 4 years ago

@benkrause Thanks for your valuable response, it really helped me a lot.