One more thing I wanna ask: what is the difference between gen_type --gpt2 / --cclm and gen_type --gedi? Both look similar, as both are conditioned on a secondary_code and a mode like sentiment, detoxify, or topic.
Thanks!
My last question is: if I wanna train GeDi on my own data, do I have to train the whole network, or is training just the last layer enough to learn the embeddings of the additional tokens? Thanks!
Hi! To answer your questions:

Negative sentiment is controlled by the --code_desired and --code_undesired arguments, but these are set automatically if you run the shell script. To get negative sentiment, run run_generation.sh and set --mode sentiment. You'll be prompted with the opportunity to change to negative sentiment. When this happens, type n and press enter, and then you can give the model your prompt.
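In case it helps, here is a minimal Python sketch of scripting that interactive flow instead of typing the answers by hand. It is only an illustration: it assumes run_generation.sh sits at the repository root, that --mode can be passed on the command line, and that the y/n answer and the prompt are read from stdin, so adjust it to however the script actually behaves in your copy.

```python
# Minimal sketch (not part of the repo) of driving the interactive sentiment flow.
import subprocess

answers = "n\nThe movie was\n"  # "n" switches to negative sentiment, then the prompt

subprocess.run(
    ["bash", "run_generation.sh", "--mode", "sentiment"],
    input=answers,  # fed to the script's stdin in place of typed answers
    text=True,
)
```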
--gpt2 and --cclm were baselines for --gedi. --gpt2 just generates from OpenAI's GPT-2 language model using greedy decoding and a repetition penalty (it will be the same regardless of attribute codes). --cclm generates directly from a language model conditioned on an attribute variable. --gedi is the method described in our paper, where we guide generation from GPT-2 using a language model that conditions on an attribute variable. Both --gedi and --cclm can control generation to an extent, but --gedi tends to give much more interesting and diverse responses for different prompts.
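To make the distinction concrete, here is a schematic Python sketch of the kind of guided decoding --gedi refers to. It is a simplification of the idea in the paper, not the repo's actual implementation, and the function and argument names are made up for illustration.

```python
import numpy as np

def gedi_step(p_gpt2, p_cc_desired, p_cc_undesired, omega=30.0):
    """One schematic guided-decoding step.

    p_gpt2:         GPT-2's next-token probabilities, shape (vocab,).
    p_cc_desired:   next-token probabilities from the class-conditional LM
                    run with the desired attribute code.
    p_cc_undesired: the same, but with the undesired attribute code.
    """
    # Bayes rule (equal priors): per-token posterior of the desired attribute.
    posterior = p_cc_desired / (p_cc_desired + p_cc_undesired + 1e-12)
    # Bias GPT-2 toward tokens the class-conditional LM attributes to the
    # desired class; omega controls how strong the guidance is.
    weighted = p_gpt2 * posterior ** omega
    return weighted / weighted.sum()
```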
If you want to train your own GeDi, it's advisable to train the whole network. Training only the last layer would not work as well, and would require some modification to the codebase.
@benkrause Thanks for your valuable responses. I wanna ask one more thing following my last question: how should I build my labelled dataset? In your default AG News dataset there are four topics and each sentence is assigned one topic. Do I also have to make a dataset with four topics, or can I have more or fewer than four? And if I can change the number of topics, which Python file or script should I update to do that?
Another thing I wanna ask: what is the purpose of the second column in the train and test files of AG News? Its entries are only 4 to 5 words long, and I don't understand why they are needed.
One last question: how can I label each sentence with a specific topic when all I have is a preprocessed text file of sentences? So far I have applied LDA to classify the sentences, but instead of broad topics like politics, crime, or sports, I get a set of topics for each sentence.
Thanks!
The second column of AG News is just the article titles; we don't actually use these. Our scripts only process the first and third columns: they assume the topic labels are in the first column (and start at 1), and the text is in the third column.
If you want to train on your own topic dataset with minimal changes, first set up new csv files in the same format as the AG News train and test csv files: topic label IDs in the first column, the second column can be blank since we ignore it anyway, and text in the third column.
If you want to avoid having to specify paths in the processing and training scripts, you could save your csv files with the same names in the same directory that we download to (data/AG-news/train.csv and data/AG-news/test.csv for the train and test splits). This would be the simplest, but would overwrite AG-news.
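As a quick illustration of both points (the column layout and the overwrite-in-place option), here is a small Python sketch that writes a train.csv in that format; the rows and topic labels are placeholders for your own data.

```python
import csv

# Numeric topic label (starting at 1), an unused second column, then the text.
rows = [
    (1, "", "The senate passed the new budget bill late on Tuesday."),
    (2, "", "The home team won the final in overtime."),
    (3, "", "Shares fell sharply after the earnings report."),
]

# Writing to data/AG-news/train.csv overwrites the downloaded AG-news split,
# as noted above; point this somewhere else if you want to keep AG-news.
with open("data/AG-news/train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```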
Alternatively, you could save them in a new directory if you replace the paths in proc_data.py and specify the directory in the --data_dir argument in scripts/run_training.sh.
Once you have replaced the AG-news train and test csv files with your own, you can process them into a dataset suitable for GeDi with python proc_data.py. To change the topics, you will have to change the list on line 16 in proc_data.py, which currently specifies the topic names used for AG-news. Make sure the list corresponds to the topic labels you saved in the csv files, so the first topic in the list should correspond to the label "1", second topic should correspond to the label "2", etc. You can potentially have as many topics as you want, as long as you have data and numbered labels for these topics in your csv file.
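For example, the edit to that list might look roughly like the following; the variable name here is a placeholder, so use whatever name the list on line 16 of proc_data.py actually has.

```python
# Order must match the numeric labels in your csv files:
# label "1" -> "politics", label "2" -> "sports", ..., label "5" -> "technology"
topic_names = ["politics", "sports", "business", "crime", "technology"]
```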
As for your last question on how to use unlabeled data, that is something we haven't explored yet; all our experiments so far have used labeled datasets. I will mention that GeDi can often generate to topics it hasn't seen during training. For instance, if you run our topic GeDi trained on AG-news (which was trained on "world", "sports", "business" and "science"), and give it a secondary code of "crime", depending on the prompt, it should sometimes be able to generate text relating to crime.
Hope this helps!
@benkrause Thanks for your valuable response, it really helped me a lot.
@akhileshgotmare I was trying to generate negative-sentiment text instead of the default. How can I do that?
Thanks!