Closed GenTxt closed 3 years ago
Could you send the full code you used?
Overall the parody procedure is not very reliable. I haven't put as much effort into it because I've been getting much more promising results with BERT-style models.
Hi Jeff.
Did a bit more testing. Seems my larger 774M models are the issue. Just tested a 345M model and no problem. Better results, as you say, from BERT-style models.
Attached my script for reference.
Cheers
Hi Jeff:
I forgot to mention that topic="And what happened next" affects each of the output lines based on lines from the input text, producing repeated variations on the same theme.
I was thinking that if topic= became another text input process, e.g. file_open = open("topics.txt", "r"), then each input line would be affected by a different line from 'topics.txt' and would therefore generate different output lines instead of similar variations of the first.
If you have a suggestion on how to implement that, it would be great. I'll check some code snippets from other scripts and see if something works.
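A minimal sketch of that per-line topic idea. The pair_lines helper is my own name, not part of the repo; the commented-out parody() call assumes the function from the visions-and-revisions repo with a topic parameter.

```python
# Sketch: pair each line of input.txt with the corresponding line of
# topics.txt, so each generated line gets its own topic hint.
# The pairing logic below is plain Python; the parody() usage in the
# comment assumes the repo's function and is untested here.

def pair_lines(input_lines, topic_lines):
    """Pair input line i with topic line i; if topics run out,
    reuse the last topic."""
    pairs = []
    for i, line in enumerate(input_lines):
        topic = topic_lines[min(i, len(topic_lines) - 1)]
        pairs.append((line.rstrip("\n"), topic.rstrip("\n")))
    return pairs

# with open("input.txt") as f_in, open("topics.txt") as f_top, \
#         open("output.txt", "w") as f_out:
#     for line, topic in pair_lines(f_in.readlines(), f_top.readlines()):
#         print(parody(line, model='models/gpt2', topic=topic), file=f_out)
```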
Cheers
Thanks! I don't seem to have gotten the attachment, but I've been able to produce a similar behavior with a smaller model. I will try to fix it when I get a chance.
That's an interesting suggestion with respect to different topics for each line! To do that, you could split the text into lines and run the depoeticize or parody procedure for each line individually. To maintain grammatical coherence, you could also provide the model with the previous and next n lines, encased in {} brackets (which tells the program not to modify that text). This is similar to what the banalify function does, but that function is based on chunks with set numbers of tokens, not on lines. Right now the banalify function always uses the same title hint for every chunk, but that could easily be changed.
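A rough illustration of the per-line approach with bracketed context. The bracket_context helper is a hypothetical name of mine; it only builds the string, assuming {} marks unmodifiable text as described above.

```python
# Sketch: wrap a line's neighbors in {} brackets (which the program
# treats as text it must not modify) before running depoeticize or
# parody on each line individually. Helper name is hypothetical.

def bracket_context(lines, i, n=1):
    """Return line i with up to n previous and next lines in {} brackets."""
    prev = " ".join(lines[max(0, i - n):i])
    nxt = " ".join(lines[i + 1:i + 1 + n])
    parts = []
    if prev:
        parts.append("{" + prev + "}")
    parts.append(lines[i])
    if nxt:
        parts.append("{" + nxt + "}")
    return " ".join(parts)
```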
The topic/title feature works much better if you finetune the model for this particular task. You will have to train it on a bunch of texts with annotations formatted the same way as the topic hints. The reason I changed the depoeticize parameter from "topic" to "title" was that it was easier to get training data for the latter—namely, a bunch of poems with their titles marked up. I didn't bother doing this with the parody function because it didn't seem worth the trouble to finetune the GPT2 model.
Attached as .txt file
Examining the output, it appears the script always tries to write the maximum of 1023 tokens for each input line, but the generated lines are all limited by the number of words/tokens in each input line.
For example, the following 45-word line was generated by a 774M model from a 45-word input line using topic="And what happened next."
He went to the office and sat on the cot where he had sat on the night before and waited for her to come out of the bedroom, still clutching the coat, and drew out the two heavy pistols and laid them on the desk.
A new input file containing lines of varying lengths generated something like this:
He went to the office and sat on the cot (based on a 10-word input line)
He went to the office and sat on the cot where he had sat on the night before and (based on a 19-word input line)
He went to the (based on a 4-word input line)
etc.
The input file functions as a template for output line length, while topic="And what happened next." (or 'topics.txt') primes the GPT2 model for generation. Nice combination.
from visions import *

file_open = open("input.txt", "r")
file_output = open("output.txt", "w")

for line in file_open:
    output = parody(line,
                    model='models/gpt2',
                    match_meter=False,   # True
                    match_rhyme=False,   # True
                    topic=None,
                    randomize=0.005,     # 0.00
                    verbose=True,
                    modifier=None)
    print(output + "\n", file=file_output)

file_open.close()
file_output.close()
This is the expected behavior, including the fact that the lines get cut off in mid-sentence. The problem is that, since GPT2 cannot account for what is coming up ahead, it often starts down paths that don't fit the pattern; this is the main reason I switched to BERT.
I pushed a change that fixes the issue I was having with the parody function getting stuck in a loop. The meter matching feature was not properly handling the tokens GPT2 inserts at the ends of lines, which was preventing the model from generating anything. However, this problem only occurs with match_meter=True, so I'm not sure if it's the same problem you were encountering.
Thanks for the update. Working now with topic=None, although topic="sentence here ..." generates better results. Still trying to figure out how to add topic= as a subroutine that reads from a 'topics.txt' file.
Also trying to fine-tune a BERT masked LM. I have no problem generating GPT2 checkpoints, but transformers repos 2.11 to 4.0.0 are not completing BERT/RoBERTa fine-tuning on an unlabeled text corpus. It is possible to save a checkpoint with high loss around iteration 1000. Training gets about 2-10% complete, then fails with the weird 'RuntimeError: The size of tensor a (569) must match the size of tensor b (512) at non-singleton dimension 1'.
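That tensor-size error usually means a tokenized sequence exceeded the model's 512-position limit (here 569 > 512). A minimal sketch of one workaround, chunking token lists down to the model's maximum; the chunk_tokens helper is mine, not part of the transformers scripts.

```python
# Sketch: split an over-long token sequence into consecutive chunks of
# at most max_len tokens, so no single input exceeds the model's
# 512-position limit. The training scripts can handle this via their
# own flags; this only illustrates the idea.

def chunk_tokens(tokens, max_len=512):
    """Split a token list into consecutive chunks of at most max_len."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```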
Also, found the 'Corpus of Historical American English 1800-2000', with models for each decade:
https://huggingface.co/models?search=coha
"architectures": ["RobertaForMaskedLM"]
Unfortunately, when used with the 'depoeticize' function it replaces almost every word with 'the' (???).
Will stick with the default Bert/Roberta models for now.
Cheers,
I'm not sure why you'd get that error while finetuning. I finetuned BERT on the Poetry Foundation Corpus, which I cleaned and reformatted using the clean_pofo_corpus.py script that's in the repo. Here is the command I used to finetune, using the script included with the transformers package:
python ../transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=pofo \
    --model_type=bert \
    --model_name_or_path=bert-base-uncased \
    --do_train \
    --train_data_file=pofo.corpus \
    --mlm
I added some new visualization options (see the outfile parameter of depoeticize). I'll use this to scope out what might be going on with the COHA model.
Thanks for the update.
Did a bit more digging and the simple solution was using --max_seq_length= (32-512) together with --line_by_line.
The Hugging Face tutorial only mentions --line_by_line.
Before that I stumbled upon using fmt to format the corpus lines to a uniform size. That worked, but --max_seq_length= allows for any corpus line size.
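Putting the two flags together with a fine-tuning invocation would look something like this. This is a sketch, not a verified command: the output_dir, model, and corpus names are placeholders, and the exact flag names differ between transformers versions (the older run_language_modeling.py script uses --block_size rather than --max_seq_length).

```shell
# Hypothetical fine-tuning command combining --line_by_line with a
# capped sequence length, so no tokenized line exceeds 512 positions.
python ../transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=roberta-finetuned \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=corpus.txt \
    --mlm \
    --line_by_line \
    --max_seq_length=512
```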
Will check out the updated code.
Enjoy the holidays
Digging deeper into your repo and checking all the options.
Parody process works fine with topic="Seed sentence here ..." and outputs to terminal and file, but the default topic=None hangs the process.
nvidia-smi shows the same GPU memory use, but nothing appears in the terminal.
Attempted fix:
Removed topic=None and topic_prefix="" from def parody
Then edited the following:
if topic:
    toks1 = tokenizer.tokenize("{0} {1} {2}. ".format(eos_token, topic_prefix, topic))
else:
    toks1 = [eos_token]
start = len(toks1)
to:
toks1 = [eos_token]
start = len(toks1)
Same result. Process appears to hang.
Using Ubuntu 18.04 and Python 3.6.9.
Would appreciate any suggestions. Will test your updated repo.
Cheers,