Closed GenTxt closed 3 years ago
Could you send the full code you used?
Overall the parody procedure is not very reliable. I haven't put as much effort into it because I've been getting much more promising results with BERT-style models.
Hi Jeff.
Did a bit more testing. Seems my larger 774M models are the issue. Just tested a 345M model and no problem. Better results, as you say, from BERT-style models.
Attached my script for reference.
Cheers
Hi Jeff:
I forgot to mention that topic="And what happened next" affects each of the output lines based on lines from the input text, producing repeated variations on the same theme.
I was thinking that if topic= became another text input process, e.g. file_open = open("topics.txt", "r"), then each input line would be affected by a different line from 'topics.txt' and would therefore generate different output lines instead of similar variations of the first.
If you have a suggestion on how to implement that, it would be great. I'll check some code snippets from other scripts and see if something works.
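A minimal sketch of that per-line topic idea. The pair_lines helper is my own name, not part of the repo; the commented-out parody() call assumes the function from the visions-and-revisions repo with a topic parameter.

```python
# Sketch: pair each line of input.txt with the corresponding line of
# topics.txt, so each generated line gets its own topic hint.
# The pairing logic below is plain Python; the parody() usage in the
# comment assumes the repo's function and is untested here.

def pair_lines(input_lines, topic_lines):
    """Pair input line i with topic line i; if topics run out,
    reuse the last topic."""
    pairs = []
    for i, line in enumerate(input_lines):
        topic = topic_lines[min(i, len(topic_lines) - 1)]
        pairs.append((line.rstrip("\n"), topic.rstrip("\n")))
    return pairs

# with open("input.txt") as f_in, open("topics.txt") as f_top, \
#         open("output.txt", "w") as f_out:
#     for line, topic in pair_lines(f_in.readlines(), f_top.readlines()):
#         print(parody(line, model='models/gpt2', topic=topic), file=f_out)
```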
Cheers
Thanks! I don't seem to have gotten the attachment, but I've been able to produce a similar behavior with a smaller model. I will try to fix it when I get a chance.
That's an interesting suggestion with respect to different topics for each line! To do that, you could split the text into lines and run the depoeticize or parody procedure for each line individually. To maintain grammatical coherence, you could also provide the model with the previous and next n lines, encased in {} brackets (which tells the program not to modify that text). This is similar to what the banalify function does, but that function is based on chunks with set numbers of tokens, not on lines. Right now the banalify function always uses the same title hint for every chunk, but that could easily be changed.
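A rough illustration of the per-line approach with bracketed context. The bracket_context helper is a hypothetical name of mine; it only builds the string, assuming {} marks unmodifiable text as described above.

```python
# Sketch: wrap a line's neighbors in {} brackets (which the program
# treats as text it must not modify) before running depoeticize or
# parody on each line individually. Helper name is hypothetical.

def bracket_context(lines, i, n=1):
    """Return line i with up to n previous and next lines in {} brackets."""
    prev = " ".join(lines[max(0, i - n):i])
    nxt = " ".join(lines[i + 1:i + 1 + n])
    parts = []
    if prev:
        parts.append("{" + prev + "}")
    parts.append(lines[i])
    if nxt:
        parts.append("{" + nxt + "}")
    return " ".join(parts)
```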
The topic/title feature works much better if you finetune the model for this particular task. You will have to train it on a bunch of texts with annotations formatted the same way as the topic hints. The reason I changed the depoeticize parameter from "topic" to "title" was that it was easier to get training data for the latter—namely, a bunch of poems with their titles marked up. I didn't bother doing this with the parody function because it didn't seem worth the trouble to finetune the GPT2 model.
Attached as .txt file
Examining the output, it appears the script always tries to write the maximum of 1023 tokens for each input line, but the generated lines are all limited by the number of words/tokens in each input line.
For example, the following 45-word line was generated by a 774M model from a 45-word input line using topic="And what happened next."
He went to the office and sat on the cot where he had sat on the night before and waited for her to come out of the bedroom, still clutching the coat, and drew out the two heavy pistols and laid them on the desk.
A new input file containing lines of varying lengths generated something like this:
He went to the office and sat on the cot (based on a 10-word input line)
He went to the office and sat on the cot where he had sat on the night before and (based on a 19-word input line)
He went to the (based on a 4-word input line)
etc.
The input file functions as a template for output line length, while topic="And what happened next." (or 'topics.txt') primes the GPT2 model for generation. Nice combination.
from visions import *

file_open = open("input.txt", "r")
file_output = open("output.txt", "w")

for line in file_open:
    output = parody(line,
                    model='models/gpt2',
                    match_meter=False,   # True
                    match_rhyme=False,   # True
                    topic=None,
                    randomize=0.005,     # 0.00
                    verbose=True,
                    modifier=None)
    print(output + "\n", file=file_output)

file_open.close()
file_output.close()
This is the expected behavior, including the fact that the lines get cut off in mid-sentence. The problem is that, since GPT2 cannot account for what is coming up ahead, it often starts down paths that don't fit the pattern; this is the main reason I switched to BERT.
I pushed a change that fixes the issue I was having with the parody function getting stuck in a loop. The meter matching feature was not properly handling the tokens GPT2 inserts at the ends of lines, which was preventing the model from generating anything. However, this problem only occurs with match_meter=True, so I'm not sure if it's the same problem you were encountering.
Thanks for the update. Working now with topic=None, although topic="sentence here ..." generates better results. Still trying to figure out how to add topic= as a subroutine that reads from a 'topics.txt' file.
Also trying to fine-tune a BERT masked LM. I have no problem generating GPT2 checkpoints, but transformers repos 2.11 to 4.0.0 are not completing BERT/RoBERTa fine-tuning on an unlabeled text corpus. It is possible to save a checkpoint with high loss around iteration 1000. Training gets about 2-10% complete, then fails with the weird 'RuntimeError: The size of tensor a (569) must match the size of tensor b (512) at non-singleton dimension 1'.
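That tensor-size error usually means a tokenized sequence exceeded the model's 512-position limit (here 569 > 512). A minimal sketch of one workaround, chunking token lists down to the model's maximum; the chunk_tokens helper is mine, not part of the transformers scripts.

```python
# Sketch: split an over-long token sequence into consecutive chunks of
# at most max_len tokens, so no single input exceeds the model's
# 512-position limit. The training scripts can handle this via their
# own flags; this only illustrates the idea.

def chunk_tokens(tokens, max_len=512):
    """Split a token list into consecutive chunks of at most max_len."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```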
Also, found the 'Corpus of Historical American English 1800-2000', with models for each decade:
https://huggingface.co/models?search=coha
"architectures": ["RobertaForMaskedLM"]
Unfortunately, when used with the 'depoeticize' function it replaces almost every word with 'the' (???).
Will stick with the default Bert/Roberta models for now.
Cheers,
I'm not sure why you'd get that error while finetuning. I finetuned BERT on the Poetry Foundation Corpus, which I cleaned and reformatted using the clean_pofo_corpus.py script that's in the repo. Here is the command I used to finetune, using the script included with the transformers package:
python ../transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=pofo \
    --model_type=bert \
    --model_name_or_path=bert-base-uncased \
    --do_train \
    --train_data_file=pofo.corpus \
    --mlm
I added some new visualization options (see the outfile parameter of depoeticize). I'll use this to scope out what might be going on with the COHA model.
Thanks for the update.
Did a bit more digging and the simple solution was using --max_seq_length= (32-512) together with --line_by_line.
The Hugging Face tutorial only mentions --line_by_line.
Before that I stumbled upon using fmt to format the corpus lines to a uniform size. That worked, but --max_seq_length= allows for any corpus line size.
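Putting the two flags together with a fine-tuning invocation would look something like this. This is a sketch, not a verified command: the output_dir, model, and corpus names are placeholders, and the exact flag names differ between transformers versions (the older run_language_modeling.py script uses --block_size rather than --max_seq_length).

```shell
# Hypothetical fine-tuning command combining --line_by_line with a
# capped sequence length, so no tokenized line exceeds 512 positions.
python ../transformers/examples/language-modeling/run_language_modeling.py \
    --output_dir=roberta-finetuned \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=corpus.txt \
    --mlm \
    --line_by_line \
    --max_seq_length=512
```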
Will check out the updated code.
Enjoy the holidays
Digging deeper into your repo and checking all the options.
Parody process works fine with topic="Seed sentence here ..." and outputs to terminal and file, but the default topic=None hangs the process.
nvidia-smi shows the same GPU memory use, but nothing appears in the terminal.
Attempted fix:
Removed topic=None and topic_prefix="" from def parody
Then edited the following:
if topic:
    toks1 = tokenizer.tokenize("{0} {1} {2}. ".format(eos_token, topic_prefix, topic))
else:
    toks1 = [eos_token]
start = len(toks1)
to:
toks1 = [eos_token]
start = len(toks1)
Same result. Process appears to hang.
Using Ubuntu 18.04 and Python 3.6.9.
Would appreciate any suggestions. Will test your updated repo.
Cheers,