lucidrains / x-transformers

A concise but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

My Experience with X-Transformers #12

adrian-spataru opened this issue 3 years ago

adrian-spataru commented 3 years ago

I have run some models over the past few weeks, all of them encoder-decoder transformers. I am not sure where the right place to post this kind of feedback is, so I'll write it here for now.

Word of caution: my particular use case is not NLP, but it is a corpus with around 200M tokens and a vocab_size of 1k.

Transformers Without Tears

Researchers have shared with me that this leads to faster convergence. It did lead to faster convergence in the beginning, but final performance was slightly worse. (Ran 2 runs)

GLU Variants Improve Transformer

Took longer to converge and wasn't better. (Ran 2 runs)

ReZero Is All You Need

Didn't converge for me; the loss became NaN after a while. (Ran 2 runs)

T5's Simplified Relative Positional Encoding

Converged quicker and was better, even when wrongly configured (I used max_distance 128 instead of 512, which is my max_seq_len). For a seq_len of 512, a bucket size of 64 was better than the default of 32. (One run each)

Talking-Heads Attention

Didn't notice any difference for my use case. (1 run only)
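
For anyone wanting to reproduce these runs, here is roughly how the features above map onto x-transformers settings. This is a minimal sketch: the flag names are taken from the x-transformers README (double-check them against your installed version), and the model dimensions are placeholders rather than the exact ones used in these runs.

```python
# Minimal sketch of toggling the features above in x-transformers.
# Flag names follow the x-transformers README; model dimensions are
# placeholders, not the exact values used in the runs reported here.
import torch
from x_transformers import XTransformer

model = XTransformer(
    dim = 512,
    enc_num_tokens = 1000,            # ~1k vocab
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 512,
    dec_num_tokens = 1000,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 512,

    # Transformers Without Tears (ScaleNorm)
    enc_use_scalenorm = True,
    dec_use_scalenorm = True,

    # GLU Variants Improve Transformer (gated feedforward)
    enc_ff_glu = True,
    dec_ff_glu = True,

    # ReZero Is All You Need (left off here, since it diverged in the runs above)
    # enc_use_rezero = True,
    # dec_use_rezero = True,

    # T5's simplified relative positional bias
    enc_rel_pos_bias = True,
    enc_rel_pos_num_buckets = 64,     # 64 worked better than the default 32 at seq_len 512
    enc_rel_pos_max_distance = 512,   # should match max_seq_len
    dec_rel_pos_bias = True,
    dec_rel_pos_num_buckets = 64,
    dec_rel_pos_max_distance = 512,

    # Talking-Heads Attention
    enc_attn_talking_heads = True,
    dec_attn_talking_heads = True,
)

src = torch.randint(0, 1000, (1, 512))
tgt = torch.randint(0, 1000, (1, 512))

loss = model(src, tgt)   # XTransformer returns the autoregressive loss directly
loss.backward()
```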

adrian-spataru commented 3 years ago

For the optimizer, I was lazy and used rectified AdaBelief, since I can never get Adam tuned correctly for transformers. I used the parameters recommended by the author: https://github.com/juntang-zhuang/Adabelief-Optimizer

I had good results with Ranger in the past (on GPT-like transformers), but AdaBelief seems to work better. (I guess you could easily add gradient centralization (GC) and Lookahead to AdaBelief if that would be useful.)
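
For reference, a sketch of that optimizer setup, assuming the adabelief_pytorch package from the linked repo. The hyperparameters shown are illustrative; the repo's README lists the values the author actually recommends per architecture.

```python
# Sketch of the optimizer setup described above, using the adabelief_pytorch
# package (pip install adabelief-pytorch). Hyperparameters are illustrative;
# see https://github.com/juntang-zhuang/Adabelief-Optimizer for the
# author's recommended settings.
from adabelief_pytorch import AdaBelief

# `model` is the XTransformer instance from the sketch in the previous comment
optimizer = AdaBelief(
    model.parameters(),
    lr = 1e-4,
    betas = (0.9, 0.999),
    eps = 1e-16,             # AdaBelief uses a much smaller eps than Adam
    weight_decouple = True,  # AdamW-style decoupled weight decay
    rectify = True,          # rectified update, as in RAdam
)
```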

lucidrains commented 3 years ago

@adrian-spataru If these runs were from more than a day or two ago, you should rerun them, because an internet stranger pointed out bugs in the feedforward GLU (my bad)

lucidrains commented 3 years ago

@adrian-spataru otherwise, thanks for sharing your results :)

adrian-spataru commented 3 years ago

> @adrian-spataru If these runs were from more than a day or two ago, you should rerun them, because an internet stranger pointed out bugs in the feedforward GLU (my bad)

Ok, I will rerun them.

asigalov61 commented 3 years ago

@lucidrains I guess this would be a good place to post feedback and results, so I will do so shortly as well.

Great job on x-transformers! I am getting good results as well. Plus, I really appreciate that it is your original work and not made by Evil Incs :)

Thank you.

Will update shortly.

PS. If you can, please enable Discussions here for non-issue threads. It's a new GitHub feature that would be helpful. Thanks.

asigalov61 commented 3 years ago

@lucidrains So here are my preliminary results and assessment of your x-transformers. I am mostly interested in music applications, but in my experience, if it can play music, it will do great on other tasks like NLP.

Config used:

NUM_BATCHES = int(1e5)
BATCH_SIZE = 6
GRADIENT_ACCUMULATE_EVERY = 4
LEARNING_RATE = 1e-4
VALIDATE_EVERY = 100
GENERATE_EVERY = 500
GENERATE_LENGTH = 2048
SEQ_LEN = 2048

===============================

training: 5%|▍ | 4506/100000 [3:07:14<155:49:50, 5.87s/it]training loss: 0.2871239483356476

===============================

Generation: 2048 tokens @ 0.8 temp.

Approximate time to generate the output: ~30 seconds. Very good IMHO!

================================

Results were very good for music IMHO, especially considering that this was just a vanilla test run. So I will definitely consider turning it up to the max and using it in my workflow/production if it keeps showing good results.

Please see the attached samples. They are not cherry-picked, so it's the real deal...

The last two samples are a single continuation attempt, which is the only thing that has not shown good results yet. Maybe because I did not train it enough... Otherwise, still pretty decent performance for now IMHO.

============================

Questions:

1) Any suggestions for music AI applications? I will take anything to improve results...

2) Can you add the caching/generation-speed-up option that was discussed in another thread? This would be very helpful indeed.

3) How do I generate without a primer? Or rather, is there any way to specify a 1-token primer? I am using your simple wiki8 example as a codebase.
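
To make question 3 concrete, something along these lines is what I am after. This is a rough sketch that assumes the AutoregressiveWrapper generate() call from that example; START_TOKEN is a hypothetical start-of-sequence id from my own vocabulary, not something x-transformers defines.

```python
# Rough sketch of what a 1-token primer could look like, based on the
# generate() call in the example script. START_TOKEN is a hypothetical
# start-of-sequence id from my own vocabulary, not defined by x-transformers.
import torch

START_TOKEN = 0          # hypothetical start-of-sequence token id
GENERATE_LENGTH = 2048

# `model` is the AutoregressiveWrapper-wrapped TransformerWrapper from the example
device = next(model.parameters()).device
prime = torch.full((1, 1), START_TOKEN, dtype = torch.long, device = device)  # a single-token primer
sample = model.generate(prime, GENERATE_LENGTH, temperature = 0.8)
```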

==============================

Overall, great job! Thank you.

Alex

Music-XTransformer-Output-MIDI-Samples.zip

nestordemeure commented 3 years ago

> For the optimizer, I was lazy and used rectified AdaBelief, since I can never get Adam tuned correctly for transformers. I used the parameters recommended by the author: https://github.com/juntang-zhuang/Adabelief-Optimizer
>
> I had good results with Ranger in the past (on GPT-like transformers), but AdaBelief seems to work better. (I guess you could easily add gradient centralization (GC) and Lookahead to AdaBelief if that would be useful.)

@adrian-spataru You might be interested in Ranger21; it's an up-to-date version of Ranger that has been tested extensively on transformers. The improvements are significant.