adrian-spataru opened this issue 3 years ago
For the optimizer, I was lazy and used rectified AdaBelief, since I can never get Adam tuned correctly for transformers. I used the parameters recommended by the author: https://github.com/juntang-zhuang/Adabelief-Optimizer
I had good results with Ranger in the past (GPT-like transformers), but AdaBelief seems to work better. (I guess you could easily add GC and Lookahead to AdaBelief if that would be useful.)
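For anyone who wants to reproduce this, here is a minimal sketch of wiring up rectified AdaBelief from the adabelief-pytorch package. The exact hyperparameter values are my assumptions based on the author's recommendation table, not necessarily the ones used in these runs, so verify them against the repo above.

```python
# Minimal sketch, assuming the adabelief-pytorch package; hyperparameter values
# are assumptions taken from the author's recommendations, not from this thread.
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(512, 512)   # stand-in for the actual transformer

optimizer = AdaBelief(
    model.parameters(),
    lr = 1e-4,
    betas = (0.9, 0.999),
    eps = 1e-16,             # the author stresses the small eps for AdaBelief
    weight_decay = 1e-2,     # assumed value
    weight_decouple = True,  # AdamW-style decoupled weight decay
    rectify = True,          # the "rectified" variant mentioned above
)
```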
@adrian-spataru If these runs were from more than a day or two ago, you should rerun them because an internet stranger pointed out bugs in the feedforward GLU (my bad)
@adrian-spataru otherwise, thanks for sharing your results :)
Ok, I will rerun them.
@lucidrains I guess this would be the place to post feedback and results, so I will do so shortly as well.
Great job on x-transformers! I am getting good results as well. Plus I really appreciate that it is your original work and that it is not made by Evil Incs :)
Thank you.
Will update shortly.
PS. If you can, please enable Discussions here for non-Issue threads. It's a new GitHub feature that is helpful. Thanks.
@lucidrains So here are my preliminary results and assessment of your x-transformers. I am mostly interested in music applications, but in my experience, if it can play music, it will do great with other tasks, like NLP and such.
Config used:
```python
NUM_BATCHES = int(1e5)
BATCH_SIZE = 6
GRADIENT_ACCUMULATE_EVERY = 4
LEARNING_RATE = 1e-4
VALIDATE_EVERY = 100
GENERATE_EVERY = 500
GENERATE_LENGTH = 2048
SEQ_LEN = 2048
```
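For context, this is roughly how those constants plug into the repo's enwik8-style training script. The vocabulary size, random data generator, and dim/depth/heads values below are placeholders I added so the sketch runs standalone; they are not the configuration behind the results reported here.

```python
# Rough sketch of an enwik8-style training loop using the constants above.
# NUM_TOKENS, cycle(), and the dim/depth/heads values are placeholders.
import torch
from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper

NUM_TOKENS = 256   # placeholder vocabulary size

model = AutoregressiveWrapper(TransformerWrapper(
    num_tokens = NUM_TOKENS,
    max_seq_len = SEQ_LEN,
    attn_layers = Decoder(dim = 512, depth = 6, heads = 8)
))

optim = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)

def cycle():
    # stand-in for a real dataloader; the wrapper expects SEQ_LEN + 1 tokens
    # per sample and shifts inputs/targets internally
    while True:
        yield torch.randint(0, NUM_TOKENS, (BATCH_SIZE, SEQ_LEN + 1))

train_loader = cycle()

for i in range(NUM_BATCHES):
    model.train()
    for _ in range(GRADIENT_ACCUMULATE_EVERY):
        loss = model(next(train_loader))
        (loss / GRADIENT_ACCUMULATE_EVERY).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optim.step()
    optim.zero_grad()
```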
===============================
```
training:   5%|▍   | 4506/100000 [3:07:14<155:49:50, 5.87s/it]
training loss: 0.2871239483356476
```
===============================
Generation: 2048 tokens @ 0.8 temp.
Approximate time to generate the output: ~30 seconds. Very good IMHO.
================================
Results were very good with music IMHO, especially considering that this is just a vanilla test run. So I will definitely consider turning it up to the max and using it in my workflow/production if it continues to show good results.
Please see the attached samples. They are not cherry-picked, so it's the real deal.
The last 2 samples are one continuation attempt, which is the only thing that has not shown good results yet. Maybe because I did not train it enough. Otherwise, still a pretty decent performance for now IMHO.
============================
Questions:
1) Any suggestions for music AI applications? I will take any advice to improve results.
2) Can you add the caching/generation-speed-up option that was discussed in another thread? That would be very helpful indeed.
3) How do I generate without a primer? Or rather, is there any way to specify a one-token primer? I am using your simple enwik8 example as a codebase.
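As a follow-up to question 3, here is a minimal sketch of what a one-token primer could look like with the AutoregressiveWrapper from the training sketch above. Token id 0 is just an assumed start symbol; substitute whatever your vocabulary uses.

```python
# Hedged sketch: seed generation with a single (assumed) start token and sample
# GENERATE_LENGTH tokens at temperature 0.8, reusing `model` from the sketch above.
import torch

start = torch.zeros((1, 1), dtype = torch.long)   # single-token primer, id 0 assumed
sample = model.generate(start, GENERATE_LENGTH, temperature = 0.8)
print(sample.shape)   # (1, GENERATE_LENGTH) of generated token ids
```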
==============================
Overall, great job! Thank you.
Alex
@adrian-spataru you might be interested in Ranger21, it's an up-to-date version of Ranger that has been tested extensively on transformers. The improvements are significant.
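In case it helps, a hedged sketch of swapping Ranger21 in. As far as I remember, the constructor wants the planned training length up front so it can build its internal warmup/warmdown schedule, but the exact import path and signature should be checked against the Ranger21 repo.

```python
# Hedged sketch only -- verify the import and constructor arguments against
# https://github.com/lessw2020/Ranger21 before relying on this.
import torch
from ranger21 import Ranger21

model = torch.nn.Linear(512, 512)   # stand-in for the actual transformer

optimizer = Ranger21(
    model.parameters(),
    lr = 1e-4,
    num_epochs = 10,                 # placeholder; used for the internal schedule
    num_batches_per_epoch = 1000,    # placeholder
)
```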
I have run some models in the past weeks, all of them encoder-decoder transformers. I am not sure where the right place for feedback like this is, but I'll post it here for now.
Word of caution: my particular use case is not NLP, but a corpus with around 200M tokens and a vocab_size of 1k.
- Led to faster convergence in the beginning, but final performance was slightly worse. (2 runs)
- Took longer to converge and wasn't better. (2 runs)
- Didn't converge for me and went NaN after a while. (2 runs)
- Converged quicker and was better, even when wrongly configured (I used max_distance 128 instead of 512, which is my max_seq_len). For a seq_len of 512, a bucket_size of 64 was better than the default 32. (1 run each; see the config sketch after this list)
- Didn't notice any difference for my use case. (1 run only)
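For concreteness, here is roughly how the relative-position-bias settings from the second-to-last point map onto x-transformers kwargs. A decoder-only wrapper is shown for brevity, dim/depth/heads are placeholders, and I am assuming "bucket_size" corresponds to rel_pos_num_buckets.

```python
# Hedged sketch of the relative position bias configuration discussed above;
# dim/depth/heads are placeholders, decoder-only shown for brevity.
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 1000,                # ~1k vocab, as in the runs above
    max_seq_len = 512,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rel_pos_bias = True,          # T5-style relative position bias
        rel_pos_num_buckets = 64,     # the bucket_size of 64 that beat the default 32
        rel_pos_max_distance = 512    # should match max_seq_len (not 128)
    )
)
```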