@LWprogramming oh, so the latest transformers literature actually finds dropout isn't that useful past a certain scale, which is why i keep those rates at 0, but still have the logic in there in case some traditionalists want to turn it on
i've already incorporated the best kind of structural dropout for autoregressive transformers!
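For concreteness, here is a minimal sketch of that "keep the rate at 0 but keep the logic" pattern, assuming a PyTorch-style feedforward sublayer; the names (`FeedForward`, `dropout`) are illustrative, not taken from the actual code under discussion:

```python
# A minimal sketch (assumptions, not the thread author's actual code) of
# keeping dropout wired through a transformer sublayer while defaulting
# its rate to 0, so it is a no-op unless explicitly enabled.
import torch
from torch import nn

class FeedForward(nn.Module):
    def __init__(self, dim, mult=4, dropout=0.):  # rate kept at 0 by default
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult),
            nn.GELU(),
            nn.Dropout(dropout),  # identity when dropout == 0.
            nn.Linear(dim * mult, dim),
        )

    def forward(self, x):
        return self.net(x)

ff = FeedForward(dim=512)                      # default: dropout stays off
ff_opt_in = FeedForward(dim=512, dropout=0.1)  # "traditionalists" can turn it on
x = torch.randn(1, 16, 512)
out = ff(x)  # shape (1, 16, 512)
```

Defaulting the rate to 0 rather than deleting the `nn.Dropout` entirely keeps the constructor signature stable, so enabling regularization later is a one-argument change instead of a code change.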
"Turn on, tune in, drop out" I guess is a bit old, but some of the wisdom still remains. Perhaps we ML practitioners should "Turn on, tune in, and drop out dropout"?