pmixer opened 4 years ago
I assume that applying a multiplicative attention mask instead of an additive one is one of the reasons why it trains slower than the original tf implementation (601 epochs instead of 201 epochs). The additive mask is not very good here, especially when we set it to -2^32-1, which is not small enough for true causality; I tried -1e23 or smaller values in https://github.com/pmixer/TiSASRec.pytorch/blob/9ddc21e400254bc352bb2174fd68bc2cf0585c5b/model.py#L67, but training got slower and sometimes even produced NaNs.
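For reference, here is a minimal PyTorch sketch (not the repo's exact code; shapes and names are illustrative) contrasting the two masking styles: an additive mask fills disallowed logits with a large negative constant before the softmax, while a multiplicative mask zeroes out disallowed attention weights after the softmax and renormalizes.

```python
import torch

B, T = 2, 5                                                # illustrative batch size and sequence length
scores = torch.randn(B, T, T)                              # raw attention logits
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))    # allow attending to positions j <= i

# Additive mask: fill future positions with a large negative constant before softmax,
# using the fill value mentioned above (-2^32-1). The issue argues it is not small
# enough for true causality, while much smaller fills (e.g. -1e23) reportedly slowed
# training and sometimes produced NaNs.
attn_add = torch.softmax(scores.masked_fill(~causal, -2.0 ** 32 - 1), dim=-1)

# Multiplicative mask: softmax first, then zero out future positions and renormalize.
# Exactly causal regardless of logit magnitudes, but the extra renormalization
# changes the gradient flow relative to the additive variant.
attn_mul = torch.softmax(scores, dim=-1) * causal
attn_mul = attn_mul / attn_mul.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```

The trade-off described above shows up directly in this sketch: the additive fill value must be negative enough to suppress masked positions, yet not so extreme that fp32 arithmetic becomes unstable, whereas the multiplicative variant enforces causality exactly at the cost of different gradients, which may contribute to the difference in convergence speed.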
Hi Guys,
Thanks for checking out the repo. Since you may still run into problems due to various HW & SW settings, here are a few links to help resolve potential issues:
Stay Healthy, Zan