lucidrains / routing-transformer

Fully featured implementation of Routing Transformer
MIT License

Why doesn't AutoregressiveWrapper sum the encoder aux loss? #9

Closed tomweingarten closed 4 years ago

tomweingarten commented 4 years ago

Sorry if this is a dumb question, but I couldn't find a good explanation. The auxiliary loss of the decoder is summed with the cross-entropy loss and returned for back-propagation, while the auxiliary loss of the encoder is just thrown away. What's the rationale for that? Thanks!
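To make sure I'm reading it right, here's roughly the pattern I mean (just an illustrative sketch; the names here are mine and don't match the wrapper's actual internals):

```python
import torch.nn.functional as F

# Illustrative sketch only -- not the library's actual code or identifiers.
def training_step(model, src, tgt):
    # encoder / decoder each return outputs plus an auxiliary (commitment) loss
    enc_out, enc_aux_loss = model.encoder(src)
    logits, dec_aux_loss = model.decoder(tgt[:, :-1], context=enc_out)

    ce_loss = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq) as cross_entropy expects
        tgt[:, 1:]
    )

    # current behaviour: only the decoder's aux loss is added
    loss = ce_loss + dec_aux_loss
    # question: why not also `+ enc_aux_loss`?
    return loss
```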

lucidrains commented 4 years ago

@tomweingarten Hi Tom! I can't believe you spotted this! The reason is that there is an outstanding bug I couldn't solve. It occurs in a specific edge case when reversibility is turned on for the decoder (the encoder is fine). For some reason, summing the auxiliary losses from both the encoder and the decoder breaks things. Feel free to leave this open until I (or maybe you?) resolve it!

lucidrains commented 4 years ago

@tomweingarten Are you using reversibility in your decoder? If not, I could push a new version where both auxiliary losses are summed when the decoder is not using reversibility. I found that the networks still converge even without the encoder commitment loss, so I figured I would just omit it.

lucidrains commented 4 years ago

@tomweingarten I just pushed a new version where the encoder auxiliary loss is added in the case that the decoder is not reversible. I also just realized that, since I recently added mixture of experts (which needs an auxiliary loss), I may have to disable reversibility in the decoder altogether until this is fixed.
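Roughly, the new behaviour amounts to something like this (an illustrative sketch only, with made-up names, not the actual wrapper code):

```python
def combine_losses(ce_loss, dec_aux_loss, enc_aux_loss, decoder_reversible):
    # always include the decoder's auxiliary (routing / MoE) loss
    loss = ce_loss + dec_aux_loss
    # fold in the encoder's auxiliary loss only when the decoder is
    # *not* reversible, to avoid the reversibility bug described above
    if not decoder_reversible:
        loss = loss + enc_aux_loss
    return loss
```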

lucidrains commented 4 years ago

also, what are you training RT on? 🤔

lucidrains commented 4 years ago

@tomweingarten I discovered a semi-reasonable solution to the problem! https://github.com/lucidrains/routing-transformer/commit/93ec372172b2c0d4547d4855d74577ea5b41de82

lucidrains commented 4 years ago

Closing because a solution has been found, even if it's not the cleanest.

tomweingarten commented 4 years ago

Thanks for the fix and sorry for my slow reply! I'll email you with some more details. I'm mostly interested in biological uses for Transformers, but for fun I've been playing around with a time series that has mixed inputs and outputs.

lucidrains commented 4 years ago

@tomweingarten yea! I think sparse attention is perfect for bioinformatics!