On line 95 of casual_linear_attention.py there is a normalizing constant computed by looking at all queries and keys. This creates a dependency on future values which would be unknown in a real prediction task.
This is not the case. Since we compute the normalizing constant using the cumsum of K, we are actually using information up to the current time step and not from the future.
On line 95 of casual_linear_attention.py there is a normalizing constant computed by looking at all queries and keys. This creates a dependency on future values which would be unknown in a real prediction task.