KaidDuong opened this issue 4 months ago
I'm also finding that the tltorch tensorized layers are significantly slower than the standard torch fully connected layers. Did you ever find a solution to this issue?
Which implementation do you use, reconstructed or factorized? Reconstructed should take about the same time as the unfactorized version. The factorized version will depend significantly on your specific use case: the size of your matrix, the size of the factorization, etc.
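Roughly, the difference between the two code paths looks like this. This is a minimal sketch using a plain low-rank matrix factorization W ≈ U V for illustration, not tltorch's actual tensorized Tucker contraction, and all sizes are made up:

import torch

batch, d_in, d_out, rank = 4, 1024, 512, 64  # illustrative sizes only
x = torch.randn(batch, d_in)
U = torch.randn(d_out, rank)  # stand-ins for the learned factors
V = torch.randn(rank, d_in)

# "reconstructed": rebuild the full weight, then one big matmul,
# essentially the same per-call cost as a plain nn.Linear
W = U @ V                    # (d_out, d_in)
y_reconstructed = x @ W.t()

# "factorized": contract the input with the factors directly,
# two smaller matmuls; whether this wins depends on the batch size,
# the layer sizes, and the rank
y_factorized = (x @ V.t()) @ U.t()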
I measured the runtime of both the factorized and reconstructed implementations, but inference is still slower than the plain linear layer:
import tltorch
import torch
from torch.profiler import profile, record_function, ProfilerActivity

data = torch.randn((4, 1024), dtype=torch.float32)

# Baseline dense layer: 1024 -> 512
linear = torch.nn.Linear(1024, 512)

# Tensorized versions: 1024 = 4**5 input features, 512 = 2 * 4**4 output features
fact_linear_w_factorized = tltorch.FactorizedLinear.from_linear(
    linear, auto_tensorize=False,
    in_tensorized_features=(4, 4, 4, 4, 4), out_tensorized_features=(2, 4, 4, 4, 4),
    rank=0.1, factorization="tucker", implementation="factorized")
fact_linear_w_reconstructed = tltorch.FactorizedLinear.from_linear(
    linear, auto_tensorize=False,
    in_tensorized_features=(4, 4, 4, 4, 4), out_tensorized_features=(2, 4, 4, 4, 4),
    rank=0.1, factorization="tucker", implementation="reconstructed")

data = data.to("cuda")
linear = linear.to("cuda")
fact_linear_w_factorized = fact_linear_w_factorized.to("cuda")
fact_linear_w_reconstructed = fact_linear_w_reconstructed.to("cuda")
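As a cross-check on the %%timeit numbers below, one can also time with CUDA events, which synchronize explicitly so asynchronous kernel launches don't skew the measurement. A minimal sketch, not part of the original benchmark:

# Cross-check with CUDA events (explicit synchronization)
def timed_ms(fn, iters=1000):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(iters):
            fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

# e.g. timed_ms(lambda: fact_linear_w_factorized(data))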
%%timeit
with torch.no_grad():
    fact_linear_w_reconstructed(data)
=> 880 µs ± 8.69 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
with torch.no_grad():
    fact_linear_w_factorized(data)
=> 2.74 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
with torch.no_grad():
    linear(data)
=> 38.7 µs ± 946 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
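Since torch.profiler is already imported above, a quick way to see where the factorized forward actually spends its time is a sketch like the following; the "factorized_forward" label is arbitrary:

# Profile one factorized forward pass to see which ops dominate
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("factorized_forward"):
        with torch.no_grad():
            fact_linear_w_factorized(data)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

At this scale (batch 4, 1024 -> 512) the dense matmul takes only tens of microseconds, so the per-op overhead of the extra contractions in the factorized path can plausibly dominate; presumably the factorized implementation only pays off for much larger layers or ranks small enough to cut real FLOPs.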