`torch.compile` error on `scaled_dot_product_attention` in `transformers.LlamaForCausalLM` when providing `attention_mask`: `RuntimeError: (*bias): last dimension must be contiguous`
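A minimal reproduction sketch of the error above, assuming a CUDA machine with `transformers` installed; the checkpoint name, dtype, and prompt are illustrative placeholders, not details from the original report.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any LlamaForCausalLM variant should exercise the same path.
name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(name)

batch = tokenizer(["hello world"], return_tensors="pt").to("cuda")
compiled = torch.compile(model)

# Supplying attention_mask is what reportedly triggers
# "RuntimeError: (*bias): last dimension must be contiguous"
# inside scaled_dot_product_attention.
out = compiled(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
```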
Hello there! From the DISABLED prefix in this issue title, it looks like you are attempting to disable a job in PyTorch CI. The information I have parsed is below:
When I use `register_full_backward_hook` and `register_forward_hook` on `nn.MultiheadAttention`, MultiheadAttention executes the wrong conditional branch and returns wrong backward gradients.
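A sketch of that setup, assuming an otherwise default `nn.MultiheadAttention`; the no-op hook bodies and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

# Registering the hooks reportedly pushes MultiheadAttention onto a different
# conditional branch, after which backward gradients differ from a hook-free run.
mha.register_forward_hook(lambda module, args, output: None)
mha.register_full_backward_hook(lambda module, grad_input, grad_output: None)

x = torch.randn(2, 4, 16, requires_grad=True)
out, _ = mha(x, x, x)
out.sum().backward()
print(x.grad)  # compare against the same code without the hooks
```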
The "register_module" function got a "key_description=" and reported that "Exception 0xc0000005 encountered at address 0x7ff7fb3dfbf3: Access violation reading location 0xffffffffffffffff"
Failed to load image Python extension: `libc10_cuda.so: cannot open shared object file: No such file or directory` (emitted via `warn(f"Failed to load image Python extension: {e}")`)
The dynamic library produced by `torch._export.aot_compile` causes CUDA illegal memory access errors when run in parallel from multiple threads.
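A hedged sketch of that usage pattern, assuming the private `torch._export.aot_compile`/`torch._export.aot_load` pair available in recent builds; the module, shapes, and thread counts are placeholders.

```python
import threading
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

# Ahead-of-time compile to a shared library, then load it back as a callable.
so_path = torch._export.aot_compile(Toy().cuda(), (torch.randn(8, device="cuda"),))
runner = torch._export.aot_load(so_path, device="cuda")

def worker():
    for _ in range(100):
        # Concurrent calls into the loaded library are what reportedly
        # trigger the illegal memory access.
        runner(torch.randn(8, device="cuda"))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```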
Issue Metrics
- `torch.compile` error on `scaled_dot_product_attention` in `transformers.LlamaForCausalLM` when providing `attention_mask`: `RuntimeError: (*bias): last dimension must be contiguous`
- `magma-cuda`
- `c10::ArrayRef::ArrayRef(const Container&)` SFINAE bug
- `torch-2.6.0.dev20241029+cu121`
- `torch.fft.rfft`
- `torch.package` warning -- `TypedStorage` is deprecated
- `transforms` argument in `torchvision.transforms.v2.Compose`
- `torch._numpy.ndarray.astype()` does not accept Numpy Dtypes correctly
- `BatchNorm2d` + `torch.compile` with `reduce-overhead` + DDP: `KeyError: 'cls'`
- `pkg_resources` module is deprecated by setuptools
- `torch.compile` has inconsistent numerical precision with eager mode
- `list` object is not callable
- `hostmaster+pytorch`
- `_adjust_num_blocks_and_indices` gives wrong adjusted block mask
- `torch.compile` errors when inputs memory overlap.
- `tensor` not a `FakeTensor` under `FakeTensorMode` and `device('meta')`
- `dynamo_expected_failures` is silent on certain tests that ended up passing
- `torch.distributed.tensor.distribute_module`.
- `attn_mask` in `jagged_scaled_dot_product_attention`
- `test_constant_folding_abi_compatible_cpu` surfaces `CUDA error: invalid argument` on H100
- `DataLoader` probably shouldn't use `fork` by default
- `pytorch/conda-builder` Docker images
- `dp_params` and benchmark results
- `torch.distributed._state_dict_utils._broadcast_tensors` does not properly support CPU tensors.
- `run_decompositions` fails with pytree error on `Llama-3.2-vision`
- `aten.istft`
- `aten::_make_dual`.
- `test_sp24_compile` appears broken on sm80/sm90
- `maybe_mark_dynamic` causes max recursion error when used with compile during tensordict consolidation
- `linux.aws.a100` on `inductor-perf-compare.yml`
- `linux-binary-manywheel / manywheel-py3_10-xpu-build / build` is currently broken
- `ts_convert_method_to_trt_engine` function
- `torch.std_mean` returns `NaN` as mean of an `inf` array.
- `*i8`
- `torch.special.zeta` ignores `nan` input when `other=-inf`.
- `torch.__config__.show()` silently initialises CUDA (?); forked processes fail with uninitialised CUDA
- `trunc` or `floor`
- `_amp_foreach_non_finite_check_and_unscale_` can be torch.compiled inside torch.amp, but not in identical code outside it
- `dtype` promotion of `out=` functions on meta inputs not consistent.
- `out=` meta device support.
- `torch.nn.functional.softshrink` returns 0 on `NaN` input.
- `aten::_trilinear`
- `nanmean`
- `torch.package` bug
- `cpuinfo` error on import
- `(*bias): last dimension must be contiguous` when running compiled SDPA on length 1 tensors
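For the last item, a hedged repro sketch: the length-1 sequence is the reported trigger, while the shapes, dtype, and mask contents are guesses.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, mask):
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

compiled = torch.compile(attend)

# Shapes are (batch, heads, seq, head_dim) with sequence length 1.
q = torch.randn(1, 2, 1, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 2, 1, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 2, 1, 64, device="cuda", dtype=torch.float16)
mask = torch.zeros(1, 2, 1, 1, device="cuda", dtype=torch.float16)

# Reportedly raises: RuntimeError: (*bias): last dimension must be contiguous
out = compiled(q, k, v, mask)
```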