Closed BenoitDalFerro closed 1 year ago
@wl2776 you already fixed it in 8336807; I can build with a local copy, so my hunch is that the branch has not been merged to main (commit #106), hence the pip release has not been updated downstream.
Although the code compiles now, I cannot confirm that everything is working.
Two of the unit tests are failing, and this failure means that something is wrong either in the C++/CUDA code or somewhere on the boundary between Python and native code.
@wl2776 thanks, any idea when this might be fixed: days or weeks?
Sorry, no ideas.
You are welcome to contribute.
Actually, only one unit test is failing: GPU clustering. And I don't know whether GPU clustering was working before.
I think this module can be disabled, since CPU clustering also exists.
@wl2776 whilst I work on my end, would you mind giving me a few more specifics about your idea: the file and lines involved?
The failing unit test is this one: https://github.com/wl2776/fast-transformers/blob/master/tests/clustering/hamming/test_cluster_gpu.py#L48
When `bits == 32`, the value of `centroids` is `[-1, 0]` instead of `[0, 2**32 - 1]`.
It calls this code: https://github.com/wl2776/fast-transformers/blob/master/fast_transformers/clustering/hamming/cluster_cuda.cu
I have added `print(centroids)` before the `self.assertEqual()` in that test. The CUDA code returns `[-1, 0]`.
The issue might be a wrong signed-unsigned type cast.
@wl2776 why is this commented out ?
// initialize the centroids
//centroids.view({-1, K}) = hashes.view({-1, L}).narrow(1, 0, K);
in the `cluster` function
This comment was already in the original repo.
Could you point to the original repo so I can diff it?
We are here. This is the original repo: https://github.com/idiap/fast-transformers/
I've added `printf` inside the CUDA code.
Here is the diff (sorry, I currently have some issues with pushing here):
diff --git a/fast_transformers/clustering/hamming/cluster_cuda.cu b/fast_transformers/clustering/hamming/cluster_cuda.cu
index 5549aec..950fde5 100644
--- a/fast_transformers/clustering/hamming/cluster_cuda.cu
+++ b/fast_transformers/clustering/hamming/cluster_cuda.cu
@@ -4,6 +4,8 @@
// Apoorv Vyas <avyas@idiap.ch>
//
+#include <stdio.h>
+#include <inttypes.h>
#include <curand.h>
#include <curand_kernel.h>
#include <torch/extension.h>
@@ -405,6 +407,7 @@ void cluster(
counts,
n_iterations
);
+ printf("Centroids %" PRIx64 " %" PRIx64 "\n", centroids[0][0][0].item<int64_t>(), centroids[0][0][1].item<int64_t>());
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
And here is the result:
Centroids 3fffffff 0
Centroids 0 7fffffff
Centroids ffffffffffffffff 0
FCentroids ff 0
.Centroids 20 80
.Centroids ff 0
.
======================================================================
FAIL: test_long_clusters (clustering.hamming.test_cluster_gpu.TestClusterGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\work\robot_simulation\fast-transformers\tests\clustering\hamming\test_cluster_gpu.py", line 48, in test_long_clusters
self.assertEqual(
AssertionError: Tuples differ: (-1, 0) != (0, 4294967295)
First differing element 0:
-1
0
- (-1, 0)
+ (0, 4294967295)
----------------------------------------------------------------------
Ran 4 tests in 1.292s
FAILED (failures=1)
So, the result is correct inside the CUDA kernel, but Python has received wrong values.
EDIT: Wrong. CUDA should have 0xFFFFFFFF, but it has 0xffffffffffffffff.
@wl2776 "Conversion of a signed type (or any type) to an unsigned type always takes place via reduction modulo one plus the max value of the destination type." So yeah, you're right, this is sign casting; let's hunt this down.
C99 standard, 6.3.1.3 Signed and unsigned integers:
1. When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2. Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type. [49]
3. Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
[49] The rules describe arithmetic on the mathematical value, not the value of a given type of expression.
@wl2776 I saw nothing suspicious in the CUDA code itself, but `bits` (mapping to `B`) and `10` (mapping to `n_iterations`) are passed on to CUDA as regular Python `int`s:
https://github.com/wl2776/fast-transformers/blob/master/tests/clustering/hamming/test_cluster_gpu.py#L45
and cast as C++ `int` here:
https://github.com/wl2776/fast-transformers/blob/master/fast_transformers/clustering/hamming/cluster_cuda.cu#387
and here:
https://github.com/wl2776/fast-transformers/blob/master/fast_transformers/clustering/hamming/cluster_cuda.cu#388
Those are the formal unsigned occurrences as per the code.
@BenoitDalFerro Couldn't the problem be in line 228 of cuda_clustering.cu?
I'm currently traveling on the metro and typing from my phone; sorry for the bad message formatting. 😅
@wl2776 so
mean_k = mean_k | (1LL << i);
instead of
mean_k = mean_k | (1L << i);
What about the same pattern two lines above, at L225: should it be
(curand(state + k) & 1LL);
instead of
(curand(state + k) & 1L);
?
The point is that there are other instances of 1L (long, only 32 bits on MSVC) that could be converted to 1LL (long long, 64 bits) for safety.
@wl2776 I'm not authorized to push a fix to the repo.
@BenoitDalFerro, I've fixed this issue in my fork. Will submit PR later. For now, you can clone my fork and build it from sources.
@wl2776 very nice, it compiles cleanly and instantiates likewise.
> @BenoitDalFerro, I've fixed this issue in my fork. Will submit PR later. For now, you can clone my fork and build it from sources.

I cannot find your fork, what happened?

I want to register it officially with my management and therefore have to complete some formal paperwork.

Could you temporarily open your fork? I would greatly appreciate it. Thanks~
Hi,
Thanks, yes, I can do that later today; sorry, I'm busy.
Best regards,
Benoit
There is another failure, with a CUDA dot product if I remember correctly; I'll file an issue.
> Could you temporarily open your fork? I would greatly appreciate it. Thanks~
I've opened it and submitted a PR. BTW, GitHub doesn't allow hiding forks of public repos, so I actually deleted it and then created it again.
Windows 11 system, VS 2022
`pip install --user pytorch_fast_transformers`