taokz / BiomedGPT

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
https://www.nature.com/articles/s41591-024-03185-2
Apache License 2.0
476 stars 53 forks

RuntimeError: CUDA error: device-side assert triggered #14

Open nghiemkythu opened 8 months ago

nghiemkythu commented 8 months ago

Hello, thank you very much for your help in previous questions.

I have successfully pre-trained this model on my own medical dataset. However, during training, the model raises the following error whenever an input sample has a very long answer (when all samples have short answers, the error does not occur):

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

An example of an answer sentence which is long is: "The ratio of lumen coalescence to cell division times controls the monolumen or multilumen phenotypes. (A) We treated cysts with aPKC-PSi to induce polarity disruption and spatially disordered spindle positioning. Cells were seeded in growth medium supplemented with 50 \u00b5M aPKC-PSi, and fixed 3 d later. Confocal scans for cysts treated with aPKC-PSi, which lack control over mitotic spindle positioning; stained for \u03b2-catenin (green), labeling cell membranes; and ZO-1 (white) and PCX/gp135 (red), labeling luminal interfaces. Notice the multilumen morphology. Time from seeding: 3 d. Bar, 10 \u00b5m. (B) 3D reconstructions of the cyst in A from the full set of confocal sections. Several disconnected lumens are visible in the cross-section (right). Bar, 10 \u00b5m. (C) Mean number of luminal volumes and fraction of cysts with single lumen/multiple lumen phenotypes (inset), obtained from numerical simulations as a function of the tightness of the control over orientation of the cleavage plane. \u03d5 = 0\u00b0, mitotic planes always orthogonal to the luminal surface. \u03d5 = 90\u00b0, division planes chosen isotropically in three dimensions. Note that even for \u03d5 = 0\u00b0, it is still possible to get more than one lumen. Simulated cysts have 32 cells. Error bars indicate standard deviations. (D) Numerical simulation of a multiple-lumen phenotype. The cleavage plane has been chosen at random, isotropically in 3D space. Equivalent time from seeding is around 5 d. (E) Mean fraction of cysts with single lumen/multiple lumen phenotypes (red/blue), obtained from numerical simulations as a function of the duplication time \u03c4, for different levels of the control over the orientation of the cleavage plane. 
The dynamics of several cysts were simulated for four rounds of division, and the fraction of normal cysts in the population was measured at the end of the simulation (16 cells), corresponding approximately to day 3 in the experiments. The duplication time is rescaled with \u03c40, which corresponds to normal MDCK cysts (20 degrees of error on spindle positioning and 30% of normal phenotype). For \u03c4/\u03c40 \u226b 1 (cell division slower than control), eventually 100% of the cysts are normal, regardless of cell division orientation. For \u03c4/\u03c40, slightly smaller than unity, the emergence of aberrant phenotypes is strongly increased, even in the case of perfect spindle positioning. (F) We numerically simulated the relaxation of a multiluminal cyst grown with aberrant spindle positioning to the monolumen configuration, with blocked cell divisions. Here we plot the time needed to reach the monolumen rescaled by the division time as a function of the number of cells, ncells. As expected, this increases with the size of the aggregates, as more microlumens are generated. (G) Cartoon illustrating how the process of lumen coalescence can take place with no alteration in the total extent of lateral surfaces, and therefore without changing the energy defined in Eqs. 1 and 2. (H) To show the effect of cell division on lumen coalescence in numerical simulations, we grew cysts starting from a single cell with different division rates. Cysts are grown with normal, i.e., correctly oriented, cell divisions. During time, cells divide and form microlumens that tend to coalesce. Faster cell division rates do not allow lumens to coalesce, independently of polarity. Error bars indicate standard deviations. (I) Aphidicolin treatment rescues the multilumen phenotype induced by aPKC perturbation. To check whether the predictions of our model on the slowdown of cell division were correct, we fixed cysts at day 4 and experimented with several treatments. 
Cysts were grown with aPKC-PSi from seeding (aPKC-PSi), with both aPKC-PSi and Aphidicolin (aPKC-PSi+Aph), or were treated with aPKC-PSi for 2 d, then washed and either left untreated (aPKC-PSi WO d2) or treated with Aphidicolin (aPKC-PSi WO+Aph). Each percentage is obtained by means of the indicated number of independent experiments, for a total number of analyzed cysts reported in the top y axis. aPKC-PSi\u2013treated cysts show a significant decrease in monoluminal cysts (aPKC-PSi [52.2 \u00b1 6.1] vs. control [73.4 \u00b1 3.8]; p-value, 1.5 \u00d7 10\u221212). Quadruplicate experiments of aPKC-PSi treatment and Aphidicolin show a significant shift toward the control situation, i.e., the number of monolumens increases (aPKC-PSi vs. aPKC-PSi+Aph [66.1 \u00b1 1.4]; p-value, 9.8 \u00d7 10\u221210). Washing out aPKC-PSi restores the normal level of mono- versus multiluminal cysts (aPKC-PSi vs. aPKC-PSi WO [67.0 \u00b1 3.7]; p-value, 7.5 \u00d7 10\u22127). Aphidicolin given for 2 d after the washout was similar to the washout alone (aPKC-PSi vs. aPKC-PSi WO+Aph [66.2 \u00b1 4.7]; p-value, 1.5 \u00d7 10\u22125). Statistical significance was assessed by means of a two-tailed Fisher test on 2 \u00d7 2 contingency tables containing mono- and multilumen counts and indicated treatments. Values are indicated as percentage means of the independent experiments \u00b1 standard deviation. N indicates total number of analyzed cysts and n indicates number of biological replicates; n = 2 + 2 indicates two biological replicates and two technical replicates. In the legend, \u201cMixed polarity\u201d indicates cysts with gp135/PCX localized to the exterior surface of the cyst, where the cell membrane contacts the ECM. \u201cLumen\u201d refers to cysts with centrally localized lumens, scored on the basis of the localization of gp135/PCX. 
\u201cAMIS\u201d refers to cysts that exhibit an accumulation of gp135/PCX in vesicular structures that underlie the membrane where the lumen will develop."

I think this happens because the answer is too long and produces an out-of-bounds index into the embedding matrix. I do not know which setting to change to fix this problem, and I hope you can help me.
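One way to test this hypothesis before launching a GPU run is a plain-Python bounds check on each encoded sample. This is only a sketch: `find_out_of_range`, `vocab_size`, and `max_positions` are hypothetical names, not part of BiomedGPT, and the real values would have to be read from the model configuration.

```python
from typing import List, Sequence


def find_out_of_range(
    token_ids: Sequence[int], vocab_size: int, max_positions: int
) -> List[str]:
    """Return human-readable problems for one encoded sample (empty if none)."""
    problems = []
    # Position embeddings are indexed by token position, so length matters.
    if len(token_ids) > max_positions:
        problems.append(
            f"sequence length {len(token_ids)} exceeds max_positions {max_positions}"
        )
    # Token embeddings are indexed by token id: each id must be in [0, vocab_size).
    bad = [(i, t) for i, t in enumerate(token_ids) if not 0 <= t < vocab_size]
    if bad:
        first_pos, first_id = bad[0]
        problems.append(
            f"{len(bad)} token id(s) outside [0, {vocab_size}); "
            f"first is id {first_id} at position {first_pos}"
        )
    return problems
```

Running this over the dataset and printing the samples that return a non-empty list would show whether the failures line up with sequence length, with token ids, or with neither. The error message also suggests setting CUDA_LAUNCH_BLOCKING=1, which should make the reported stack trace point closer to the failing call.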

I am sorry for asking so many questions. Thank you very much for your help with my previous ones.

taokz commented 8 months ago

Hi @nghiemkythu

I'm not certain if I've encountered a similar issue before, but I think the problem may not be related to the length of the input, because I have previously worked with long inputs without issue. Instead, the problem might be due to specific symbols in the input that the model cannot process properly. It would be worthwhile to check for symbols like "\u", "\t" (the tab character), and characters that are incompatible with UTF-8 encoding. Generally, these symbols may not be visible in a text editor. However, by reading the file and printing its contents, you can identify them. I hope this suggestion helps, and I look forward to hearing positive updates from you.
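A minimal sketch of such a check, using only the Python standard library. The helper names `scan_text` and `undecodable_offsets` are hypothetical and not part of the BiomedGPT code; the sketch only reports what it finds and does not decide whether a character is actually harmful to the tokenizer.

```python
import re
import unicodedata
from typing import Dict, List

# A literal backslash-u escape left in the text, e.g. the six characters \u00b5.
LITERAL_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def scan_text(text: str) -> Dict[str, List[str]]:
    """Collect characters and sequences that may deserve a closer look."""
    return {
        "literal_escapes": LITERAL_ESCAPE.findall(text),
        # Unicode category "Cc" covers control characters such as tab and newline.
        "control_chars": sorted(
            {repr(c) for c in text if unicodedata.category(c) == "Cc"}
        ),
        "non_ascii": sorted({c for c in text if ord(c) > 127}),
    }


def undecodable_offsets(raw: bytes) -> List[int]:
    """Return byte offsets at which the data is not valid UTF-8."""
    offsets, pos = [], 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            offsets.append(pos + err.start)
            pos += err.start + 1  # skip the offending byte and keep scanning
    return offsets
```

`scan_text` works on text that has already been decoded, while `undecodable_offsets` is meant for the raw bytes of the file (for example, the result of `open(path, "rb").read()`), since invalid UTF-8 can only be detected before decoding.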

Please don't hesitate to reach out to me; I'm more than happy to assist. Having faced my own challenges in debugging and enhancing performance, I understand how frustrating it can be. I'm hopeful that my experience can be of benefit to your project.

nghiemkythu commented 8 months ago

Hello, I see that when I do not use very long sentences, the model runs well. However, when I use these long sentences, I get either the error "The size of tensor a (1026) must match the size of tensor b (1024) at non-singleton dimension 3" or "CUDA error: device-side assert triggered".

When I change the parameter "length" in the file "ofa_dataset.py" (line 31) to 1022, everything runs well.

I do not know whether I should hard-code the parameter "length" in that file (= 1022). In the original version, this parameter defaults to None.
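The numbers in the two error messages are consistent with one possible explanation, offered here as an assumption rather than a confirmed diagnosis: if the model has 1024 position slots and two special tokens (for example, begin- and end-of-sequence) are added after truncation, a text truncated to 1024 tokens becomes 1026 long, while one truncated to 1022 becomes exactly 1024. The sketch below only illustrates that arithmetic; `truncate_with_specials`, `BOS_ID`, `EOS_ID`, and `MAX_POSITIONS` are hypothetical names and values, not taken from "ofa_dataset.py".

```python
from typing import List

MAX_POSITIONS = 1024  # assumed size of the position-embedding table
BOS_ID, EOS_ID = 0, 2  # hypothetical special-token ids


def truncate_with_specials(token_ids: List[int], length: int) -> List[int]:
    """Truncate the text tokens to `length`, then add BOS and EOS around them."""
    return [BOS_ID] + token_ids[:length] + [EOS_ID]


long_answer = [5] * 3000  # stand-in for a very long tokenized answer

print(len(truncate_with_specials(long_answer, 1024)))  # 1026, two over the limit
print(len(truncate_with_specials(long_answer, 1022)))  # 1024, fits exactly
```

Under that assumption, a truncation length of `MAX_POSITIONS - 2` leaves room for the two special tokens, which would explain why 1022 works. Whether 1022 is the intended setting for this code base is something only the maintainers can confirm.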