Open Albert-Shuai opened 2 years ago
I am also getting a "MemoryError" when trying to run GIANA4.1.py on a file with 1,744,942 sequences. My file does contain V gene names in IMGT format. I am running the job on a Linux cluster using the SLURM resource manager and have allocated 512 GB of RAM and 12 CPUs. Jobs with fewer sequences run successfully. The tutorial data file also runs successfully. I have tried running it with and without non-exact mode (-e).
python GIANA4.1.py -b -e -f $FILE
Creating CDR3 list
---Process CDR3s with length 18 ---
Performing CDR3 encoding
The number of sequences is 1744942
Done! Total time elapsed 95.277371
==========================================================================
type I break
Handling identical CDR3 groups
Done! Total time elapsed 10224.895391
Matching variable genes
Traceback (most recent call last):
File "GIANA4.1.py", line 1257, in <module>
main()
File "GIANA4.1.py", line 1253, in main
EncodeRepertoire(ff, OutDir, OutFile, ST=ST, thr_s=thr_s, thr_v=thr_v, exact=EE,VDict=VScore, Vgene=VV, thr_iso=cutoff, gap=Gap, GPU=GPU, Mat=Mat, verbose=verbose)
File "GIANA4.1.py", line 852, in EncodeRepertoire
vCL=IdentifyMotifCluster(sMat)
File "GIANA4.1.py", line 598, in IdentifyMotifCluster
STACK=dfs(SSG,ii)
File "GIANA4.1.py", line 581, in dfs
stack.extend(set(graph[vertex]) - visited)
MemoryError
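For readers hitting the same error: the traceback shows only the failing line of `dfs`, not the whole function. Assuming it is a conventional iterative depth-first search that marks a vertex as visited when it is popped, the same vertex can be pushed onto the stack many times in a dense graph, and each step also builds a temporary set of neighbors. The sketch below is a hypothetical alternative (the name `connected_component` and the dict-of-neighbors input are assumptions, not GIANA's code) that marks vertices when they are pushed, so the stack never holds more than one entry per vertex. Whether this is enough to avoid the MemoryError on 1.7M sequences has not been tested.

```python
def connected_component(graph, start):
    """Return the set of vertices reachable from `start`.

    `graph` is assumed to map each vertex to an iterable of its neighbors.
    """
    seen = {start}
    stack = [start]
    while stack:
        vertex = stack.pop()
        for neighbor in graph.get(vertex, ()):
            # Mark on push, so each vertex enters the stack at most once.
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
    print(sorted(connected_component(toy_graph, 0)))
    print(sorted(connected_component(toy_graph, 3)))
```

The result is the same set of reachable vertices as a pop-time-marking DFS; only the peak size of the stack and the number of temporary sets change.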
Could you please make sure that your input contains only 17K sequences? According to the GIANA message, there are 1.7M sequences of length 18 alone.
Best, Bo
From: David Coffey, MD. Date: Saturday, February 10, 2024 at 9:48 PM. To: s175573/GIANA. Subject: Re: [s175573/GIANA] Memory error (Issue #2)
I am also getting a "MemoryError" when trying to run GIANA4.1.py on a file with 17,194 sequences. My file does contain V gene names in IMGT format. I am running the job on a Linux cluster using SLURM resource manager and have allocated 512 GB of RAM and 12 CPUs. Jobs with fewer sequences run successfully. The tutorial data file also runs successfully. I have tried running it with and without non-exact mode (-e).
Yes, that was a typo. In the example above, I have 1.7M sequences, each 18 amino acids in length. Since I noticed that clustering is done on equal-length CDR3 sequences, I split my entire dataset (65M sequences in total) into separate files, one per sequence length, and am running each file in parallel. I also attempted to run the entire dataset through GIANA with 1.5 TB of allocated memory, and after 3 weeks there has been no progress.
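The splitting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is taken to be a tab-delimited file with the CDR3 sequence in the first column and no header row, and the function name `split_by_cdr3_length` and the output file names are hypothetical.

```python
import os


def split_by_cdr3_length(in_path, out_dir, cdr3_col=0, sep="\t"):
    """Write each input line to a per-length file; return the lengths seen."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # CDR3 length -> open output file
    try:
        with open(in_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                cdr3 = line.rstrip("\n").split(sep)[cdr3_col]
                length = len(cdr3)
                if length not in handles:
                    out_path = os.path.join(out_dir, f"len_{length}.tsv")
                    handles[length] = open(out_path, "w")
                handles[length].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

The file is streamed line by line, so memory use stays small regardless of input size. Each resulting file can then be passed as `$FILE` to the command shown earlier in the thread.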
Hi Dr. Bo Li,
I am trying to use GIANA-4.1 to cluster around 3,000,000 sequences, without VDJ gene information. I requested 128 GB of memory, but it still fails with:
Traceback (most recent call last):
File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 1255, in <module>
main()
File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 1251, in main
EncodeRepertoire(ff, OutDir, OutFile, ST=ST, thr_s=thr_s, thr_v=thr_v, exact=EE,VDict=VScore, Vgene=VV, thr_iso=cutoff, gap=Gap, GPU=GPU, Mat=Mat, verbose=verbose)
File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 919, in EncodeRepertoire
SSGnew=UpdateSSG(SSG, CDR3s, tmpVgenes, Vscore=VDict, cutoff=thr_s+4)
File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 548, in UpdateSSG
N=len(list(chain(*list(SSG.values()))))
MemoryError
In the paper, you mention that GIANA is fast at clustering on the scale of 10^7 sequences. Could you tell me how much memory I should request for my task? Thanks!
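A note on the failing line: `N=len(list(chain(*list(SSG.values()))))` builds a flattened list of every value in `SSG` only to take its length. Assuming `SSG` is a dict whose values are lists or sets (an assumption, since only this one line is visible in the traceback), the same count can be computed without materializing that list. The helper name `count_neighbors` below is hypothetical, and this is a sketch rather than a tested patch for GIANA; it only lowers peak memory at this one line and does not say how much memory the whole run needs.

```python
from itertools import chain


def count_neighbors(ssg):
    """Total number of entries across all values of `ssg`."""
    return sum(len(neighbors) for neighbors in ssg.values())


if __name__ == "__main__":
    toy_ssg = {0: [1, 2], 1: [0], 2: []}
    # Same result as the original expression, without the intermediate list.
    assert count_neighbors(toy_ssg) == len(list(chain(*list(toy_ssg.values()))))
    print(count_neighbors(toy_ssg))
```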