s175573 / GIANA

Ultrafast TCR clustering algorithm based on geometric isometry

Memory error #2

Open Albert-Shuai opened 2 years ago

Albert-Shuai commented 2 years ago

Hi Dr. Bo Li:

I am trying to use GIANA-4.1 to cluster around 3,000,000 sequences, without VDJ gene information. I requested 128 GB of memory, but it still fails with:

Traceback (most recent call last):
  File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 1255, in <module>
    main()
  File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 1251, in main
    EncodeRepertoire(ff, OutDir, OutFile, ST=ST, thr_s=thr_s, thr_v=thr_v, exact=EE,VDict=VScore, Vgene=VV, thr_iso=cutoff, gap=Gap, GPU=GPU, Mat=Mat, verbose=verbose)
  File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 919, in EncodeRepertoire
    SSGnew=UpdateSSG(SSG, CDR3s, tmpVgenes, Vscore=VDict, cutoff=thr_s+4)
  File "/users/sli1/GIANA-4.1/GIANA4.1.py", line 548, in UpdateSSG
    N=len(list(chain(*list(SSG.values()))))
MemoryError

In the paper you mention that GIANA is fast when clustering on the order of 10^7 sequences, so may I ask how much memory I should request for my task? Thanks!
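
The last frame of the traceback fails while counting elements: `len(list(chain(*list(SSG.values()))))` builds a flattened copy of every value only to measure its length. A minimal sketch of a count that avoids that copy, assuming `SSG` is a dict whose values are sized containers such as lists (the traceback does not show this). The helper name `count_flat` and the toy data are hypothetical, not part of GIANA:

```python
from itertools import chain


def count_flat(ssg):
    """Count all items across the dict's values without building a flat list."""
    return sum(len(members) for members in ssg.values())


# Hypothetical toy stand-in for SSG; the real structure in GIANA4.1.py may differ.
toy_ssg = {0: [1, 2], 1: [0], 2: [0, 3, 4]}

# The expression from the traceback and the streamed count give the same number,
# but the streamed version never materialises the flattened list.
n_original = len(list(chain(*list(toy_ssg.values()))))
n_streamed = count_flat(toy_ssg)
print(n_original, n_streamed)
```

This removes only one temporary allocation. If `SSG` itself is close to the memory limit, the job may still fail elsewhere.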

davidcoffey commented 9 months ago

I am also getting a "MemoryError" when trying to run GIANA4.1.py on a file with 1,744,942 sequences. My file does contain V gene names in IMGT format. I am running the job on a Linux cluster using SLURM resource manager and have allocated 512 GB of RAM and 12 CPUs. Jobs with fewer sequences run successfully. The tutorial data file also runs successfully. I have tried running it with and without non-exact mode (-e).

python GIANA4.1.py -b -e -f $FILE

Creating CDR3 list
---Process CDR3s with length 18 ---
 Performing CDR3 encoding
 The number of sequences is 1744942
 Done! Total time elapsed 95.277371
==========================================================================
type I break
     Handling identical CDR3 groups
 Done! Total time elapsed 10224.895391
     Matching variable genes
Traceback (most recent call last):
  File "GIANA4.1.py", line 1257, in <module>
    main()
  File "GIANA4.1.py", line 1253, in main
    EncodeRepertoire(ff, OutDir, OutFile, ST=ST, thr_s=thr_s, thr_v=thr_v, exact=EE,VDict=VScore, Vgene=VV, thr_iso=cutoff, gap=Gap, GPU=GPU, Mat=Mat, verbose=verbose)
  File "GIANA4.1.py", line 852, in EncodeRepertoire
    vCL=IdentifyMotifCluster(sMat)
  File "GIANA4.1.py", line 598, in IdentifyMotifCluster
    STACK=dfs(SSG,ii)
  File "GIANA4.1.py", line 581, in dfs
    stack.extend(set(graph[vertex]) - visited)
MemoryError
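
The final frame is inside a depth-first search, at `stack.extend(set(graph[vertex]) - visited)`. If vertices are marked as visited only when they are popped (an assumption, since the traceback does not show the rest of `dfs`), the same vertex can be pushed many times and the stack can grow with the number of edges rather than the number of vertices. A sketch of a variant that marks vertices when they are pushed, so each one enters the stack at most once. `connected_component` and the toy graph are illustrative names, not GIANA code:

```python
def connected_component(graph, start):
    """Return the set of vertices reachable from start (iterative DFS)."""
    visited = {start}
    stack = [start]
    while stack:
        vertex = stack.pop()
        for neighbour in graph[vertex]:
            if neighbour not in visited:
                # Mark on push, so a vertex is never on the stack twice.
                visited.add(neighbour)
                stack.append(neighbour)
    return visited


# Hypothetical adjacency lists: two components, {0, 1, 2} and {3, 4}.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
print(connected_component(toy_graph, 0))
print(connected_component(toy_graph, 3))
```

This bounds the stack by the number of vertices, but it does not shrink the graph itself, which may be the larger cost for 1.7M sequences.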

s175573 commented 9 months ago

Could you please confirm that your input contains only 17K sequences? According to GIANA's output, there are 1.7M sequences of length 18 alone.

Best, Bo

On Saturday, February 10, 2024 at 9:48 PM, David Coffey, MD wrote:

> I am also getting a "MemoryError" when trying to run GIANA4.1.py on a file with 17,194 sequences.

davidcoffey commented 9 months ago

Yes, that was a typo. In the example above, I have 1.7M sequences, each 18 amino acids in length. Since I noticed that clustering is done on equal-length CDR3 sequences, I chose to split my entire dataset into separate files, each containing sequences of the same length, and am running each file in parallel (I have 65M sequences in total). I did attempt to run the entire dataset through GIANA with 1.5 TB of allocated memory, and after 3 weeks there has been no progress.
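
The per-length split described above can be done in a single streaming pass. A minimal sketch, assuming a tab-delimited input with no header and the CDR3 amino acid sequence in the first column. The function name, the output file naming, and the toy sequences are all made up for illustration:

```python
import os
import tempfile


def split_by_cdr3_length(in_path, out_dir, sep="\t"):
    """Stream in_path and append each row to a file named after its CDR3 length."""
    handles = {}
    try:
        with open(in_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                cdr3 = line.rstrip("\n").split(sep)[0]
                n = len(cdr3)
                if n not in handles:
                    out_path = os.path.join(out_dir, f"len_{n}.txt")
                    handles[n] = open(out_path, "w")
                handles[n].write(line if line.endswith("\n") else line + "\n")
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)


# Toy demonstration with made-up rows (CDR3, V gene).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "toy.txt")
    with open(src_path, "w") as fh:
        fh.write("CASSLGQGNTEAFF\tTRBV7-9*01\n"
                 "CASSPGTGGYEQYF\tTRBV5-1*01\n"
                 "CASSLAPGATNEKLFF\tTRBV7-2*01\n")
    lengths = split_by_cdr3_length(src_path, tmp)
    print(lengths)
```

Only one file handle per distinct length stays open, so memory use does not depend on the number of sequences. Each output file can then be passed to GIANA as a separate job.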