richardshuai opened this issue 1 year ago
@richardshuai, is it possible to share the dataset? I cannot see anything wrong from the log.
Also, I just implemented an alignment mode that considers only the structure and not the amino acids. I recommend using this instead; just add --alignment-type 0 to your clustering command.
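A minimal sketch of what that could look like (the easy-cluster subcommand, input directory, and tmp path are placeholders, not taken from this thread):

```bash
# Sketch only: structure-only clustering, ignoring amino-acid identities.
# "pdbs/", "clu_res", and "tmp/" are placeholder names.
foldseek easy-cluster pdbs/ clu_res tmp/ --alignment-type 0
```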
Thank you for looking into this and for adding the structure-only clustering option. While I am still getting the error with this option, it is definitely more convenient. Sorry for the late response on the dataset; I had to figure out a way of uploading it.
The exact dataset I'm using is now available on Zenodo here. It is a dataset of a little over 1.3 million antibody structure predictions, with each PDB containing just the backbone atoms of the 6 CDR loops (plus some anchor residues from the framework at the ends of each CDR). The PDBs are split across 14 .tar.gz files of 100K PDBs each (the last one has fewer). Let me know if you need more information about the dataset and I am happy to provide it. So far, -s 4.0 seems to work without segfaulting and clusters these PDBs well, but I'd prefer to be able to run at higher sensitivities as well. Thank you!
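For reference, a sketch of how the archives could be unpacked into a single input directory and clustered at the sensitivity that currently works (archive names, directory names, and the easy-cluster form are placeholders, not from this thread):

```bash
# Unpack the 14 .tar.gz archives into one input directory (placeholder file names).
mkdir -p pdbs
for f in cdr_backbones_*.tar.gz; do
    tar -xzf "$f" -C pdbs/
done

# -s 4.0 completes without segfaulting on this dataset; higher sensitivities would be preferable.
foldseek easy-cluster pdbs/ clu_res tmp/ --alignment-type 0 -s 4.0
```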
I am attempting to cluster a dataset of ~1.3M backbone-only structures (so all residues are "glycine"), each about 70 residues long. I have made sure all PDBs are well-formed (i.e. non-empty, with all 4 backbone atoms present in each residue). I'm running foldseek cluster with --similarity-type 1, --tmscore_threshold 0.99, -c 0.99, and --cluster_reassign; k is automatically determined to be 6.
I'm not exactly sure of the reason, but it seems like whenever the prefiltering step encounters a large number of k-mers, it leads to a segfault (Error: Prefilter step 1 died). Using -s 1.0 keeps the number of entries smaller, and I am able to cluster successfully without running into any segfaults. Furthermore, using the default sensitivity on fewer PDBs (~100K) also succeeds; from testing different subsets of my dataset, the segfault seems to depend purely on dataset size. Is there a workaround that would let me keep the sensitivity of the prefiltering step without running into segfaults?
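For concreteness, a sketch of the invocation assembled from these flags (the easy-cluster form and the paths are placeholders; the flag spellings are copied verbatim from the description above):

```bash
# Segfaults during "Prefilter step 1" at default sensitivity on the full ~1.3M-structure set.
foldseek easy-cluster pdbs/ clu_res tmp/ \
    --similarity-type 1 --tmscore_threshold 0.99 -c 0.99 --cluster_reassign

# Lowering the prefilter sensitivity to 1.0 avoids the segfault.
foldseek easy-cluster pdbs/ clu_res tmp/ \
    --similarity-type 1 --tmscore_threshold 0.99 -c 0.99 --cluster_reassign -s 1.0
```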
I have tried running the same command with 1024GB of RAM and with 8TB of disk space for tmp, but the same error occurs.
Here is the full output of the segfaulting command: