svalkiers / clusTCR

CDR3 clustering module providing a new method for fast and accurate clustering of large data sets of CDR3 amino acid sequences, and offering functionalities for downstream analysis of clustering results.

Examples in 'Clustering' raise different errors #50

Open ustervbo opened 1 year ago

ustervbo commented 1 year ago

With Python 3.11.4 and the latest conda version of clustcr, clustering with any method other than faiss halts with the error '0-dimensional array given. Array must be at least two-dimensional'.

Also, the example in Clustering/Usage:

clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3)

fails with the error 'Wrong input. Please provide an iterable object containing CDR3 amino acid sequences.' This happens irrespective of the Python version.

It seems that fit() ignores the cdr3_col argument when include_vgene=False, as this works:

clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3, include_vgene=True, cdr3_col="junction_aa", v_gene_col="v_call")

but this fails:

clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3, include_vgene=False, cdr3_col="junction_aa")
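
A possible workaround, sketched under the assumption that handing fit() the CDR3 column itself sidesteps the ignored cdr3_col argument (the complete example below shows this working with faiss). The helper name `select_cdr3` is hypothetical and not part of clustcr; it only picks the column out of a table-like object and leaves plain iterables untouched.

```python
from collections.abc import Mapping


def select_cdr3(data, cdr3_col="junction_aa"):
    """Return only the CDR3 column from a table-like object.

    Hypothetical helper, not part of clustcr. Objects that look like a
    table (a mapping, or anything with a `columns` attribute such as a
    pandas DataFrame) are indexed by `cdr3_col`; anything else is assumed
    to already be an iterable of CDR3 sequences and is returned unchanged.
    """
    if isinstance(data, Mapping) or hasattr(data, "columns"):
        return data[cdr3_col]
    return data


# Intended use (assumption: not verified against every clustering method):
#   cdr3 = datasets.test_cdr3()
#   output = Clustering(method='faiss').fit(select_cdr3(cdr3))
```

This does not address the 0-dimensional array error seen with the other methods under Python 3.11.4.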

Here's a complete example:

#!/usr/bin/env python3

from clustcr import Clustering, datasets

# This works
clustering = Clustering(method='faiss')
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3['junction_aa'])
output = clustering.fit(cdr3, include_vgene=True, cdr3_col="junction_aa", v_gene_col="v_call")

data = datasets.vdjdb_paired()
cdr3, alpha = data['CDR3_beta'], data['CDR3_alpha']
output = clustering.fit(cdr3, alpha=alpha)

# This fails with 'Wrong input. Please provide an iterable object containing CDR3 amino acid sequences.'
clustering = Clustering()
cdr3 = datasets.test_cdr3()
output = clustering.fit(cdr3)

# MCL and two-step methods both fail with '0-dimensional array given. Array must be at least two-dimensional'
mcl_clustering = Clustering(method='mcl')
output = mcl_clustering.fit(cdr3)

ts_clustering = Clustering(method='two-step')
output = ts_clustering.fit(cdr3)

It works under Python 3.10.12 with the latest conda version of clustcr (the clustering completes with all methods), though SciPy complains: 'UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.25.1'.
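
To make the SciPy warning concrete, here is a minimal sketch of the range check it describes. The bounds and the detected version come from the warning text above; the function names are made up for illustration and this is not SciPy's own code.

```python
def parse_version(text):
    """Turn a string like '1.25.1' into (1, 25, 1).

    Assumes purely numeric, dot-separated components (no 'rc1' suffixes).
    """
    return tuple(int(part) for part in text.split("."))


def numpy_in_supported_range(detected, minimum="1.17.3", maximum="1.25.0"):
    """Check minimum <= detected < maximum, the range quoted in the warning."""
    return parse_version(minimum) <= parse_version(detected) < parse_version(maximum)


# The NumPy version reported in the warning falls just outside the range.
print(numpy_in_supported_range("1.25.1"))
```

Under that reading, pinning NumPy below 1.25.0 in the environment should silence the warning, though that was not tested here.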

I used the following commands to create the conda environments:

conda create -n clustcr python clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge
conda create -n clustcr3-10 python=3.10 clustcr -c svalkiers -c bioconda -c pytorch -c conda-forge

svalkiers commented 1 year ago

Thanks for raising this issue. It seems like the introduction of the V gene clustering functionality has broken some of the original examples. I hope to fix this in the next release.