svalkiers / clusTCR

CDR3 clustering module providing a new method for fast and accurate clustering of large data sets of CDR3 amino acid sequences, and offering functionalities for downstream analysis of clustering results.

issue with metaclustering and airr data #49

Open guillemsanchezsanchez1996 opened 1 year ago

guillemsanchezsanchez1996 commented 1 year ago

Hello Sebastiaan and co.

Thanks a lot for designing this nice package for understanding the nature of TCR repertoires and potential expansions. With the AIRR data I currently have, my goal is to compare clusters between different subjects, and I think your batch approach could be really useful for this.

I have been following some of the sections in your documentation, but unfortunately I am stuck on the demo for clustering a set of repertoires simultaneously. The main issue is with the metarepertoire function. Here is the error:

In [14]: training_sample_size = round(1000 * (total_cdr3s / 5000))
    ...: training_sample = metarepertoire(directory=datadir,
    ...:                                  data_format='airr',
    ...:                                  n_sequences=training_sample_size)
    ...:

TypeError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 training_sample_size = round(1000 * (total_cdr3s / 5000))
----> 2 training_sample = metarepertoire(directory=datadir,
      3                                  data_format='airr',
      4                                  n_sequences=training_sample_size)

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/clustcr/input/datasets.py:65, in metarepertoire(directory, data_format, out_format, n_sequences)
     63     meta = pd.concat([meta, parse_immuneaccess(file, out_format=out_format)])
     64 elif data_format.lower()=='airr':
---> 65     meta = pd.concat([meta, parse_airr(file)])
     66 elif data_format.lower()=='tcrex':
     67     meta = pd.concat([meta, parse_tcrex(file)])

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/core/reshape/concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    146 @deprecate_nonkeyword_arguments(version=None, allowed_args=["objs"])
    147 def concat(
    148     objs: Iterable[NDFrame] | Mapping[HashableT, NDFrame],
   (...)
    157     copy: bool = True,
    158 ) -> DataFrame | Series:
    159     """
    160     Concatenate pandas objects along a particular axis.
    161
   (...)
    366     1   3   4
    367     """
--> 368     op = _Concatenator(
    369         objs,
    370         axis=axis,
    371         ignore_index=ignore_index,
    372         join=join,
    373         keys=keys,
    374         levels=levels,
    375         names=names,
    376         verify_integrity=verify_integrity,
    377         copy=copy,
    378         sort=sort,
    379     )
    381     return op.get_result()

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/core/reshape/concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    453 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
    454     msg = (
    455         f"cannot concatenate object of type '{type(obj)}'; "
    456         "only Series and DataFrame objs are valid"
    457     )
--> 458     raise TypeError(msg)
    460 ndims.add(obj.ndim)
    462 # get the sample
    463 # want the highest ndim that we have, and must be non-empty
    464 # unless all objs are empty

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

I think the main issue is that AIRR files are not loaded as a pandas DataFrame. See this code as an example:

data = read_cdr3('/mnt/c/Users/usuari/Desktop/mixcr-4.1.2/clustcr/output_TRB_SP_135.tsv', data_format='airr')

In [25]: data
Out[25]:
array(['CASSQGFGTQYF', 'CASSQSQYAEQFF', 'CASSRGAADTLYF', ...,
       'SASSLGQNNSPLHF', 'SASSSYEQHF', 'RGHTGQLYF'], dtype=object)
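The failure can be reproduced without ClusTCR at all. The sketch below uses only pandas and NumPy. The three sequences are copied from the output above, and the variable names are illustrative. It shows that pd.concat rejects a raw NumPy array with the same TypeError, and that wrapping the array in a pandas.Series first avoids it.

```python
import numpy as np
import pandas as pd

# Stand-in for what the AIRR parser returned in the report above:
# a plain object-dtype NumPy array of CDR3 sequences.
cdr3s = np.array(["CASSQGFGTQYF", "CASSQSQYAEQFF", "CASSRGAADTLYF"], dtype=object)

# Mirrors the failing call pattern pd.concat([meta, <ndarray>]).
meta = pd.DataFrame()
try:
    pd.concat([meta, cdr3s])
    failed = False
    message = ""
except TypeError as err:
    failed = True
    message = str(err)

# Wrapping each array in a Series makes the concatenation valid.
fixed = pd.concat([pd.Series(cdr3s), pd.Series(cdr3s)], ignore_index=True)
```

Here `failed` ends up True, and `fixed` is a Series holding both copies of the three sequences.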

Do you have any idea what the problem might be?

All my best,

Guillem Sanchez

svalkiers commented 1 year ago

Hi Guillem,

Thanks for using ClusTCR. We are very sorry about this inconvenience. I believe your assumption is correct: the parse_airr function returns a numpy.ndarray instead of a pandas.Series. I will try to resolve the issue as soon as possible.
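Until a fix is released, a user-side workaround along the following lines might help. This is only a sketch, not ClusTCR's implementation. The helpers `to_series` and `build_metarepertoire` are hypothetical names, the Series name `junction_aa` is an assumption based on the AIRR column for CDR3 amino acid sequences, and the input is assumed to be a list of per-file parser outputs (for example, the arrays returned by read_cdr3 with data_format='airr').

```python
import numpy as np
import pandas as pd

def to_series(cdr3s, name="junction_aa"):
    """Coerce parser output (ndarray, list or Series) to a pandas Series.

    Hypothetical helper; the default name is an assumption.
    """
    return pd.Series(np.asarray(cdr3s, dtype=object), name=name)

def build_metarepertoire(parsed_files, n_sequences, seed=0):
    """Concatenate per-file CDR3 collections and draw a random sample.

    Hypothetical stand-in for the sampling step; the sample size is
    capped at the number of available sequences.
    """
    parts = [to_series(p) for p in parsed_files]
    meta = pd.concat(parts, ignore_index=True)
    n = min(n_sequences, len(meta))
    return meta.sample(n=n, random_state=seed).reset_index(drop=True)

# Usage with two small stand-in "files" (sequences taken from the report).
parsed = [
    np.array(["CASSQGFGTQYF", "CASSQSQYAEQFF"], dtype=object),
    np.array(["SASSSYEQHF"], dtype=object),
]
sample = build_metarepertoire(parsed, n_sequences=2)
```

The resulting Series could then be passed on in place of the metarepertoire output, provided the downstream clustering step accepts a pandas.Series of CDR3 sequences.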

guillemsanchezsanchez1996 commented 1 year ago

Thanks a lot Sebastiaan for your fast answer!

Looking forward to your help with this issue :)

Guillem

guillemsanchezsanchez1996 commented 1 year ago

Oops, sorry, I closed the issue by mistake!

guillemsanchezsanchez1996 commented 1 year ago

Hello Sebastiaan, by any chance have you had some time to look into this issue?

Thanks again for your help,

Guillem