issue with metaclustering and airr data

guillemsanchezsanchez1996 commented 1 year ago

Hello Sebastiaan and co.

Thanks a lot for designing this nice package for exploring TCR repertoires and potential expansions. My goal with my current AIRR data is to compare clusters between different subjects, and I think your batch approach could be really useful for this.

I have been following some sections of your documentation, but unfortunately I am stuck on the demo for clustering a set of repertoires simultaneously. The main issue is with the metarepertoire function. Here is the error:

In [14]: training_sample_size = round(1000 * (total_cdr3s / 5000))
    ...: training_sample = metarepertoire(directory=datadir,
    ...:                                  data_format='airr',
    ...:                                  n_sequences=training_sample_size)

TypeError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 training_sample_size = round(1000 * (total_cdr3s / 5000))
----> 2 training_sample = metarepertoire(directory=datadir,
      3                                  data_format='airr',
      4                                  n_sequences=training_sample_size)

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/clustcr/input/datasets.py:65, in metarepertoire(directory, data_format, out_format, n_sequences)
     63     meta = pd.concat([meta, parse_immuneaccess(file, out_format=out_format)])
     64 elif data_format.lower()=='airr':
---> 65     meta = pd.concat([meta, parse_airr(file)])
     66 elif data_format.lower()=='tcrex':
     67     meta = pd.concat([meta, parse_tcrex(file)])

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/core/reshape/concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    146 @deprecate_nonkeyword_arguments(version=None, allowed_args=["objs"])
    147 def concat(
    148     objs: Iterable[NDFrame] | Mapping[HashableT, NDFrame],
   (...)
    157     copy: bool = True,
    158 ) -> DataFrame | Series:
    159     """
    160     Concatenate pandas objects along a particular axis.
    161
   (...)
    366     1 3 4
    367     """
--> 368     op = _Concatenator(
    369         objs,
    370         axis=axis,
    371         ignore_index=ignore_index,
    372         join=join,
    373         keys=keys,
    374         levels=levels,
    375         names=names,
    376         verify_integrity=verify_integrity,
    377         copy=copy,
    378         sort=sort,
    379     )
    381     return op.get_result()

File ~/miniconda3/envs/clustcr/lib/python3.9/site-packages/pandas/core/reshape/concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    453 if not isinstance(obj, (ABCSeries, ABCDataFrame)):
    454     msg = (
    455         f"cannot concatenate object of type '{type(obj)}'; "
    456         "only Series and DataFrame objs are valid"
    457     )
--> 458     raise TypeError(msg)
    460 ndims.add(obj.ndim)
    462 # get the sample
    463 # want the highest ndim that we have, and must be non-empty
    464 # unless all objs are empty

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

I think the main issue is that AIRR files are not loaded as a pandas DataFrame. See this code as an example:

data = read_cdr3('/mnt/c/Users/usuari/Desktop/mixcr-4.1.2/clustcr/output_TRB_SP_135.tsv', data_format='airr')

In [25]: data
Out[25]:
array(['CASSQGFGTQYF', 'CASSQSQYAEQFF', 'CASSRGAADTLYF', ...,
       'SASSLGQNNSPLHF', 'SASSSYEQHF', 'RGHTGQLYF'], dtype=object)
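The same error can be reproduced without ClusTCR. The sketch below assumes only that the AIRR parser hands `pd.concat` a bare NumPy array, as the output above suggests. The names `meta` and `parsed` and the `junction_aa` column are illustrative stand-ins, not taken from the package.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the accumulated metarepertoire.
meta = pd.DataFrame({"junction_aa": ["CASSQGFGTQYF"]})

# Illustrative stand-in for the parser output: a bare object array.
parsed = np.array(["CASSQSQYAEQFF", "CASSRGAADTLYF"], dtype=object)

try:
    # pd.concat only accepts Series and DataFrame objects.
    pd.concat([meta, parsed])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the message mentions `numpy.ndarray`, the problem is the type of the parsed object rather than the contents of the AIRR file.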

Do you have any idea what the problem might be?

All my best,

Guillem Sanchez

svalkiers commented 1 year ago

Hi Guillem,

Thanks for using ClusTCR. We are very sorry about this inconvenience. I believe your assumption is correct: the parse_airr function returns a numpy.ndarray instead of a pandas.Series. I will try to resolve the issue as soon as possible.
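In the meantime, one possible workaround is to wrap each parsed array in a pandas object before concatenating. This is only a sketch under assumptions, not the package's actual fix: `combine_parsed` is a hypothetical helper, `junction_aa` is an assumed column name, and the arrays below stand in for per-file parser output.

```python
import numpy as np
import pandas as pd


def combine_parsed(arrays, column="junction_aa"):
    """Concatenate per-file CDR3 arrays into a single DataFrame.

    Hypothetical helper: each array is wrapped in a DataFrame so that
    pd.concat accepts it. Dropping duplicates is an optional extra step.
    """
    frames = [pd.DataFrame({column: np.asarray(a, dtype=object)}) for a in arrays]
    if not frames:
        # Nothing to combine: return an empty frame with the expected column.
        return pd.DataFrame({column: pd.Series(dtype=object)})
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


# Stand-ins for the arrays parsed from two AIRR files.
file_a = np.array(["CASSQGFGTQYF", "CASSQSQYAEQFF"], dtype=object)
file_b = np.array(["CASSQSQYAEQFF", "SASSSYEQHF"], dtype=object)

combined = combine_parsed([file_a, file_b])
print(combined)
```

Wrapping in a DataFrame rather than a Series keeps a named column, which makes the result easier to pass on to later steps that expect tabular input.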

guillemsanchezsanchez1996 commented 1 year ago

Thanks a lot Sebastiaan for your fast answer!

Looking forward to your help in solving this issue :)

Guillem

guillemsanchezsanchez1996 commented 1 year ago

Oops, sorry, I closed the issue by mistake!

guillemsanchezsanchez1996 commented 1 year ago

Hello Sebastiaan, have you by any chance had time to look into this issue?

Thanks again for your help,

Guillem

svalkiers / clusTCR

issue with metaclustering and airr data #49
