phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License

Error in Aggregate the individual GEX runs into a single AnnData object #52

Open QianhuiXu opened 1 year ago

QianhuiXu commented 1 year ago

Hello,

conga is a wonderful tool!

I ran into an issue while exploring the fancy_conga_pipeline_with_batches_and_gammadelta_tcrs notebook.

My command:

gex_datasets = sorted(glob.glob('*-CD3'))
diseases = ['C','NC','CT'] # colitis, no-colitis, healthy control
contigs_file = '/home/shpc_100668/conga/GSE144469_RAW/GSE144469_TCR_filtered_contig_annotations_all.csv'
all_contigs = pd.read_csv(contigs_file)
all_data = []
for donor_num, gex_dir in enumerate(gex_datasets):
    # The folder name is also the donor ID
    donor = gex_dir.split('-')[0]
    donor_contigs = all_contigs[all_contigs.barcode.str.endswith(donor)].copy()
    # change the barcode suffix to '-1' to match the GEX data
    donor_contigs['barcode'] = donor_contigs.barcode.str.split('-').str.get(0)+'-1'
    donor_contigs_file = f'{donor}_abtcr_filtered_contigs.csv'
    donor_contigs.to_csv(donor_contigs_file)
    # process the contigs to generate conga clonotypes
    donor_clones_file = f'{donor}_abtcr_clones.tsv'
    make_10x_clones_file(
        donor_contigs_file,
        organism = 'human', # using 'human' for TCRab
        clones_file = donor_clones_file,
        stringent = True, # (the default) see Note #1 on clonotype filtering
    )
    # read the GEX data and the clonotypes into CoNGA
    adata = conga.preprocess.read_dataset(
        gex_dir, '10x_mtx', donor_clones_file,
        allow_missing_kpca_file=True)
    disease = donor[:-1]
    adata.obs['disease'] = disease
    adata.obs['disease_int'] = diseases.index(disease) # conga batch ids are integers
    adata.obs['donor'] = donor
    adata.obs['donor_int'] = donor_num # conga batch ids are integers
    all_data.append( adata )

new_adata = all_data[0].concatenate(all_data[1:])
new_adata.write('merged_gex_abtcr.h5ad')

Error:

IndexError                                Traceback (most recent call last)
/tmp/ipykernel_1354605/1967687937.py in <module>
     33
     34 # concatenate the datasets
---> 35 new_adata = all_data[0].concatenate(all_data[1:])
     36 #save the aggregate AnnData object
     37 new_adata.write('merged_gex_abtcr.h5ad')

IndexError: list index out of range

I'm really at a loss as to how to proceed, and any guidance would be much appreciated! Thank you for your kind help!

phbradley commented 1 year ago

Hi there, thanks for trying conga, and thanks for the feedback. This error suggests that the list "all_data" is empty, which may be because the preceding loop did not execute. The loop was over the files found by the glob command:

gex_datasets = sorted(glob.glob('*-CD3'))

Could you check and see whether the expected files are present and in the directory where the notebook is running? These would be the *-CD3 folders that have the GEX counts data in them.
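A quick way to run that check, as a minimal sketch: the helper below is hypothetical (`find_gex_dirs` is not part of conga) and only wraps the standard-library `glob` and `os` calls so that an empty match fails loudly before the loop, instead of surfacing later as an IndexError.

```python
import glob
import os


def find_gex_dirs(pattern):
    """Return the sorted directories matching `pattern`.

    Hypothetical helper (not part of conga): raises FileNotFoundError
    when nothing matches, so an empty result cannot silently skip the loop.
    """
    matches = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    if not matches:
        raise FileNotFoundError(
            f"no directories match {pattern!r} "
            f"(current working directory: {os.getcwd()})"
        )
    return matches
```

In the notebook this could replace the glob line, e.g. `gex_datasets = find_gex_dirs('*-CD3')`; the error message then reports the directory the notebook is actually running in.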

QianhuiXu commented 1 year ago

Thank you for your help! I solved this error by changing the directory that is read:

gex_datasets = sorted(glob.glob('/home/shpc_100668/conga/GSE144469_RAW/*-CD3'))

But I ran into another issue in the next step. I have put the *-gdTCR_filtered_contig_annotations.csv files in the same directory ('/home/shpc_100668/conga/GSE144469_RAW/').

My command:

gex_datasets = sorted(glob.glob('/home/shpc_100668/conga/GSE144469_RAW/*-CD3'))
diseases = ['C','NC','CT'] # colitis, no-colitis, healthy control
contigs_file = '/home/shpc_100668/conga/GSE144469_RAW/GSE144469_TCR_filtered_contig_annotations_all.csv'
all_contigs = pd.read_csv(contigs_file)
all_data = []
for donor_num, gex_dir in enumerate(gex_datasets):
    donor = gex_dir.split('-')[0]
    donor_contigs = all_contigs[all_contigs.barcode.str.endswith(donor)].copy()
    donor_contigs['barcode'] = donor_contigs.barcode.str.split('-').str.get(0)+'-1'
    donor_contigs_file = f'{donor}_abtcr_filtered_contigs.csv'
    donor_contigs.to_csv(donor_contigs_file)
    donor_clones_file = f'{donor}_abtcr_clones.tsv'
    make_10x_clones_file(
        donor_contigs_file,
        organism = 'human', # using 'human' for TCRab
        clones_file = donor_clones_file,
        stringent = True, # (the default) see Note #1 on clonotype filtering
    )
    adata = conga.preprocess.read_dataset(
        gex_dir, '10x_mtx', donor_clones_file,
        allow_missing_kpca_file=True)
    disease = donor[:-1]
    adata.obs['disease'] = disease
    adata.obs['disease_int'] = diseases.index(disease) # conga batch ids are integers
    adata.obs['donor'] = donor
    adata.obs['donor_int'] = donor_num
    all_data.append( adata )

new_adata = all_data[0].concatenate(all_data[1:])
new_adata.write('merged_gex_abtcr.h5ad')

Error:

ab_counts: []
old_unpaired_barcodes: 0 old_paired_barcodes: 0 new_stringent_paired_barcodes: 0
reading: /home/shpc_100668/conga/GSE144469_RAW/C1-CD3 of type 10x_mtx
total barcodes: 3862 (3862, 33538)
reading: /home/shpc_100668/conga/GSE144469_RAW/C1_abtcr_clones.tsv
WARNING: missing kpca_file: /home/shpc_100668/conga/GSE144469_RAW/C1_abtcr_clones_AB.dist_50_kpcs
WARNING: X_tcr_pca will be empty
Reducing to the 0 barcodes (out of 3862) with paired TCR sequence data
/home/shpc_100668/conga/conga/preprocess.py:233: DeprecationWarning: Use is_view instead of isview, isview will be removed in the future.
  if adata.isview: # ran into trouble with AnnData views vs copies

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2715303/7264258.py in <module>
     23     adata = conga.preprocess.read_dataset(
     24         gex_dir, '10x_mtx', donor_clones_file,
---> 25         allow_missing_kpca_file=True)
     26     disease = donor[:-1]
     27     adata.obs['disease'] = disease

~/conga/conga/preprocess.py in read_dataset(gex_data, gex_data_type, clones_file, make_var_names_unique, keep_cells_without_tcrs, kpca_file, allow_missing_kpca_file, gex_only, suffix_for_non_gene_features)
    403
    404     tcrs = [ barcode2tcr[x] for x in adata.obs.index ]
--> 405     store_tcrs_in_adata( adata, tcrs )
    406
    407     return adata

~/conga/conga/preprocess.py in store_tcrs_in_adata(adata, tcrs)
    178
    179     # ensure lower case
--> 180     adata.obs['cdr3a_nucseq'] = adata.obs.cdr3a_nucseq.str.lower()
    181     adata.obs['cdr3b_nucseq'] = adata.obs.cdr3b_nucseq.str.lower()
    182

~/anaconda3/envs/conga4/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488
   5489     def __setattr__(self, name: str, value) -> None:

~/anaconda3/envs/conga4/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    179             # we're accessing the attribute of the class, i.e., Dataset.geo
    180             return self._accessor
--> 181         accessor_obj = self._accessor(obj)
    182         # Replace the property with the accessor object. Inspired by:
    183         # https://www.pydanny.com/cached-property.html

~/anaconda3/envs/conga4/lib/python3.7/site-packages/pandas/core/strings/accessor.py in __init__(self, data)
    166         from pandas.core.arrays.string_ import StringDtype
    167
--> 168         self._inferred_dtype = self._validate(data)
    169         self._is_categorical = is_categorical_dtype(data.dtype)
    170         self._is_string = isinstance(data.dtype, StringDtype)

~/anaconda3/envs/conga4/lib/python3.7/site-packages/pandas/core/strings/accessor.py in _validate(data)
    223
    224         if inferred_dtype not in allowed_types:
--> 225             raise AttributeError("Can only use .str accessor with string values!")
    226         return inferred_dtype
    227

AttributeError: Can only use .str accessor with string values!

Thank you for your kind help!
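One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: the clones file is reported as /home/shpc_100668/conga/GSE144469_RAW/C1_abtcr_clones.tsv, which suggests that `donor` now holds the full directory path instead of just `C1`, because `gex_dir.split('-')[0]` is applied to the whole path. If so, `all_contigs.barcode.str.endswith(donor)` would match no barcodes, which would be consistent with `ab_counts: []` and the reduction to 0 paired barcodes. The sketch below shows the difference; `donor_from_gex_dir` is a hypothetical helper, not part of conga.

```python
import os


def donor_from_gex_dir(gex_dir):
    """Extract the donor ID (e.g. 'C1') from a GEX folder path.

    Hypothetical helper (not part of conga): takes the last path
    component first, so a leading directory does not end up in the ID.
    """
    folder = os.path.basename(os.path.normpath(gex_dir))
    return folder.split('-')[0]


gex_dir = '/home/shpc_100668/conga/GSE144469_RAW/C1-CD3'

# Splitting the full path keeps the directory prefix in the "donor" string.
print(gex_dir.split('-')[0])        # /home/shpc_100668/conga/GSE144469_RAW/C1

# Splitting only the folder name gives the bare donor ID.
print(donor_from_gex_dir(gex_dir))  # C1
```

With a bare donor ID, `disease = donor[:-1]` would again yield `C`, `NC`, or `CT`, and the per-donor output files would be written relative to the notebook's working directory, as in the original notebook.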