phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License
79 stars, 18 forks

Question about using previously integrated gex data #28

Closed s2hui closed 2 years ago

s2hui commented 2 years ago

Hello,

In the example on merging multiple data sets, the individual sample clones and companion gex are supplied in a txt file and inputted into the merge_samples.py script.

What I have are the individual clone files and one integrated gex file. What would be the appropriate way to go about analyzing my data?

Would it make sense to supply the converted integrated gex data (i.e. mtx or h5 format) as the companion file for each clone in the sample.txt file?

Appreciate your insight, @s2hui

sschattgen commented 2 years ago

Hi,

Assuming the barcode suffixes match between the clones files and the gex matrix, you could run each sample individually since any cells without tcr information would be excluded. But if you’d like to run it all together then I’d recommend using the make_10x_clones_file_batch function (which we need to document better). You can use a metadata csv file with two columns, “file” containing the paths to each of the filtered_contig_annotations.csv files, and “suffix” which should contain the barcode suffix for matching that sample to the corresponding cells in your aggregate gex matrix. This will merge each of the files into a single output clones file that can then be passed to run_conga.py with your gex file.
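
A minimal sketch of how such a metadata file could be generated, using only the standard library. The sample paths, the suffix values, and the helper name `metadata_csv_text` are hypothetical; only the two column names ("file" and "suffix") come from the description above.

```python
import csv
import io

# Hypothetical samples: path to each filtered_contig_annotations.csv and the
# barcode suffix that identifies that sample in the aggregate gex matrix.
samples = [
    ("sampleA/filtered_contig_annotations.csv", "1"),
    ("sampleB/filtered_contig_annotations.csv", "2"),
]


def metadata_csv_text(rows):
    """Return CSV text with a 'file,suffix' header followed by one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "suffix"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Write the metadata file that would be handed to the batch function.
    with open("metadata.csv", "w") as handle:
        handle.write(metadata_csv_text(samples))
```

The suffix values have to match whatever the aggregate gex matrix actually uses, so it is worth checking a few barcodes from the matrix before filling in this column.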

s2hui commented 2 years ago

Hi, Thanks for this info.

Re: running all together

The barcodes in my integrated data set have a prefix (not suffix), i.e. prefix_barcode-1, so I'm guessing the batch function would not work in this case?

Re: running individually

I've preprocessed my file so that many cells have been removed. If there are cells (barcodes) in the tcr clone file that aren't in the gex file, would that cause problems? If not, could I run individually but supply the integrated gex file for each of the clone files when I run merge_samples.py? This assumes that the clone mapping file contains the matching prefixed barcodes (i.e. the barcodes in the clones match the barcodes in the integrated gex file).

So the sample.txt file would look like:

clones_file gex_data    gex_data_type
tcr_cloneA.tsv  integrated_mtx_dir  10x_mtx
tcr_cloneB.tsv  integrated_mtx_dir  10x_mtx
tcr_cloneC.tsv  integrated_mtx_dir  10x_mtx
...
sschattgen commented 2 years ago

Hi.

Regarding your second question, I would not recommend using merge_samples.py this way. What I meant previously is that you could run run_conga.py with one clones file and the integrated gex file, like this:

python run_conga.py --gex_data integrated_mtx_dir --gex_data_type 10x_mtx --clones_file tcr_cloneA.tsv --organism human --outfile_prefix ./outdir/prefix

I would still recommend aggregating your clones together first before merging them with your gex file. I've modified the make_10x_clones_file_batch function (make sure to pull the latest commit) so that either a prefix or suffix can be stripped from or appended to the barcode. It's hard to anticipate every possible configuration and make this all-encompassing, so you will need to modify your barcodes in either the GEX or the filtered_contig_annotations file prior to parsing.

Assuming the barcodes in the filtered_contig_annotations file look like "barcode-1", perhaps the simplest way is to modify the barcodes in your gex file by stripping off the "-1" and replacing the "_" between the prefix and barcode with "-" so they look like "prefix-barcode". Then specify the appropriate prefixes in the "batch_id" column (this changed in the new commit) of the metadata file and use the following to get the tcr barcodes into the same configuration:

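# Import path assumes the conga package layout (conga/tcrdist/make_10x_clones_file.py)
from conga.tcrdist.make_10x_clones_file import make_10x_clones_file_batch

metadata_file = 'metadata.csv'  # columns: file, batch_id
organism = 'human'
clones_file = 'clones.tsv'  # merged clones file written by the call below
make_10x_clones_file_batch(metadata_file, organism, clones_file, strip_batch_id_location='suffix', add_batch_id_location='prefix')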
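
For the gex side, the renaming described above (drop the trailing "-1", then swap the "_" between prefix and barcode for "-") could be sketched in Python as follows. The function name and the sample barcodes are made up for illustration, and the sketch assumes the nucleotide barcode itself never contains an underscore, so the last "_" is always the prefix separator.

```python
def to_prefix_dash_barcode(barcode, old_sep="_", gem_suffix="-1"):
    """Convert 'prefix_BARCODE-1' into 'prefix-BARCODE'."""
    # Drop the trailing GEM-well suffix if it is present.
    if barcode.endswith(gem_suffix):
        barcode = barcode[: -len(gem_suffix)]
    # Split on the LAST separator so prefixes containing '_' survive intact.
    prefix, sep, core = barcode.rpartition(old_sep)
    if not sep:
        raise ValueError("no prefix separator found in barcode: " + barcode)
    return prefix + "-" + core


if __name__ == "__main__":
    for name in ["sampleA_AAACCTGAGAAACCAT-1", "pt_1_AAACCTGAGAAACCAT-1"]:
        print(name, "->", to_prefix_dash_barcode(name))
```

Raising on a barcode without a separator is a deliberate choice here: silently passing such barcodes through would hide exactly the kind of mismatch that later shows up as zero matching cells.
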
s2hui commented 2 years ago

Hi,

Thanks for the detailed instructions!

I renamed the barcodes in my integrated Seurat object to be of the format: prefix-barcode (stripped off the -1 at the end and replaced the _ with a - in between the prefix and barcode). Then I converted the object into mtx using the write10xCounts method within R.

Then I made a batch.csv file with two columns (format below):

file,batch_id
filtered_contig_annotations.csv,prefix

Then I ran make_10x_clones_file_batch but got the following error:

Python 3.7.2 (default, Dec 29 2018, 06:19:36) [GCC 7.3.0]
Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-centos-7.9.2009-Core
62 logical CPU cores, x86_64
-----
Session information updated at 2021-10-18 17:37

reading: /cluster/projects/finelligroup/scKidneyCancer/out_shirley/conga/out/mtx of type 10x_mtx
total barcodes: 113730 (113730, 34271)
reading: /cluster/projects/finelligroup/scKidneyCancer/out_shirley/conga/out/merged_remedy/merged_remedy_clones.tsv
reading: /cluster/projects/finelligroup/scKidneyCancer/out_shirley/conga/out/merged_remedy/merged_remedy_clones_AB.dist_50_kpcs
Reducing to the 0 barcodes (out of 113730) with paired TCR sequence data
Traceback (most recent call last):
  File "/cluster/home/hshirley/conga/scripts/run_conga.py", line 378, in <module>
    suffix_for_non_gene_features = args.suffix_for_non_gene_features,
  File "/cluster/home/hshirley/conga/conga/preprocess.py", line 393, in read_dataset
    store_tcrs_in_adata( adata, tcrs )
  File "/cluster/home/hshirley/conga/conga/preprocess.py", line 168, in store_tcrs_in_adata
    adata.obs['cdr3a_nucseq'] = adata.obs.cdr3a_nucseq.str.lower()
  File "/cluster/home/hshirley/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 5456, in __getattr__
    return object.__getattribute__(self, name)
  File "/cluster/home/hshirley/.local/lib/python3.7/site-packages/pandas/core/accessor.py", line 180, in __get__
    accessor_obj = self._accessor(obj)
  File "/cluster/home/hshirley/.local/lib/python3.7/site-packages/pandas/core/strings/accessor.py", line 154, in __init__
    self._inferred_dtype = self._validate(data)
  File "/cluster/home/hshirley/.local/lib/python3.7/site-packages/pandas/core/strings/accessor.py", line 218, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!

I'm not sure if it is the cause but when I look inside the resulting aggregated clones mappings file, I see that all the prefix-barcodes have -1-0 appended, so it looks like:

prefix-barcode-1-0

This format doesn't match any barcodes in the gex data file, as those barcodes were renamed to be prefix-barcode.

I wonder what the issue is? Thanks for your help! @s2hui

sschattgen commented 2 years ago

Hi @s2hui,

The error is due to the misalignment of barcodes between the clones file and the GEX matrix. Could you share the details of the make_10x_clones_file_batch command you ran as well as the first 5 or so barcodes from one of the filtered_contig_annotations.csv files and the integrated GEX matrix prior to the merger?
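
A quick way to check for this kind of mismatch before running anything is to compare the two barcode sets directly. The sketch below is generic Python rather than part of conga; the function name `barcode_overlap` and the example barcodes are hypothetical, and how the barcodes are loaded from the clones files and the gex matrix is left to the reader.

```python
def barcode_overlap(tcr_barcodes, gex_barcodes, n_examples=5):
    """Summarize how many barcodes are shared between TCR and GEX data."""
    tcr = set(tcr_barcodes)
    gex = set(gex_barcodes)
    return {
        "n_tcr": len(tcr),
        "n_gex": len(gex),
        "n_shared": len(tcr & gex),
        # A few unmatched barcodes from each side make format differences visible.
        "tcr_only_examples": sorted(tcr - gex)[:n_examples],
        "gex_only_examples": sorted(gex - tcr)[:n_examples],
    }


if __name__ == "__main__":
    # Made-up barcodes in the two formats mentioned in this thread.
    tcr = ["prefix-AAACCTGAGAAACCAT-1-0", "prefix-AAACCTGAGAAACCGC-1-0"]
    gex = ["prefix-AAACCTGAGAAACCAT", "prefix-AAACCTGAGAAACCGC"]
    print(barcode_overlap(tcr, gex))
```

An `n_shared` of zero corresponds to the "Reducing to the 0 barcodes" line in the log, and the example lists usually show at a glance which prefix or suffix differs.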

s2hui commented 2 years ago

Hi, I went over my call to run conga and indeed I wasn't using the correct gex data file! It is working now after I followed all your steps above. I appreciate all your help! @s2hui

phbradley commented 1 year ago

Hi there, For some reason I am having trouble seeing the context for this email on github. Is this a post on one of the open issues, or is it an email directly to me? So, when you say below, "I performed the procedure as above" I can't figure out what that refers to. Maybe include a bit more context or let me know which issue it is? When I click on issue #28 (from the subject line) I don't see the post. Take care, Phil


From: leeanapeters
Sent: Tuesday, August 2, 2022 12:10 PM
To: phbradley/conga
Subject: Re: [phbradley/conga] Question about using previously integrated gex data (#28)

Hi, I am having an issue with keeping the batch parameters to compare groups while performing the make_10x_clones_file_batch merge.

I performed the procedure as above but then modified my adata object using a batch_info file (containing barcodes and other metadata ie patient and condition) and set those as the batch keys.

When I attempt to use the clones file generated by make_10x_clones_file_batch along with the gex exported from the adata object, I receive this error, even though the mapping file is in the same directory: assert exists(kpca_file) AssertionError.

When I use --no_kpca I receive this error:

total barcodes: 18863 (18863, 18327)
reading: all_clones_w_prefix_added.tsv
WARNING: missing kpca_file: all_clones_w_prefix_added_AB.dist_50_kpcs
WARNING: X_tcr_pca will be empty
Reducing to the 0 barcodes (out of 18863) with paired TCR sequence data
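
Judging only from the log lines in this thread, the kPCA file appears to be looked up next to the clones file, with the ".tsv" extension replaced by "_AB.dist_50_kpcs". Under that assumption, a small pre-flight check could look like the sketch below; the helper names are hypothetical and the naming rule is inferred from the logs, not taken from the conga source.

```python
from os.path import exists


def expected_kpca_file(clones_file, suffix="_AB.dist_50_kpcs"):
    """Guess the kPCA file name for a clones file (rule inferred from the logs)."""
    if not clones_file.endswith(".tsv"):
        raise ValueError("expected a clones file ending in .tsv: " + clones_file)
    return clones_file[: -len(".tsv")] + suffix


def missing_inputs(clones_file):
    """Return whichever of the clones file and its kPCA file are not on disk."""
    candidates = [clones_file, expected_kpca_file(clones_file)]
    return [path for path in candidates if not exists(path)]


if __name__ == "__main__":
    print(expected_kpca_file("all_clones_w_prefix_added.tsv"))
    print(missing_inputs("all_clones_w_prefix_added.tsv"))
```

This only addresses the missing-file assertion. The "Reducing to the 0 barcodes" line is a separate symptom and points to barcodes that do not match between the clones file and the gex data.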

Any help would be appreciated!

Leeana
