phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License
80 stars 18 forks source link

Question: Can Conga be run on SMART-seq data ? #17

Open vincenthahaut opened 3 years ago

vincenthahaut commented 3 years ago

Hello!

I recently sequenced some expanded CD8+ T-cells using a SMART-seq protocol. Given the full-length sequencing I was able to simultaneously capture the TCR-A and/or TCR-B information of most of the cells in my dataset. Based on the embedding I can see that some of the clones differ from the rest of the population and I would like to explore their transcriptomic signature.

I was wondering if according to you the Conga pipeline could be repurposed to analyse SMART-seq data ?

Thank you in advance!

phbradley commented 3 years ago

Hi there,

I definitely think it's possible! I am not familiar with the format of the SMART-seq output files. Do you have an example or pointer to an example or documentation? We would just need a gene expression counts matrix with associated gene names and barcode names, and the tcr information on a per-barcode basis. Happy to help make this possible.

Take care, Phil

vincenthahaut commented 3 years ago

The main difference between SMART and 10x in this case is the absence of UMI. This means that we cannot control as well as droplet-based methods for PCR duplicates. The output format looks exactly the same as for 10x but the counts need to be normalized for differences in sequencing depths (e.g., using sum factor or regressing the count out) and then the matrix is log-normalized. After that I think the matrix could be used in Conga.

The TCR information can be extracted with TRAcer or TRUST4 and easily converted to the TCRdist format.

If in principle this sounds fine with you I will give it a try using the current pipeline and see how far I can go without breaking it :-)

grst commented 3 years ago

Hi @phbradley,

just happend to read this. An easy way for you to support more formats could be by integrating with scirpy. We already have reader functions for several formats including the AIRR standard and TraCeR. Scirpy stores the data in an anndata object with additional TCR-related columns in .obs. You could either use that directly, or we can discuss adding a to_conga function to export the data to whatever format you require.

LMK if you are interested.

Cheers, Gregor

vincenthahaut commented 3 years ago

EDIT: I think I solved the issue. One of the clone had "_" in the CDR3 sequence. Probably a missing AA.

I unfortunately ran into some troubles with the format of the VDJ file I think.

I used TRUST4 (https://github.com/liulab-dfci/TRUST4) to call the VDJ rearrangements as in my dataset Tracer failed to annotate many of the clones. This software provides a TRUST4 ==> 10x format script but it seems to be broken. I made my own version of it and converted my data to a 10x format that should be compatible with Conga.

make_10x_clones_file found ~540 clones which is very close to what I defined myself. However the code broke at conga.preprocess.make_tcrdist_kernel_pcs_file_from_clones_file.

image

The pipeline is working with the demo pbmc3_tcell dataset.

Would you know what could be the issue ?

In addition, I was exploring the intermediate clones_file. Could you tell me what should contain the va2_gene/ja2_gene/cdr3a2/cdr3a2_nucseq columns ? They seem to be empty in my clones_file

Thank you in advance!

phbradley commented 3 years ago

Hi Radek, Thanks for giving this a shot! It looks to me like there is an '_' character in one of the CDR3 regions. Perhaps that's from an out-of-frame sequence? Happy to assist in any way here. Take care, Phil

vincenthahaut commented 3 years ago

Thanks!

I was just updating my previous post. I indeed found a CDR3 sequence with a "_" :-)

phbradley commented 3 years ago

Hi Gregor, Thanks so much! Yes, we are aware of scirpy and it seems like a very useful tool, certainly more advanced than conga in terms of usability and robustness! I really appreciate your offer of help. Let me mull it over; my only tiny hesitation, being someone who is not a developer/programmer but rather a biochemist, is fear of introducing another dependency. Anyhow, I am putting a grant in tomorrow and then will have time to get back to this. Take care, Phil

vincenthahaut commented 3 years ago

I had to make a few more adjustments to the code, especially at the GEX filtering but it led to some results.

Replacing

# adata = conga.preprocess.filter_and_scale(adata)

By:

adata = conga.preprocess.filter_normalize_and_hvg(
        adata, min_genes=200, n_genes=5600, percent_mito=25,
        exclude_TR_genes= True, exclude_sexlinked=True)
sc.pp.regress_out(adata, ['n_counts','percent_mito'])
sc.pp.scale(adata, max_value=10)
adata.layers['scaled'] = sc.pp.scale(adata, max_value=10, copy=True).X

For whatever reason n_genes / percent_mito are harcoded somewhere in conga.preprocess.filter_and_scale and I was not able to change them in another way.

I got a nice looking TCR UMAP:

Screenshot 2021-06-15 at 21 12 27

But it seems that the downstream part did not show much association between the TCR and GEX. I am not sure I trust the results in this case as I know that some of the clones here have a clearly different gene expression. I guess the issue may lie in the hvg detection and/or the small pink population which are doublets I forgot to remove.

Screenshot 2021-06-15 at 21 10 16

So in principle the code can be run on SMART-seq data ;-) However I am not yet entirely sure that all the processing steps are 100% compatible with the data structure.

grst commented 3 years ago

Hi Phil,

I can understand the concern about adding additional dependencies. However, scirpy produces a vanilla anndata object with certain column naming conventions in obs (which are actually AIRR standard column names), so you can read that in with anndata without any need to have scirpy installed.

I also wouldn't remove direct support for 10x files -- this is definitely helpful for people who just want to quickly run the conga pipeline on their data and are not familiar with Python.

IMO, being able to read in scirpy-formatted anndata objects would mostly be useful for the following two use-cases:

Cheers, Gregor