phbradley / conga

Clonotype Neighbor Graph Analysis
MIT License
83 stars 19 forks source link

Question: num cells #30

Closed s2hui closed 3 years ago

s2hui commented 3 years ago

Hello, After running conga, I have the following conga cluster:

image

When I look at the file merged_remedy_graph_vs_graph.tsv, and I look at only the subset of the rows where gex_cluster = 5 and tcr_cluster = 2, I am able to count the proper number of unique clonotypes (=25). I thought that the number of rows in this subset should equal the number of cells (=52), however I only count 44 rows.

I did the same for other conga clusters and while the number of clonotypes matches what is on the figure, I can't get the number of cells to match? I wonder where my error might be?

Thanks for your help, @s2hui

sschattgen commented 3 years ago

Hi,

It's important to remember that CoNGA is operating on the level of clonotypes, not cells, and that each clonotype is represented by one cell even if the clone size is not equal to 1. This means each row in the results table is a clonotype so the number of rows will not equal the number of cells, which would be the sum of the clone sizes for all clonotypes in the cluster. Clone size information is stored in adata.obs. You could count the number of clonotypes and cells in a conga cluster using something like this.

Open the analyzed h5ad object

import scanpy as sc
import pandas as pd
import numpy as np
adata = sc.read_h5ad('vdj_v1_hs_pbmc2_5gex_protein_final.h5ad')

Open your graph-vs-graph results

gvg_hits = pd.read_csv('vdj_v1_hs_pbmc2_5gex_protein_test_graph_vs_graph.tsv', sep = '\t')

Pick your cluster of interest and the nbr_frac that was used for plotting

gvg_7_6 = gvg_hits[(gvg_hits.gex_cluster == 7) & (gvg_hits.tcr_cluster == 6) & (gvg_hits.nbr_frac == 0.1) ]
clonotypes = np.unique(gvg_7_6.clone_index)

This should equal number under 'clonotypes' in plot

print(clonotypes.shape[0])

Remember that we're reducing to one cell per clonotype for the analysis, however, the original number of cells per clonotype before reducing is stored in adata.obs.

gvg_7_6_obs = adata.obs.iloc[clonotypes]
cells = np.sum(gvg_7_6_obs.clone_sizes)

This should equal number under 'cells' in the plot

print(cells) 
s2hui commented 3 years ago

Hi,

Thanks for the detailed steps. I still cannot get the number of cells to match what is in the figure for some reason. The nbr_frac = 0.01 which I double checked in the log.txt and summary.html files.

I loaded the h5ad file and the gvg file.

The ran the following:

>>> g = gvg_hits[(gvg_hits.gex_cluster == 5) & (gvg_hits.tcr_cluster == 2) & (gvg_hits.nbr_frac == 0.01) ]
>>> clonotypes = np.unique(g.clone_index)
>>> print(clonotypes.shape[0])
11

But the number of clonotypes should be 25 according to the gvg logos.png file.

Just out of curiosity, I tried nbr_frac = 0.1 and get the following:

>>> g = gvg_hits[(gvg_hits.gex_cluster == 5) & (gvg_hits.tcr_cluster == 2) & (gvg_hits.nbr_frac == 0.1) ]
>>> clonotypes = np.unique(g.clone_index)
>>> print(clonotypes.shape[0])
22

Strangely, if I try another conga cluster, but use nbr_frac = 0.1, I am able to get the correct number of clonotypes for some of the clusters but most do not match. The number of cells also does not match.

I double checked that the final.h5ad object is the in the same directory as the gvg_logos.png file and gvg hits file that I am using.

Thanks for your help, @s2hui

phbradley commented 3 years ago

Hi there, thanks for using CoNGA! The conga hits include results from both 0.01 and 0.1 nbr_frac analyses. So it should be the case that this would give the expected number of clonotypes:

>>> g = gvg_hits[(gvg_hits.gex_cluster == 5) & (gvg_hits.tcr_cluster == 2)]
>>> clonotypes = np.unique(g.clone_index)
>>> print(clonotypes.shape[0])

Let me know if that's not the case!

For the cells, the prediction is that if you sum the clone sizes for those clonotypes, you get the cell number (e.g. using the clone_sizes column in the *_final_obs.tsv output or the adata.obs.clone_sizes array). If you want the cell barcodes, you could look for the corresponding tcrs in the 10x filtered_contigs file, or use the *_clones.tsv.barcode_mapping.tsv file to map from clone_id to barcodes. We're planning on adding another output to make it easier to trace back to cell barcodes; will update the README when that's done.

s2hui commented 3 years ago

Hi That worked! I executed the following code:

>>> import scanpy as sc
>>> import pandas as pd
>>> import numpy as np
>>> adata = sc.read_h5ad('out_final.h5ad')
>>> gvg_hits = pd.read_csv('out_graph_vs_graph.tsv',sep = '\t')
>>> g = gvg_hits[(gvg_hits.gex_cluster == 5) & (gvg_hits.tcr_cluster == 2)]
>>> clonotypes = np.unique(g.clone_index)
>>> print(clonotypes.shape[0])
25
>>> gg = adata.obs.iloc[clonotypes]
>>> cells = np.sum(gg.clone_sizes)
>>> cells
52

Thanks for your help! @s2hui

AlicePsyche commented 2 years ago

Hi phbradley,

May I ask that do you have any updates about this?

If you want the cell barcodes, you could look for the corresponding tcrs in the 10x filtered_contigs file, or use the *_clones.tsv.barcode_mapping.tsv file to map from clone_id to barcodes. We're planning on adding another output to make it easier to trace back to cell barcodes; will update the README when that's done.

Sorry if it was a simple question. I am not very familiar with python so I prefer to convert the output h5ad file and do the analysis in R. The problem is that it looks like there is no clone_id in the meta data (obs). So how can I extract the cellbarcode information from *clones.tsv.barcode_mapping.tsv file? I found one very interesting TCR cluster and would like to check all the cells in my scRNA-seq data. Thanks a lot for your help!

Best, Alice