scverse / scirpy

A scanpy extension to analyse single-cell TCR and BCR data.
https://scirpy.scverse.org/en/latest/
BSD 3-Clause "New" or "Revised" License
214 stars 34 forks source link

NAs in the TCR VJ usage #320

Closed GouQiao closed 2 years ago

GouQiao commented 2 years ago

Dear scirpy developer,

Thank you to develop this amazing pacakage for analyzing the scTCR. I have one question about the results. When I analyzed the data with multiple clusters with different conditions, the analysis went successfully. But if I only analyze one cluster with different conditions, I couldn't go through ir.tl.chain_qc(adata), which showed:

AssertionError                            Traceback (most recent call last)
<ipython-input-20-477afca22440> in <module>()
----> 1 ir.tl.chain_qc(adata)

2 frames
/usr/local/lib/python3.7/dist-packages/scirpy/tl/_chain_qc.py in _chain_pairing(adata, mask_ambiguous, mask_has_ir, mask_multichain)
    196     for m in [mask_has_vj1, mask_has_vdj1, mask_has_vj2, mask_has_vdj2]:
    197         # no cell can have a junction_aa sequence but no TCR
--> 198         assert np.setdiff1d(np.where(m)[0], np.where(mask_has_ir)[0]).size == 0
    199 
    200     results[~mask_has_ir] = "no IR"

AssertionError: 

Then I just skipped to VJ and VDJ gene usage analysis, but for the result of VJ gene usage, there is NAs generated , like the following.

Screen Shot 0034-01-31 at 1 57 22 PM

Do you know what caused the NAs and how to solve this problem?

Best.

grst commented 2 years ago

Hi @GouQiao,

thanks for raising the issue!

There seems to be a problem with the input data (there is a CDR3 amino acid sequence, but it was not detected as a receptor, which up to now, I believed should never happen). How was that data generated (10x?) and how do you load it into scirpy?

Then I just skipped to VJ and VDJ gene usage analysis, but for the result of VJ gene usage, there is NAs generated , like the following.

This is not too unexpected, especially if you skip the QC part - for some cells the TCR reconstruction tools are note able to determine all genes (e.g., because no reads were detected in the relevant region). In that case the V-gene is a missing value.

GouQiao commented 2 years ago

Hi Grst,

Thank you for the reply! I load the data from seurat processed rds and then convert to h5ad. Is this the reason?

Best.

grst commented 2 years ago

I load the data from seurat processed rds and then convert to h5ad. Is this the reason?

Also the TCR data? How do you do that? (The gene expression data should not matter, and it's perfectly fine to import it from Seurat)

GouQiao commented 2 years ago

I load the data from seurat processed rds and then convert to h5ad. Is this the reason?

Also the TCR data? How do you do that? (The gene expression data should not matter, and it's perfectly fine to import it from Seurat)

Yes also TCR data. I extract the TCR data as metadata. So the obs actually looks fine.

grst commented 2 years ago

The value of adata.obs['has_ir'] is incorrect somewhere. How do you read the TCR data into Seurat?

GouQiao commented 2 years ago

The value of adata.obs['has_ir'] is incorrect somewhere. How do you read the TCR data into Seurat?

Firstly, I merge the gene expression data and TCR data by scirpy, then I convert to seurat rds file for further analysis. After finishing in scirpy, I converted rds to h5ad for TCR analysis.

grst commented 2 years ago

I see! Could you check if the same thing happens if you use the merged data directly, without converting to Seurat?

Like that we can check if something goes wrong during the conversion, or already during reading in the data.

GouQiao commented 2 years ago

I see! Could you check if the same thing happens if you use the merged data directly, without converting to Seurat?

Like that we can check if something goes wrong during the conversion, or already during reading in the data.

Actually I do the same thing for another dataset. It works perfectly! The only difference is , this dataset only contains one cluster , and the worked dataset have multi-clusters. So I think is this because of this?

GouQiao commented 2 years ago

I see! Could you check if the same thing happens if you use the merged data directly, without converting to Seurat?

Like that we can check if something goes wrong during the conversion, or already during reading in the data.

Hi Gregor,

Sorry to bother you again. But do you have any idea to solve this problem?

Best.

grst commented 2 years ago

Sorry for the delay!

The only difference is , this dataset only contains one cluster , and the worked dataset have multi-clusters.

I doubt that's the problem

Could you check if the same thing happens if you use the merged data directly, without converting to Seurat?

Have you tried that out?

GouQiao commented 2 years ago

Hi Gregor,

Because my aim is to compare the TCR repertoire of interested cluster under two different conditions. I am now thinking instead of subsetting the interested cluster in seurat, I should subset this cluster after QC in scirpy. I will try this way first ! Thank you!

GouQiao commented 2 years ago

Sorry for the delay!

The only difference is , this dataset only contains one cluster , and the worked dataset have multi-clusters.

I doubt that's the problem

Could you check if the same thing happens if you use the merged data directly, without converting to Seurat?

Have you tried that out?

GouQiao commented 2 years ago

Hi Gregor,

After trying different ways, I found the matter is because I merged the objects in seurat, and converted to h5ad files. However, scirpy can't accept the merged objects by seurat, so I have to merge in scirpy instead. This works out. But if I want to keep the clusters analyzed in seurat, is there a better way to solve this problem?

Best.

grst commented 2 years ago

Hi @GouQiao,

glad you figured out a way! One straightforward way to keep the clusters would be to export the cluster id and cell barcode from Seurat to a CSV file and read that from Python. The cell barcode should be consistent between Seurat and scanpy/scirpy.