10x parser wrongly uses UMI as clones / proportion

KyleTCL commented 4 years ago

🐛 Bug

As titled, the 10x parser wrongly uses the UMI column as the clone count. However, in 10x data the number of barcodes gives the number of cells, while 'UMI' counts transcripts.

To Reproduce

Steps to reproduce the behavior:

1. Read the 10x consensus_annotation.csv with repLoad: immdata <- repLoad("/path/to/consensus_annotation.csv", .format = "10x")
2. View the data: head(immdata$data)

Expected behavior

Count the number of barcodes with the same VDJ (perhaps using just the CDR3 at the amino acid level) and use that as the clone count.

Additional context

Since consensus_annotation.csv contains no barcode information, the parser probably needs to use filtered_contig_annotation.csv instead.
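To make the expected behavior concrete, here is a minimal base R sketch on a made-up table that mimics a few columns of the contig annotation file. The column names (barcode, chain, cdr3, umis) are assumed to follow the 10x contig annotation layout, all values are invented, and summing UMIs is only one way a UMI-based count could come out. This is not immunarch's parser.

```r
# Toy stand-in for a contig annotation table (values are made up).
contigs <- data.frame(
  barcode = c("AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"),
  chain   = c("TRB", "TRB", "TRB", "TRB"),
  cdr3    = c("CASSLGQETQYF", "CASSLGQETQYF", "CASSLGQETQYF", "CASSPDRGNTEAFF"),
  umis    = c(2, 9, 1, 30),
  stringsAsFactors = FALSE
)

# UMI-based count per CDR3: this measures transcripts, not cells.
umi_counts <- tapply(contigs$umis, contigs$cdr3, sum)

# Barcode-based count per CDR3: number of distinct cells carrying it.
cell_counts <- tapply(contigs$barcode, contigs$cdr3,
                      function(b) length(unique(b)))

print(umi_counts)
print(cell_counts)
```

In this toy table the second CDR3 looks like the largest clone by UMIs although it comes from a single cell, which is the mismatch reported above.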

vadimnazarov commented 4 years ago

Hi, thank you for noticing that! We will fix it in the upcoming release at the beginning of the next week.

vadimnazarov commented 4 years ago

Hi, we looked into it more thoroughly and found that it's quite complicated. The key problem is how to define a clonotype here - same CDR3 aa? Same CDR3aa + V + J? Same CDR3aa alpha and beta? And what to do in the case of two alpha chains? So it's a much deeper issue than we expected at first. We will look into it after this release to make sure immunarch is moving in the direction of single-cell support. However, is there anything else, probably something very simple and basic, that we can do to help you with single-cell analysis? For example, we can add an additional column called "Barcode" with barcodes from the original files so you can process them yourself.

KyleTCL commented 4 years ago

1. I would suggest defaulting to CDR3aa as the clonotype definition, and perhaps providing a stricter alternative that also looks at CDR3nt and/or V/J genes. For example:

immdata <- repLoad("/path/to/data.csv", .format = "10x", .clonotype = "CDR3aa")

Besides, it may be a good idea to also look at the alpha and beta chains separately. (Having both in the same dataset doubles the clonotype repertoire, i.e. the same TCR is counted as two different clonotypes because its chains are separate entries in the data frame.) This would have to be solved either by defining a clonotype as a paired alpha/beta, or perhaps by just using the 10x data as it is, regardless of double alpha/beta chains; users can clean the data as they see fit. However, I am not sure how this can be integrated into your package while staying compatible with other, non-single-cell methods.
2. An additional column for Barcode will be excellent! This will make integration of transcriptome data easier with the immunarch workflow.
3. As for additional features, a visualization such as a scatter plot for comparing 2 samples would be nice.
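The lenient and strict clonotype definitions from point 1 can be sketched in base R as below. count_cells is a hypothetical helper, not an immunarch function, and the columns (barcode, chain, cdr3, cdr3_nt, v_gene, j_gene) are assumed to follow the 10x contig annotation layout. The cdr3_nt values are placeholders rather than real nucleotide sequences.

```r
# Hypothetical helper (not part of immunarch): count cells per clonotype,
# where `key` names the columns that define a clonotype.
count_cells <- function(contigs, key = "cdr3") {
  # Keep one row per (barcode, clonotype) so each cell is counted once.
  cells <- unique(contigs[, c("barcode", key), drop = FALSE])
  counts <- aggregate(cells["barcode"], by = cells[key], FUN = length)
  names(counts)[names(counts) == "barcode"] <- "Clones"
  counts[order(-counts$Clones), , drop = FALSE]
}

# Made-up contigs: cells c1 and c2 share a CDR3aa but differ in CDR3nt.
contigs <- data.frame(
  barcode = c("c1", "c2", "c3", "c1", "c2"),
  chain   = c("TRB", "TRB", "TRB", "TRA", "TRA"),
  cdr3    = c("CASSLGQETQYF", "CASSLGQETQYF", "CASSPDRGNTEAFF",
              "CAVRDSNYQLIW", "CAVRDSNYQLIW"),
  cdr3_nt = c("ntB1", "ntB2", "ntB3", "ntA1", "ntA1"),
  v_gene  = c("TRBV7-9", "TRBV7-9", "TRBV5-1", "TRAV1-2", "TRAV1-2"),
  j_gene  = c("TRBJ2-5", "TRBJ2-5", "TRBJ1-1", "TRAJ33", "TRAJ33"),
  stringsAsFactors = FALSE
)

# Handle each chain separately so alpha and beta do not inflate the repertoire.
by_chain <- split(contigs, contigs$chain)

lenient <- count_cells(by_chain$TRB, key = "cdr3")
strict  <- count_cells(by_chain$TRB, key = c("cdr3_nt", "v_gene", "j_gene"))

print(lenient)
print(strict)
```

With the lenient key, c1 and c2 collapse into one beta clonotype of two cells; with the strict key they stay separate because their CDR3nt differs.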

vadimnazarov commented 4 years ago

Hi @KyleTCL, we updated the package to 0.5.4. It correctly parses and extracts clonotypes. To filter clones by barcodes, use the filter_barcode function. Can you please try it and get back to us if any problems arise?

EugeneRumynskiy commented 4 years ago

The issue was solved.

YiweiNiu commented 4 years ago

Hello, sorry to re-open the closed issue.

When using immunarch to read 10x output (specifically, filtered_contig_annotations.csv), I found clonotypes were defined by CDR3 nucleotide sequence+V gene+J gene+Chain. @vadimnazarov also confirmed this here.

I have one question: why not just use clonotype id like 'raw_clonotype_id' or other column names specified by users to define clones?

If so, users can also generate input files by themselves and not worry about the definition of clonotypes.

vadimnazarov commented 4 years ago

Hi @YiweiNiu ,

A very important question, thank you for asking! The problem lies in different approaches to data analysis:

Single-cell is focused on the cell-level.
AIRR analysis is focused on repertoires.

So in order to compute statistics such as diversity or gene usage, we should know the number of clones per clonotype, i.e., merge "single-cell-clonotypes" into "airr-clonotypes". "Raw_clonotype_id" is about sequence objects only, and AIRR analysis tools work with "sequence+counts" objects.
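As a rough illustration of merging cell-level rows into "sequence+counts" objects, the base R sketch below collapses a made-up cell table into one row per clonotype. The output column names (CDR3.aa, V.name, J.name, Clones, Proportion, Barcode) and the semicolon-joined barcode list are assumptions chosen for illustration, and grouping by CDR3aa + V + J is just one possible definition. This is not the code immunarch runs.

```r
# Made-up cell-level table: one row per cell (barcode).
cells <- data.frame(
  barcode = c("c1", "c2", "c3", "c4"),
  cdr3    = c("CASSLGQETQYF", "CASSLGQETQYF", "CASSLGQETQYF", "CASSPDRGNTEAFF"),
  v_gene  = c("TRBV7-9", "TRBV7-9", "TRBV7-9", "TRBV5-1"),
  j_gene  = c("TRBJ2-5", "TRBJ2-5", "TRBJ2-5", "TRBJ1-1"),
  stringsAsFactors = FALSE
)

# Group cells that share the same sequence definition (CDR3aa + V + J here).
groups <- split(cells, list(cells$cdr3, cells$v_gene, cells$j_gene),
                drop = TRUE)

# One row per clonotype: the sequence fields plus the number of cells.
repertoire <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    CDR3.aa = g$cdr3[1],
    V.name  = g$v_gene[1],
    J.name  = g$j_gene[1],
    Clones  = length(unique(g$barcode)),
    Barcode = paste(unique(g$barcode), collapse = ";"),
    stringsAsFactors = FALSE
  )
}))
rownames(repertoire) <- NULL

# Repertoire-level statistics need relative abundances as well.
repertoire$Proportion <- repertoire$Clones / sum(repertoire$Clones)

print(repertoire)
```

Keeping the barcodes next to each clonotype is what lets cell-level metadata be joined back later.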

YiweiNiu commented 4 years ago

Hi @vadimnazarov ,

Thank you for your reply! New to this field, I apologize if my questions are too naive. I am working with 10x VDJ data and not familiar with AIRR. I would like to describe the challenges I encountered.

I have scTCR-seq for several samples from different tissues, i.e. blood, normal tissue, and tumor. Because I wanted to define clonotypes using both TCRA and TCRB chains, I could not use immunarch directly. My plan was to define clonotypes myself and then feed them to immunarch for high-level analysis (tasks such as gene usage, repertoire overlaps, diversity). As for the 'number of clones per clonotype', I could also count it myself.
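A paired definition of this kind could look roughly like the base R sketch below, which joins the alpha and beta contigs of each cell by barcode. The column names (barcode, chain, cdr3) are assumed to follow the 10x contig annotation layout and the data is invented. Dropping cells without both chains is one possible rule among several, and cells with two alpha chains would need an explicit rule that is not handled here.

```r
# Made-up contigs: c1 and c2 carry both chains, c3 only has a beta chain.
contigs <- data.frame(
  barcode = c("c1", "c1", "c2", "c2", "c3"),
  chain   = c("TRA", "TRB", "TRA", "TRB", "TRB"),
  cdr3    = c("CAVRDSNYQLIW", "CASSLGQETQYF",
              "CAVRDSNYQLIW", "CASSLGQETQYF",
              "CASSPDRGNTEAFF"),
  stringsAsFactors = FALSE
)

alpha <- contigs[contigs$chain == "TRA", c("barcode", "cdr3")]
beta  <- contigs[contigs$chain == "TRB", c("barcode", "cdr3")]

# Inner join on barcode: only cells with both an alpha and a beta chain remain.
# A cell with two alpha chains would produce two rows here.
paired <- merge(alpha, beta, by = "barcode", suffixes = c("_TRA", "_TRB"))

# Paired clonotype label built from both CDR3 amino acid sequences.
paired$clonotype <- paste(paired$cdr3_TRA, paired$cdr3_TRB, sep = "_")

# Number of cells per paired clonotype; barcodes stay available in `paired`.
clone_sizes <- table(paired$clonotype)

print(paired)
print(clone_sizes)
```

Because `paired` keeps one row per barcode, cluster or tissue labels can be merged onto it by barcode before any counting.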

But after reading part of the source code, I found it difficult to integrate with immunarch, as immunarch has its own way of defining clonotypes and computing such statistics. Also, since comparing different groups of cells (such as different clusters or different tissues) is needed, cell barcodes and TCR sequences should be in one data frame.

I wanted to use change-o to define clonotypes in the analysis of scBCR-seq. This would run into the same problem as scTCR-seq.

Best, Yiwei

vadimnazarov commented 4 years ago

Hi @YiweiNiu

Thank you for the feedback and thank you for describing your challenges! I appreciate this, it will help us make the package better. I have several questions for you to make sure I understood you correctly.

1) "I wanted to define clonotypes using both TCRA and TCRB chains" - would you like to use V/J genes as well? Why?

2) "tasks such as gene usage, repertoire overlaps, diversity" - What would be the hypotheses you plan to test? What are your goals with this analysis?

3) "comparing different groups of cells (such as different clusters or different tissues)" - What types of analysis would you like to do that require clone-level metadata?

4) "I wanted to use change-o to define clonotypes in the analysis of scBCR-seq" - What BCR analysis methods would you like to apply to the BCR data?

Our next major milestones are full support for paired single-cell data and BCR analysis. Your feedback on your analysis goals and routines is greatly appreciated. It will accelerate the development a lot, as we will follow specific use cases. Thank you!

vadimnazarov commented 4 years ago

Hi @YiweiNiu

Would you prefer to discuss these questions via a quick 20-minute Zoom call? If so, feel free to send an email to support@immunomind.io and we will schedule a call.

vadimnazarov commented 4 years ago

Pinging @YiweiNiu

vadimnazarov commented 4 years ago

We have initial single-cell exploration routines here: https://immunarch.com/articles/web_only/v21_singlecell.html

Please create separate issues in case of additional feature requests and bugs. Thank you!
