liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
256 stars 47 forks source link

Help and idea #160

Open beginner984 opened 1 year ago

beginner984 commented 1 year ago

Hi

I need your personal point of view please

I have two bulk RNA-seq patients (PBMC) on which I run mixcr

For the same two patients I have 5' single cell on which I run TRUST4

I am seeing more clones derived by sc RNA-seq and more clonality derived by Bulk RNA-seq

I named my bulk RNA-seq samples as TCR and my sc RNA-seq as TRUST

Rplot02 Rplot07

Please, personally do you have any idea what could be an interpretation for this?

Thank you for any help

mourisl commented 1 year ago

It is expected to only get hundreds of TCRs from bulk RNA-seq data. This is because TCRs are not as highly expressed as many other genes, and it also depends on the infiltration level of the immune cells. You can also try TRUST4 on the bulk RNA-seq data, which I think usually is more sensitive than MiXCR.

In the single-cell data, since there is a good amount of T cells, so we are guaranteed to find thousands of T cells, and TRUST4 can reconstruct TCRs for many of them.

You can think that in single-cell data, the gene expression within a cell is not confounded by other cells, so it is easier to get the cell-specific sequences.

Based on the number, it feels like the single-cell data has higher clonality. In bulk RNA-seq data, the clonotype count is the expression value. In scRNA-seq data, the clonotype count is the number of cells bearing that clonotype, which is a more accurate estimation.

I'm not sure I get your questions, and hope this helps.

mourisl commented 1 year ago

It's very similar to the command of running the 10X data, except that you don't need the --barcode option. A typical command can be:

{TRUST4_PATH}/run-trust4 -f {TRUST4_PATH}/human_IMGT+C.fa --ref {TRUST4_PATH}/human_IMGT+C.fa -1 P_0001_S19_L001_R1_001.fastq.gz -2 P_0001_S19_L001_R2_001.fastq.gz -o P_0001_S19 -t 8
beginner984 commented 1 year ago

Thank you very much

As I had mentioned I have had two PBMCs on them we profiled with conventional Bulk RNA-seq and 10X 5' scRNA-seq (GEX)

This time, I run TRUST4 on my bulk RNA-seq and scRNA-seq separately and tried to compared the results

I have attached some visualisation here

Rplot-6 Rplot01-12 Rplot02-12 Rplot03-10

My interpretation is: scRNA-seq gives more richness while bulk RNA-seq gives more evenness (and clonality ?)

Please correct me if my interpretation is wrong

Thanks a lot in advance

mourisl commented 1 year ago

Just want to confirm, is the single-cell data filtered by Seurat QC'ed cells (or other method you used)? Because of the leaked mRNAs to the empty droplets, there can be null cells with TCRs/BCRs. These cells should not be considered valid.

Based on the current results you showed, scRNA-seq has higher richness. It seems there are some super-expanded clonotypes in scRNA-seq, so it also has higher clonality (lower evenness). So your interpretation is correct.

beginner984 commented 1 year ago

Thanks a lot

I have used fastq files as input for TRUST4 so I have not filtered anything Im affraid🙁

Sorry I got confused here, as I googled, eveness is the same as clonality so having clonally expanded T cells in single cell means a lower eveness, that means clonality negatively correlated with eveness, right? Please correct me if I am wrong

mourisl commented 1 year ago

Then I guess the current clonality from scRNA-seq might be inflated if we considered those null cells. This is because the null cells are usually contaminated by relatively highly expressed mRNAs, so if one T cell's TCR has high expression, we will see it in multiple null cells. So we need to filter the cells with too few gene expressed or other QC criteria to make sure the clonotype analysis is accurate.

Evenness is defined as the normalized Shannon Entropy, and clonality is 1-evenness. So they are perfectly negatively correlated.

beginner984 commented 1 year ago

Thank you very much

At the moment I have fastq/bam files from 10x 5' single cell RNA-seq (GEX)

I really do not know how to filter those null cells before feeding my fastq files into TRUST4

Any clue/intuition please to get rid of droplets

I need an accurate comparision between bulk RNA-seq and scRNA-seq

Thank you for any help

mourisl commented 1 year ago

You can run Seurat to conduct standard scRNA-seq analysis. The barcodes kept in the clusters are the valid cells. I think once you have processed the data with Seurat, you can use tools like scRepertoire to compute the diversities.

beginner984 commented 1 year ago

Thank you very much But the question is, which input I should feed into TRUST4 to construct clonotype structure after filtering? I can not understand the relashionship between Seurat and TRUST4 here. From cellranger output, I can do Seurat on my 5' scRNA-seq, but, after that, how I could feed that into TRUST4?

mourisl commented 1 year ago

You shall have the trust4_barcode_report.tsv or trust4_barcode_airr.tsv files in your output for the scRNA-seq data. These files have barcode information in them, where each barcode(cell) has the assembled TCR or BCR in it. Then you can filter the cells not retained in Seurat analysis.

beginner984 commented 1 year ago

Sorry I need your kindly point of view in making some conclusion from my data

I have had two healthy PBMCs; before running my main experiments on more samples

I profiled those two samples with conventional bulk RNA-seq, 10x 5' GEX, 10x BCR and 10x TCR

I then run TRUST4 on my fastq files (paired end)

And I finally visualized TRUST4 report.tsv with Immunarch

Does my conclusion is right that with 10x TCR kit I get more clonotypes so 5' GEX can not compensate 10x TCR in getting full length CDR3 even by running TRUST4 on my data?

To be honest, I read your paper, but I do not know yet having these data how I could know if 5' GEX is enough? Do I still need dedicated TCR/BCR kits to get full length CDRs

Although likely bulk RNA-seq gives less clonotypes

Actually I do not know how to compare the results to my make a conclusion

Rplot-7

Rplot-8

Unknown-49

Unknown-51

Thanks for any help in advance

mourisl commented 1 year ago

For the single-cell data, you shall run TRUST4 with "--barcode" option, and use the "trust_barcode_report.tsv" file for the analysis. Otherwise, it will treat the sample as a pseudobulk and you lose the cell information. Running as pseudobulk also makes the results incomparable.

If you are comparing GEX vs 10x TCR/BCR kit, the TCR/BCR kit definitely has higher sensitivity as it amplified the data experimentally. However, the advantage of 5' GEX is the cost. The 5' GEX costs the same as the 3' scRNA-seq data, and it also provides a decent information of immune repertoire. So in sum, if you have enough funding, 10X TCR/BCR will give you a nice coverage of immune repertoire of the good-quality cells. If the cost is a big concern, you can use 5' GEX and run TRUST4 to extract the CDR3s, where the sensitivity can be about half of the TCR/BCR kit (could be higher for certain cell types).

beginner984 commented 1 year ago

Thank you very much

Regarding the Seurat filtering

For 5' GEX

> list.files()
[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"  
> 
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = getwd())
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 10)

So, I should use the remained barcodes in Seurat object to only filter TRUST_GEX_S6_L001_R2_001_barcode_report.tsv

Or I should also filter TRUST_TCR_S6_L001_R2_001_barcode_report.tsv and TRUST_BCR_S6_L001_R2_001_barcode_report.tsv files by the remained barcodes in Seurat object?

I have done like this

gex1=read.delim("TRUST_GEX_S6_L001_R2_001_barcode_report.tsv")

> dim(gex1)
[1] 2344    6

gex1$X.barcode <- paste0(gex1$X.barcode, "-1")

I then filtered TRUST4_barcode.tsv by remained barcodes in Seurat after filtering by Seurat criteria

`

gex11=gex1[!gex1$X.barcode%in%colnames(pbmc@assays$RNA@data),]

gex11$X.barcode = substr(gex11$X.barcode,1,nchar(gex11$X.barcode)-2)

write.table(gex11,"GEX11.tsv",quote = FALSE,sep = "\t") `

> dim(gex11)
[1] 743   6

You are seeing from 2344 clonotypes, 743 remained

Am I doing right please? Shall I repeat this for BCT and TCR barcode.tsv too?

I have attached the resulted file after some manipulation here

GEX11.txt

But Immunarch can not read this anymore

> immdata <- repLoad(.path="/data/file/folder/TRUST4-master/scripts/TCR/a/")

== Step 1/3: loading repertoire files... ==

Processing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/a/" ...
  -- [1/1] Parsing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/a//GEX11.tsv" -- unsupported format, skipping

== Step 2/3: checking metadata files and merging files... ==

Processing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/a/" ...
  -- Metadata file not found; creating a dummy metadata...

== Step 3/3: processing paired chain data... ==

Done!

Warning messages:
1: Unknown or uninitialised column: `Sample`. 
2: Unknown or uninitialised column: `Sample`.

I do not know in which part of filtering I am doing wrong that Immunarch does not recognise that anymore

Thank you once more

mourisl commented 1 year ago

I think what you did is right. You shall also filter the TCR, BCR barcode report files because TCR/BCR kit also has information on the null cells.

Immuarch is for bulk data. For the single-cell data analysis, you can use tools like scRepertoire or Platypus for the analysis.

beginner984 commented 1 year ago

Sorry you meant I should use trust_barcode_report.tsv for scRepertoire or Platypus as an input? Shall I use trust-barcoderep-to-10X.pl to convert trust_barcode_report.tsv first? Although I tried and I got error please have a look at my error

fi1d18@Server2:/TRUST4-master/scripts/TCR/scripts/scripts$ ls
AddSequenceToCDR3File.pl  GetFullLengthAssembly.pl    trust-cluster.py
barcoderep-expand.py      README.md                   trust-stats.py
barcoderep-filter.py      _t.csv
_b.csv                    trust-barcoderep-to-10X.pl
fi1d18@Server2:/TRUST4-master/scripts/TCR/scripts/scripts$ perl trust-barcoderep-to-10X.pl /data/TRUST4-master/scripts/TCR/scripts/TCR/TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_report.tsv
Use of uninitialized value $prefix in concatenation (.) or string at trust-barcoderep-to-10X.pl line 54.
Use of uninitialized value $prefix in concatenation (.) or string at trust-barcoderep-to-10X.pl line 55.

Any help please?

mourisl commented 1 year ago

They surrport trust_barcode_report.tsv format I believe, so there is no need to convert it to 10X format.

The trust-barcoderep-to-10X.pl requires a second arguments to specify the output file prefix, and it outputs as "XXX_t.csv" and "XXX_b.csv"

beginner984 commented 1 year ago

Thank you, specifying the output file prefix, and it outputs as "XXX_t.csv" did not give error

fi1d18@Server2:/data//TRUST4-master/scripts/TCR/scripts/scripts$ perl trust-barcoderep-to-10X.pl /dataTRUST4-master/scripts/TCR/scripts/TCR/TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_report.tsv /data/TRUST4-master/scripts/TCR/scripts/TCR/TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv

But I am confused why from TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_report.tsv input which is from TCR kit, I have BCR results too

I got these two from TCR

TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv_t.csv

and

TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv_b.csv

mourisl commented 1 year ago

There might be some false positive BCRs in the TCR data, but I believe the fraction will be quite tiny and these could be from the null/ambient cells. They usually will be filtered by matching with Seurat. You can safely ignore the _b.csv file for the TCR kit file.

beginner984 commented 1 year ago

Thank you very much, although Immunarch was working very smooth with TRUST4_report.tsv file but likely does not work with barcode file even after conversion with the perl script (or might be I am making a mistake somewhere)

> immdata <- repLoad(.path="/data/TRUST4-master/scripts/TCR/scripts/TCR/a/")

== Step 1/3: loading repertoire files... ==

Processing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/scripts/TCR/a/" ...
  -- [1/2] Parsing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/scripts/TCR/a//TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv_b.csv" --   -- [2/2] Parsing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/scripts/TCR/a//TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv_t.csv" -- 
== Step 2/3: checking metadata files and merging files... ==

Processing "/data/Continuum/Angel/TRUST4-master/scripts/TCR/scripts/TCR/a/" ...
  -- Metadata file not found; creating a dummy metadata...

== Step 3/3: processing paired chain data... ==

Done!

Warning messages:
1: Unknown or uninitialised column: `Sample`. 
2: Unknown or uninitialised column: `Sample`. 
> list.files()
[1] "TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv_b.csv" "TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t.csv_t.csv"
> immdata$data
named list()
>

As you are seeing immunarch object is empty :(

mourisl commented 1 year ago

Immuarch is not for single-cell analysis, so it does not read any single-cell format. You can use other methods specific for single-cell immune repertoire analysis, such as scRepertoire. If you just want to get some simple statistics values, you can use the "trust-stats.py" in the scripts folder. You can specify the format through the "-f" option there.

beginner984 commented 1 year ago

Thank you very much for bearing with users

I think those statistics would be very useful

I tried this

fi1d18@Server2:/data/TRUST4-master/scripts/TCR/scripts/scripts$ python trust-stats.py -f csv /data/TRUST4-master/scripts/TCR/scripts/TCR/TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_report.tsv /data/TRUST4-master/scripts/TCR/scripts/TCR/TRUST_PN0340_0002_TCR_S4_L001_R2_001_barcode_t1.csv
usage: trust-stats.py [-h] -r REPFILE [-f FORMAT] [--ntaa NTAA]
trust-stats.py: error: argument -r is required
mourisl commented 1 year ago

You need to use -r option to specify the report file. If it is trust_barcode_report file you need to add "-f TRUST4_barcode_report". Note that this script will NOT filter the barcodes with Seurat object. It is still better to use scRepertoire for more rigorous analysis on the single-cell data.

beginner984 commented 1 year ago

Thanks a million

After filtering by Seurat object and conversion with the perl script, I have taken intersection of clonotypes from TCR/BCR kit and 5' GEX, like below

git

I looked at VDJdb database, from those 65 unique clonotypes specific to 5' GEX, only one was found in VDJdb' related toB2M, and from those 229 clonotypes common between 5' GEX and TCR kit, 95 found inVDJdb` database, does this mean if the rest are novel TCRs or just false positive/ambient/null cells please? Thank you so much for sharing your ideas with us

mourisl commented 1 year ago

From my experience, those clonotypes only found in GEX are still likely to be real in PBMC. On example is that TRUST4 can find gdT cells and the 10X kit will miss those (I think). On tumor samples, there will be some false positives, especially in BCR space due to the overexpression of BCR mRNAs from plasma cells. In that case, you can run the "barcoderep-filter.py" in the scripts folder. The VDJdb annotation is very sparse, so most of the TCRs from a study will be novel TCRs. You can also check the 4368 clonotypes from TCR1.

It seems the sensitivity for the TCRs from GEX part is surprisingly low, are you using TCR alpha/beta pairing as the definition of clonotype? Since the capture rate in GEX for TCR/BCR is low, I usually only use one chain to represent the clonotype, TRB/IGH.

beginner984 commented 1 year ago

I used cdr3 sequences from GEX and TCR for Venn diagram creation Am I right, please? Thank you very much

mourisl commented 1 year ago

Yes, this is right. Are they from TRB and IGH, respectively?

beginner984 commented 1 year ago

Thanks a lot For these Venn diagram I have used TRB/IGH only

3

Although samples are healthy PBMCs (not tumour), for BCR filtering I used

$ python3 barcoderep-filter.py -b /data/TRUST4-master/scripts/TCR/BCR1.tsv > newreport.tsv

Which gives

4

Shall I filter my other files (GEX, TCR) too? Am I doing right please? I am just wondering why I am getting more BCRs after filtering

Also, I noticed scRepertoire does not visualise BCR data (might be I am wring)

> S1 <- read.delim("BCR1.tsv")
> S2 <- read.delim("BCR2.tsv")
> contig_list <- list(S1, S2)
> combined <- combineTRUST4(contig_list,samples = c("bcr1","bcr2"),removeNA=TRUE)
> quantContig(combined, cloneCall="gene+nt", scale = TRUE)

Rplot01


bcr=read.delim("BCR1.tsv")
> head(bcr[1:2,1:4])
X.barcode cell_type chain1
1 ATGAGCCACACGATGT         B      *
  2 CTACTTTGTTCAACCA         B      *
  chain2
1 IGLV1-40*01,*,IGLJ3*02,*,TGCCAGTCCTTTGACAACAGCCTGAGTGGTCCGGTGTTT,CQSFDNSLSGPVF,1.00,ATGAGCCACACGATGT_28117,94.29,0
2            IGKV3-11*01,*,IGKJ3*01,*,TGTCAGCAGCGTAGCAACTGGTTCACTTTC,CQQRSNWFTF,1.97,CTACTTTGTTCAACCA_53325,100.00,0
>

Thanks a lot in advance