Open learning-MD opened 1 year ago
You shall use --barcode option to specify the barcode file, and --barcodeRange to specify which positions in the barcode file correspond to the barcode sequence. Do you mean the barcode sequence is split into to two files, e.g. the first 8bp in FB_R1, the second 8bp in FB_R2? If the barcode is quite complex, I think starting from BAM file is better as the BAM file has the processed barcodes.
Yes, the "cell_id" in the airr format is the barcode.
I think the low sensitivity could be that since each lane sequences multiple samples, the number of reads in each cell might be less than a typical scRNA-seq data, which decreases the sensitivity of obtaining CDR3 information.
Thanks @mourisl! We had targeted 50,000 reads per cell for the gene expression libraries and 5,000 reads per cell for the feature barcode libraries. I would think that that should make up for the multiple samples per lane? I'm also a little confused as to why the airr_report would have >4,000 observations with unique cell_ids - I guess it is theoretically possible that all the QC/filtering I did on the gene expression side (Seurat) could have filtered out most of those cells.
I'll try re-running the BAM and maybe re-running CellRanger with the -include-introns option and see that affects things any differently. I'm open to any other thoughts/suggestions.
Yes, many of the barcodes containing CDR3 failed the QC test. The include-intron probably won't change the BAM file and TRUST4's results. But it could change the read count/cell and the QC results.
I'm not familiar with the multiome kit sequencing both gene expression and surface marker (I think this Is you case), is the RNA in the gene expression from the nucleus or they are poly-a mRNAs?
Thanks @mourisl - it's single-cell RNA-seq. We did not do nucleus-sequencing. If it helps, it wasn't the Multiome kit (that's the snRNA- + snATAC-seq combo). We just added hashtag oligos to samples prior to pooling them and doing the usual 5' single-cell capture and gene expression (https://satijalab.org/seurat/articles/hashing_vignette.html).
That's a bummer to hear the total number of CDR3-containing barcodes (after QC) will just be a fraction of the cells - it's good to know, as we may do the combo GEX + 10X V(D)J for additional experiments in the future to increase the yield.
As a last, potentially dumb question, does using a BAM file from CellRanger 7.0 affect the results if I'm trying to merge with gene expression data from CellRanger 6.1.0 output? I'd imagine it wouldn't do anything significantly since we can recreate the fastq from the BAM file. But, just due to immediate availability of samples I had, I ran TRUST4 on the CellRanger 7.0 output while my completed GEX analyses were initially done on the 6.1.2 outputs - not sure how much I should worry about mismatch here because of that.
Thank you so much for all your help and prompt support!
I don't think the BAM file will differ much between CellRanger 7.0 and CellRanger 6.1.0 . A few things you might want to check is whether the BAM file contains unmapped reads and whether it contains some strange contigs that correspond to the VDJ region.
Could you please show me a few lines of the BAM file just to make sure the barcode format is right?
Sorry for the delay. Below is what I see:
~$ samtools view samples.bam | head -n 5
A01177:64:HTJ7KDSX3:2:1407:4282:35900 163 chr1 10189 255 12S64M169913N75M = 180283 170206 CCTAACCCTAACCCCTAACCCTAATCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAAACCCTAACCCAAAAACCCTAACCATAACCCTAACCCTAACCCTAAACCTAACCCTAATCCTTACCCTTACCCTAACC FFFFFFFFFFFFFFFF:FFFF,FFFFFFFFFFF,FF,FFFFFFFFFFFFF:FFFFFFF::FFFF,FFFFF,FFFFFFF,FF:,F,FFFF::F:FFFFF,F:FFFF,FFFFF,:F,::,F,F:,,F,:F,:,,F:,::F,FF,FFF,,,F:F NH:i:1 HI:i:1 AS:i:231 nM:i:8 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:4 CR:Z:GTACGTACAGCCACCA CY:Z:FF:FFFFFFFFFFFFF CB:Z:GTACGTACAGCCACCA-1 UR:Z:ATAACAGCCT UY:Z:FFFFFFFFFF UB:Z:ATAACAGCCT A01177:64:HTJ7KDSX3:2:2354:1262:14074 163 chr1 10507 255 20S127M4S = 10544 150 AGGTGGGGAATAATCATCATCTGAGGAGAACTCTGCTCCGCCTTCGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCG FFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:214 nM:i:8 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:4 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z:FFFFFFFFFF UB:Z:AGGACACGCC A01177:64:HTJ7KDSX3:3:2677:17219:21887 163 chr1 10541 3 89M2D25M37S = 10544 116 CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGCCCCATATAAGAAAGGCGTGTCCTCCACTTTCTTCTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:203 nM:i:4 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:0 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z::FFFFFFFFF UB:Z:AGGACACGCC A01177:64:HTJ7KDSX3:2:2354:1262:14074 83 chr1 10544 255 86M2D25M14S = 10507 -150 AAATCTGTGCAGTGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGCCCCATATAAGAAA FFFF::FFFF,F,FFFFFFFF:FF,F,FFF,FF,F,F:FF:F:F:,FFFF:FFFF::F:FFFFFFFFFFFFFFFFF,FF:FFFF,FFFFFFFFF::FFFFF,FF:FFFFFFFFFFF:FFFFFFFF NH:i:1 HI:i:1 AS:i:214 nM:i:8 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:4 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z:FFFFFFFFFF UB:Z:AGGACACGCC A01177:64:HTJ7KDSX3:3:2677:17219:21887 83 chr1 10544 3 86M2D25M14S = 10541 -116 AAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGCCCCATATAAGAAA FFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:203 nM:i:4 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:0 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z::FFFFFFFFF UB:Z:AGGACACGCC
I really do think it's a gene expression QC thing - for example, if I merge the airr_report file with my Seurat object with minimal filtering (before filtering nFeatures/nCounts and removing doublets), ~3400 (out of 4472) barcodes merge with the metadata. It's similar with other pooled samples as well. I guess I still am unsure why half the total cells are T cells, as an example, but the TCRs can't be recovered. Definitely open to any additional thoughts/suggestions - this may just be my pure lack of knowledge on doing TCR analyses.
I'm not sure I can "rescue" calls from TRUST4 as true cells? For example, when using the 10X Multiome (RNA + ATAC), some T cells are not called as cells using GEX alone but the ATAC component saves them and allows us to interrogate their transcriptome confidently.
Yes, it is most likely due to QC filters, maybe this sample has fewer reads or more cells than other samples. For TRUST4 calls that do not overlap with Seurat, TCR call is more reliable than BCRs. Some BCRs have very high expression levels, and the leaked mRNAs can get into the empty gels and cause many BCR calls in null cells. The phenomenon is much less obvious in TCR. However, I would still be cautious about that. Since these cells have low quality, even if you put them back in the T cell population as TRUST4 calls them, it may mess up the clustering and other downstream analysis.
You can also look at some QC plots, e.g. the distribution of nFeatures, to see whether the default cutoffs are too stringent in your data.
Thanks again for all your help @mourisl. One question I had, as I'm using the data as input to scRepertoire: as I'm running the barcoderep-to-10X.pl, I'm getting the following error:
./trust-barcoderep-to-10X.pl TRUST_1_unassigned_alignments_barcode_report.tsv filtered_contig_annotations
readline() on closed filehandle FP at ./trust-barcoderep-to-10X.pl line 60.
readline() on closed filehandle FP at ./trust-barcoderep-to-10X.pl line 61.
The output B and T files have column headers but are otherwise empty. As someone who doesn't use Perl much, do you have a suggestion on how to modify the code to make the appropriate conversion? My tentative plan is to merge the B and T files for each group of hashed samples together and merge them with the Seurat object using the createHTOContigList() function from scRepertoire. Thanks!
It seems the script could not find the TRUST_1_unassigned_alignments_barcode_report.tsv file. Is the file on the right path, or is it empty?
I'm an idiot. I just needed to give myself permission to access the barcode_report.tsv file. Thanks!
To continue the same thread with a question, the following error popped up with one of my bam files while the rest of the bam files worked without any issue:
$ ./run-trust4 -t 8 -b 5_unassigned_alignments.bam -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --barcode CB
[Wed Feb 1 17:40:21 2023] TRUST4 begins.
[Wed Feb 1 17:40:21 2023] SYSTEM CALL: /home//TRUST4/bam-extractor -b 5_unassigned_alignments.bam -t 8 -f hg38_bcrtcr.fa -o TRUST_5_unassigned_alignments_toassemble --barcode CB
[Wed Feb 1 17:40:21 2023] Start to extract candidate reads from bam file.
[Wed Feb 1 17:40:30 2023] Finish obtaining the candidate read ids.
[Wed Feb 1 17:40:40 2023] Finish extracting reads.
[Wed Feb 1 17:40:40 2023] SYSTEM CALL: /home//TRUST4/trust4 -t 8 -f hg38_bcrtcr.fa -o TRUST_5_unassigned_alignments -1 TRUST_5_unassigned_alignments_toassemble_1.fq -2 TRUST_5_unassigned_alignments_toassemble_2.fq --barcode TRUST_5_unassigned_alignments_toassemble_bc.fa
[Wed Feb 1 17:40:40 2023] SYSTEM CALL: /home//TRUST4/annotator -f human_IMGT+C.fa -a TRUST_5_unassigned_alignments_final.out -t 8 -o TRUST_5_unassigned_alignments --barcode --outputCDR3File --airrAlignment > TRUST_5_unassigned_alignments_annot.fa
Need to use -a to specify the assembly file.
system /home//TRUST4/annotator -f human_IMGT+C.fa -a TRUST_5_unassigned_alignments_final.out -t 8 -o TRUST_5_unassigned_alignments --barcode --outputCDR3File --airrAlignment > TRUST_5_unassigned_alignments_annot.fa failed: 256 at ./run-trust4 line 51.
Any suggestions? When I run the fastqs for this file, it works - just an issue with the barcodes since they're in a separate file. So I figured I'd run the bam file instead. Thanks!
This BAM file may have no VDJ reads, so TRUST4 did not output the files needed by annotator and caused the crash.
Hmm, it's all human PBMCs. The interesting thing is that running TRUST4 on the fastq files does seem to generate reads:
The only problem I had there was that I wasn't sure how to reference the two feature barcode fastqs (see post 1 in this thread; since this was 5 samples hashed together) on the command line. So, the cell-id/barcode column is empty for all reads. That's why I switched over to running TRUST4 on all the BAM files instead. The following is what this BAM files looks like:
samtools view 5_unassigned_alignments.bam | head -n 5
A00758:274:HTNK5DSX3:3:1677:2031:36292 163 chr1 10507 255 19S132M = 10539 146 GGTGGGGAATAATCATCATCTGAGGAGAACTCTGCTCCGCCTTCGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGC FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:216 nM:i:10 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1 RE:A:I xf:i:4 CR:Z:TCAGGTATCAAACAAG CY:Z:FFFFFF:FFFFFFFFF CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FF:FFFFF:F
UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:1:1462:20735:3615 163 chr1 10507 255 9S123M2D17M2S = 10539 146 AATCATCATCTGAGGAGAACTCTGCTCCGCCTTCGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:222 nM:i:8 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1 RE:A:I xf:i:4 CR:Z:TCAGGTATCAAACAAG CY:Z:FFFFFFFFFFFFFFFF CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FFFFFFFFFF UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:3:1677:2031:36292 83 chr1 10539 255 91M2D21M13S = 10507 -146 CACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCCCCATATAAGAAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF, NH:i:1 HI:i:1 AS:i:216 nM:i:10 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1 RE:A:I xf:i:4 CR:Z:TCAGGTATCAAACAAG CY:Z:FFFFFF:FFFFFFFFF CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FF:FFFFF:F UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:1:1462:20735:3615 83 chr1 10539 255 91M2D21M13S = 10507 -146 CACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCCCCATATAAGAAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:222 nM:i:8 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1 RE:A:I xf:i:4 CR:Z:TCAGGTATCAAACAAG CY:Z:FFFFFFFFFFFFFFFF CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FFFFFFFFFF UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:2:1470:25346:18239 99 chr1 10987 1 13S112M = 11225 388 TTTCTTATATGGGAGGCGCAGAGACGCACGCCTACGGGTGGGGGTTGGTGGGGCGTGTGGTGCAGGAGCAAAGTCGCACGGCGCCAGGCTGGGGGCGGGGGGGGGGGGCGCCGTGCACGCGCAGA FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:2 AS:i:240 nM:i:10 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1 RE:A:I xf:i:0 CR:Z:TCTATTGCACCTCGTT CY:Z:FFFFFFFFFF::FFFF CB:Z:TCTATTGCACCTCGTT-1 UR:Z:AATGGATCCT UY:Z:FFFFFFFFFF UB:Z:AATGGATCCT
Does that change your thoughts on what may be causing the crash? Thanks so much!
Does the file TRUST_5_unassigned_alignments_toassemble_1.fq empty when running with BAM input?
Could you please also check whether the BAM file contains unmapped reads?
@mourisl Apologies for the delay - when I checked for unmapped reads in that BAM file, I would get errors despite the head and tail looking "normal". So, I actually re-ran CellRanger to generate a new BAM file, which went through the TRUST4 process just fine. So, in case this happens to others in the future, may be worth re-running CellRanger (for hashed samples, at least, to make getting barcodes easier).
One question I do have, as I'm working to import the data into immunarch - I've used the barcodes from the *_barcode_airr.tsv file to merge with my demultiplexed samples and subsequently break that tsv file into 5 different files (each corresponding to a sample). That file imported easily into immunarch - however, the clonal counts are not imported so I cannot do much clonal diversity analyses between conditions.
Does the "consensus_count" column correspond to # of clones? It's different than the #count column from the *_report.tsv file. If there is a certain column to use from one of the .tsv files, I wasn't sure whether to use cid and sequence_id as interchangeable for merging since not all files have the cell barcode.
Any info there would be great. Thanks again for all your incredibly responsive guidance through all of this! This has been a fantastic package to use and I anticipate using it for many more studies going ahead.
There are two scenarios, for barcode{report,airr}.tsv file, the information is split into cells, so the abundance usually means the number of the reads support this particular chain or contig.
For the report and airr file without barcode, they can be regarded as pseudo-bulk, so the abundance is the number of cells bearing the same VDJ information (this is the definition starting from v1.0.8).
Immunarch is for bulk data, so I guess they interpret the airr file in a bulk fashion. I think you can try methods like scRepertoire. Or make the consensus_count value to 1, and maybe immunarch internally will coalesce similar entires and add up their counts, which corresponds to the number of cells.
Can you share with me what is the barcode to sample demultiplexing file format look like? Perhaps I can add the functionality to TRUST4 to split sample barcodes into different files automatically. I guess this would be useful for some other users as well.
Thanks - it looks like immunarch recently upgraded to single-cell capacity as well. And for some things, like number of clones per sample, I'm able to get results:
But when I look at things like distribution of clonotype abundance by disease condition, etc, I don't get usable data since their algorithm automatically makes all "Clones" equal to 1:
In terms of the barcode to sample demultiplexing, I did the following:
Is there any way to combine the barcode_airr file with the pseudobulk ones to be able to do clonotype diversity analyses in immunarch? Or no way currently to link the barcoded files with the pseudobulk files? I'll give scRepertoire a shot today as well. Thanks!
perl trust-simplerep.pl XXX_cdr3.out --barcodeCnt --filterBarcoderep sample_barcode_report.tsv > sample_report.tsv
. This is the report file where each clonotype is the number of barcodes bearing this. This file I think is already compatible immunarch.perl trust-airr.pl sample_report.tsv origin_annot.fa --airr-align origin_airr_align.tsv > sample_airr.tsv
, where the origin is the files from the original TRUST4 run.You can do some filters at the sample_barcode_report.tsv or original_barcode_report.tsv to remove some cells, e.g. failed QC.
Thanks @mourisl - after lots of banging my head against the desk, I think I found a solution. I just need to import 2 sets of files into immunarch - the original *_barcode_airr.tsv which worked for some functions already. Next, I took your suggestion of #2 in the post above and modified it (unless this is what you were already recommending and I just misread):
Hi,
Thanks for this great tool and the great support you guys have provided here!
I had previously posted about running TRUST4 on multiplexed 10X 5' human PBMC scRNA-seq samples (https://github.com/liulab-dfci/TRUST4/issues/150). The fastqs I have looks as below:
The FB files, I believe, contain the barcodes. With that in mind, should I run TRUST4 with the following code?
./run-trust4 -t 8 -1 Sample_S1_L005_R1_001.fastq.gz -2 Sample_S1_L005_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 Sample_FB_S1_L005_R1_001.fastq.gz -2 Sample_FB_S1_L005_R2_001.fastq.gz -o output_folder
Or would you recommend just running TRUST4 on the BAM files? Additionally, is the "cell_id" column of the airr_report.tsv file equivalent to the cell barcode from 10X? For one of the multiplexed samples (containing cells from 5 different donors), the airr_report.tsv includes 4472 total observations - when I join that file with my Seurat object metadata (after QC/filtering) using "cell_id" as the anchor, I only have 433 cells left. It doesn't make sense to me, considering that these are PBMCs, with ~18,000 singlets, and a fair number of T- and B-cells. This latter was based on running TRUST4 on the BAM file; hence, I'm re-running TRUST4 on the raw fastqs.
Thanks!