liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
260 stars 46 forks source link

Issue with hashed samples and merging TRUST4 results with Seurat object #177

Open learning-MD opened 1 year ago

learning-MD commented 1 year ago

Hi,

Thanks for this great tool and the great support you guys have provided here!

I had previously posted about running TRUST4 on multiplexed 10X 5' human PBMC scRNA-seq samples (https://github.com/liulab-dfci/TRUST4/issues/150). The fastqs I have looks as below:

image

The FB files, I believe, contain the barcodes. With that in mind, should I run TRUST4 with the following code?

./run-trust4 -t 8 -1 Sample_S1_L005_R1_001.fastq.gz -2 Sample_S1_L005_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 Sample_FB_S1_L005_R1_001.fastq.gz -2 Sample_FB_S1_L005_R2_001.fastq.gz -o output_folder

Or would you recommend just running TRUST4 on the BAM files? Additionally, is the "cell_id" column of the airr_report.tsv file equivalent to the cell barcode from 10X? For one of the multiplexed samples (containing cells from 5 different donors), the airr_report.tsv includes 4472 total observations - when I join that file with my Seurat object metadata (after QC/filtering) using "cell_id" as the anchor, I only have 433 cells left. It doesn't make sense to me, considering that these are PBMCs, with ~18,000 singlets, and a fair number of T- and B-cells. This latter was based on running TRUST4 on the BAM file; hence, I'm re-running TRUST4 on the raw fastqs.

Thanks!

mourisl commented 1 year ago

You shall use --barcode option to specify the barcode file, and --barcodeRange to specify which positions in the barcode file correspond to the barcode sequence. Do you mean the barcode sequence is split into to two files, e.g. the first 8bp in FB_R1, the second 8bp in FB_R2? If the barcode is quite complex, I think starting from BAM file is better as the BAM file has the processed barcodes.

Yes, the "cell_id" in the airr format is the barcode.

I think the low sensitivity could be that since each lane sequences multiple samples, the number of reads in each cell might be less than a typical scRNA-seq data, which decreases the sensitivity of obtaining CDR3 information.

learning-MD commented 1 year ago

Thanks @mourisl! We had targeted 50,000 reads per cell for the gene expression libraries and 5,000 reads per cell for the feature barcode libraries. I would think that that should make up for the multiple samples per lane? I'm also a little confused as to why the airr_report would have >4,000 observations with unique cell_ids - I guess it is theoretically possible that all the QC/filtering I did on the gene expression side (Seurat) could have filtered out most of those cells.

I'll try re-running the BAM and maybe re-running CellRanger with the -include-introns option and see that affects things any differently. I'm open to any other thoughts/suggestions.

mourisl commented 1 year ago

Yes, many of the barcodes containing CDR3 failed the QC test. The include-intron probably won't change the BAM file and TRUST4's results. But it could change the read count/cell and the QC results.

I'm not familiar with the multiome kit sequencing both gene expression and surface marker (I think this Is you case), is the RNA in the gene expression from the nucleus or they are poly-a mRNAs?

learning-MD commented 1 year ago

Thanks @mourisl - it's single-cell RNA-seq. We did not do nucleus-sequencing. If it helps, it wasn't the Multiome kit (that's the snRNA- + snATAC-seq combo). We just added hashtag oligos to samples prior to pooling them and doing the usual 5' single-cell capture and gene expression (https://satijalab.org/seurat/articles/hashing_vignette.html).

That's a bummer to hear the total number of CDR3-containing barcodes (after QC) will just be a fraction of the cells - it's good to know, as we may do the combo GEX + 10X V(D)J for additional experiments in the future to increase the yield.

As a last, potentially dumb question, does using a BAM file from CellRanger 7.0 affect the results if I'm trying to merge with gene expression data from CellRanger 6.1.0 output? I'd imagine it wouldn't do anything significantly since we can recreate the fastq from the BAM file. But, just due to immediate availability of samples I had, I ran TRUST4 on the CellRanger 7.0 output while my completed GEX analyses were initially done on the 6.1.2 outputs - not sure how much I should worry about mismatch here because of that.

Thank you so much for all your help and prompt support!

mourisl commented 1 year ago

I don't think the BAM file will differ much between CellRanger 7.0 and CellRanger 6.1.0 . A few things you might want to check is whether the BAM file contains unmapped reads and whether it contains some strange contigs that correspond to the VDJ region.

mourisl commented 1 year ago

Could you please show me a few lines of the BAM file just to make sure the barcode format is right?

learning-MD commented 1 year ago

Sorry for the delay. Below is what I see:

~$ samtools view samples.bam | head -n 5

A01177:64:HTJ7KDSX3:2:1407:4282:35900 163 chr1 10189 255 12S64M169913N75M = 180283 170206 CCTAACCCTAACCCCTAACCCTAATCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAAACCCTAACCCAAAAACCCTAACCATAACCCTAACCCTAACCCTAAACCTAACCCTAATCCTTACCCTTACCCTAACC FFFFFFFFFFFFFFFF:FFFF,FFFFFFFFFFF,FF,FFFFFFFFFFFFF:FFFFFFF::FFFF,FFFFF,FFFFFFF,FF:,F,FFFF::F:FFFFF,F:FFFF,FFFFF,:F,::,F,F:,,F,:F,:,,F:,::F,FF,FFF,,,F:F NH:i:1 HI:i:1 AS:i:231 nM:i:8 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:4 CR:Z:GTACGTACAGCCACCA CY:Z:FF:FFFFFFFFFFFFF CB:Z:GTACGTACAGCCACCA-1 UR:Z:ATAACAGCCT UY:Z:FFFFFFFFFF UB:Z:ATAACAGCCT A01177:64:HTJ7KDSX3:2:2354:1262:14074 163 chr1 10507 255 20S127M4S = 10544 150 AGGTGGGGAATAATCATCATCTGAGGAGAACTCTGCTCCGCCTTCGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCG FFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:214 nM:i:8 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:4 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z:FFFFFFFFFF UB:Z:AGGACACGCC A01177:64:HTJ7KDSX3:3:2677:17219:21887 163 chr1 10541 3 89M2D25M37S = 10544 116 CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGCCCCATATAAGAAAGGCGTGTCCTCCACTTTCTTCTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:203 nM:i:4 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:0 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z::FFFFFFFFF UB:Z:AGGACACGCC A01177:64:HTJ7KDSX3:2:2354:1262:14074 83 chr1 10544 255 86M2D25M14S = 10507 -150 AAATCTGTGCAGTGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGCCCCATATAAGAAA FFFF::FFFF,F,FFFFFFFF:FF,F,FFF,FF,F,F:FF:F:F:,FFFF:FFFF::F:FFFFFFFFFFFFFFFFF,FF:FFFF,FFFFFFFFF::FFFFF,FF:FFFFFFFFFFF:FFFFFFFF NH:i:1 HI:i:1 AS:i:214 nM:i:8 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:4 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z:FFFFFFFFFF UB:Z:AGGACACGCC A01177:64:HTJ7KDSX3:3:2677:17219:21887 83 chr1 10544 3 86M2D25M14S = 10541 -116 AAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGCCCCATATAAGAAA FFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:203 nM:i:4 RG:Z:8256-SB-1:0:1:HTJ7KDSX3:1 RE:A:I xf:i:0 CR:Z:GGACAGAAGAAAGTGG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GGACAGAAGAAAGTGG-1 UR:Z:AGGACACGCC UY:Z::FFFFFFFFF UB:Z:AGGACACGCC

learning-MD commented 1 year ago

I really do think it's a gene expression QC thing - for example, if I merge the airr_report file with my Seurat object with minimal filtering (before filtering nFeatures/nCounts and removing doublets), ~3400 (out of 4472) barcodes merge with the metadata. It's similar with other pooled samples as well. I guess I still am unsure why half the total cells are T cells, as an example, but the TCRs can't be recovered. Definitely open to any additional thoughts/suggestions - this may just be my pure lack of knowledge on doing TCR analyses.

I'm not sure I can "rescue" calls from TRUST4 as true cells? For example, when using the 10X Multiome (RNA + ATAC), some T cells are not called as cells using GEX alone but the ATAC component saves them and allows us to interrogate their transcriptome confidently.

mourisl commented 1 year ago

Yes, it is most likely due to QC filters, maybe this sample has fewer reads or more cells than other samples. For TRUST4 calls that do not overlap with Seurat, TCR call is more reliable than BCRs. Some BCRs have very high expression levels, and the leaked mRNAs can get into the empty gels and cause many BCR calls in null cells. The phenomenon is much less obvious in TCR. However, I would still be cautious about that. Since these cells have low quality, even if you put them back in the T cell population as TRUST4 calls them, it may mess up the clustering and other downstream analysis.

You can also look at some QC plots, e.g. the distribution of nFeatures, to see whether the default cutoffs are too stringent in your data.

learning-MD commented 1 year ago

Thanks again for all your help @mourisl. One question I had, as I'm using the data as input to scRepertoire: as I'm running the barcoderep-to-10X.pl, I'm getting the following error:

./trust-barcoderep-to-10X.pl TRUST_1_unassigned_alignments_barcode_report.tsv filtered_contig_annotations

readline() on closed filehandle FP at ./trust-barcoderep-to-10X.pl line 60.
readline() on closed filehandle FP at ./trust-barcoderep-to-10X.pl line 61.

The output B and T files have column headers but are otherwise empty. As someone who doesn't use Perl much, do you have a suggestion on how to modify the code to make the appropriate conversion? My tentative plan is to merge the B and T files for each group of hashed samples together and merge them with the Seurat object using the createHTOContigList() function from scRepertoire. Thanks!

mourisl commented 1 year ago

It seems the script could not find the TRUST_1_unassigned_alignments_barcode_report.tsv file. Is the file on the right path, or is it empty?

learning-MD commented 1 year ago

I'm an idiot. I just needed to give myself permission to access the barcode_report.tsv file. Thanks!

learning-MD commented 1 year ago

To continue the same thread with a question, the following error popped up with one of my bam files while the rest of the bam files worked without any issue:

$ ./run-trust4 -t 8 -b 5_unassigned_alignments.bam -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --barcode CB

[Wed Feb  1 17:40:21 2023] TRUST4 begins.
[Wed Feb  1 17:40:21 2023] SYSTEM CALL: /home//TRUST4/bam-extractor -b 5_unassigned_alignments.bam -t 8 -f hg38_bcrtcr.fa -o TRUST_5_unassigned_alignments_toassemble  --barcode CB
[Wed Feb  1 17:40:21 2023] Start to extract candidate reads from bam file.
[Wed Feb  1 17:40:30 2023] Finish obtaining the candidate read ids.
[Wed Feb  1 17:40:40 2023] Finish extracting reads.
[Wed Feb  1 17:40:40 2023] SYSTEM CALL: /home//TRUST4/trust4  -t 8 -f hg38_bcrtcr.fa -o TRUST_5_unassigned_alignments -1 TRUST_5_unassigned_alignments_toassemble_1.fq -2 TRUST_5_unassigned_alignments_toassemble_2.fq --barcode TRUST_5_unassigned_alignments_toassemble_bc.fa
[Wed Feb  1 17:40:40 2023] SYSTEM CALL: /home//TRUST4/annotator -f human_IMGT+C.fa -a TRUST_5_unassigned_alignments_final.out -t 8 -o TRUST_5_unassigned_alignments --barcode --outputCDR3File --airrAlignment > TRUST_5_unassigned_alignments_annot.fa
Need to use -a to specify the assembly file.
system /home//TRUST4/annotator -f human_IMGT+C.fa -a TRUST_5_unassigned_alignments_final.out -t 8 -o TRUST_5_unassigned_alignments --barcode --outputCDR3File --airrAlignment > TRUST_5_unassigned_alignments_annot.fa failed: 256 at ./run-trust4 line 51.

Any suggestions? When I run the fastqs for this file, it works - just an issue with the barcodes since they're in a separate file. So I figured I'd run the bam file instead. Thanks!

mourisl commented 1 year ago

This BAM file may have no VDJ reads, so TRUST4 did not output the files needed by annotator and caused the crash.

learning-MD commented 1 year ago

Hmm, it's all human PBMCs. The interesting thing is that running TRUST4 on the fastq files does seem to generate reads:

image

The only problem I had there was that I wasn't sure how to reference the two feature barcode fastqs (see post 1 in this thread; since this was 5 samples hashed together) on the command line. So, the cell-id/barcode column is empty for all reads. That's why I switched over to running TRUST4 on all the BAM files instead. The following is what this BAM files looks like:

samtools view 5_unassigned_alignments.bam | head -n 5

A00758:274:HTNK5DSX3:3:1677:2031:36292  163     chr1    10507   255     19S132M =       10539   146     GGTGGGGAATAATCATCATCTGAGGAGAACTCTGCTCCGCCTTCGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGC FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1  HI:i:1  AS:i:216        nM:i:10 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1 RE:A:I   xf:i:4  CR:Z:TCAGGTATCAAACAAG   CY:Z:FFFFFF:FFFFFFFFF   CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FF:FFFFF:F
UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:1:1462:20735:3615  163     chr1    10507   255     9S123M2D17M2S   =       10539   146     AATCATCATCTGAGGAGAACTCTGCTCCGCCTTCGCAGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF NH:i:1  HI:i:1  AS:i:222        nM:i:8  RG:Z:8256-SB-8:0:1:HTNK5DSX3:1  RE:A:I  xf:i:4  CR:Z:TCAGGTATCAAACAAG   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FFFFFFFFFF UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:3:1677:2031:36292  83      chr1    10539   255     91M2D21M13S     =       10507   -146    CACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCCCCATATAAGAAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,    NH:i:1  HI:i:1  AS:i:216        nM:i:10 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1  RE:A:I  xf:i:4  CR:Z:TCAGGTATCAAACAAG  CY:Z:FFFFFF:FFFFFFFFF    CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FF:FFFFF:F UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:1:1462:20735:3615  83      chr1    10539   255     91M2D21M13S     =       10507   -146    CACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCCCCATATAAGAAA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    NH:i:1  HI:i:1  AS:i:222        nM:i:8  RG:Z:8256-SB-8:0:1:HTNK5DSX3:1  RE:A:I  xf:i:4  CR:Z:TCAGGTATCAAACAAG  CY:Z:FFFFFFFFFFFFFFFF    CB:Z:TCAGGTATCAAACAAG-1 UR:Z:CGTGTCCTAC UY:Z:FFFFFFFFFF UB:Z:CGTGTCCTAC
A00758:274:HTNK5DSX3:2:1470:25346:18239 99      chr1    10987   1       13S112M =       11225   388     TTTCTTATATGGGAGGCGCAGAGACGCACGCCTACGGGTGGGGGTTGGTGGGGCGTGTGGTGCAGGAGCAAAGTCGCACGGCGCCAGGCTGGGGGCGGGGGGGGGGGGCGCCGTGCACGCGCAGA   FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NH:i:3   HI:i:2  AS:i:240        nM:i:10 RG:Z:8256-SB-8:0:1:HTNK5DSX3:1  RE:A:I  xf:i:0  CR:Z:TCTATTGCACCTCGTT   CY:Z:FFFFFFFFFF::FFFF   CB:Z:TCTATTGCACCTCGTT-1 UR:Z:AATGGATCCT UY:Z:FFFFFFFFFF UB:Z:AATGGATCCT

Does that change your thoughts on what may be causing the crash? Thanks so much!

mourisl commented 1 year ago

Does the file TRUST_5_unassigned_alignments_toassemble_1.fq empty when running with BAM input?

Could you please also check whether the BAM file contains unmapped reads?

learning-MD commented 1 year ago

@mourisl Apologies for the delay - when I checked for unmapped reads in that BAM file, I would get errors despite the head and tail looking "normal". So, I actually re-ran CellRanger to generate a new BAM file, which went through the TRUST4 process just fine. So, in case this happens to others in the future, may be worth re-running CellRanger (for hashed samples, at least, to make getting barcodes easier).

One question I do have, as I'm working to import the data into immunarch - I've used the barcodes from the *_barcode_airr.tsv file to merge with my demultiplexed samples and subsequently break that tsv file into 5 different files (each corresponding to a sample). That file imported easily into immunarch - however, the clonal counts are not imported so I cannot do much clonal diversity analyses between conditions.

Does the "consensus_count" column correspond to # of clones? It's different than the #count column from the *_report.tsv file. If there is a certain column to use from one of the .tsv files, I wasn't sure whether to use cid and sequence_id as interchangeable for merging since not all files have the cell barcode.

Any info there would be great. Thanks again for all your incredibly responsive guidance through all of this! This has been a fantastic package to use and I anticipate using it for many more studies going ahead.

mourisl commented 1 year ago

There are two scenarios, for barcode{report,airr}.tsv file, the information is split into cells, so the abundance usually means the number of the reads support this particular chain or contig.

For the report and airr file without barcode, they can be regarded as pseudo-bulk, so the abundance is the number of cells bearing the same VDJ information (this is the definition starting from v1.0.8).

Immunarch is for bulk data, so I guess they interpret the airr file in a bulk fashion. I think you can try methods like scRepertoire. Or make the consensus_count value to 1, and maybe immunarch internally will coalesce similar entires and add up their counts, which corresponds to the number of cells.

Can you share with me what is the barcode to sample demultiplexing file format look like? Perhaps I can add the functionality to TRUST4 to split sample barcodes into different files automatically. I guess this would be useful for some other users as well.

learning-MD commented 1 year ago

Thanks - it looks like immunarch recently upgraded to single-cell capacity as well. And for some things, like number of clones per sample, I'm able to get results:

image

But when I look at things like distribution of clonotype abundance by disease condition, etc, I don't get usable data since their algorithm automatically makes all "Clones" equal to 1:

image

In terms of the barcode to sample demultiplexing, I did the following:

  1. Made a "barcode" column in the metadata of my Seurat objects before integration (but after QC). This was on each individual hashed sample after identifying each HTO to the sample ID. That way, the original barcode is preserved (as integration sometimes adds numbers to the end or modifies barcodes in some way).
  2. Subsetted the metadata for each batch of hashed samples from the final clustered Seurat object.
  3. Since I had the original barcode as a column, I used that as the anchor with the *_barcode_airr.tsv file's "cell_id" column.
  4. The result of the above inner_join, I split by Sample and created new tsv files split by the Sample.

Is there any way to combine the barcode_airr file with the pseudobulk ones to be able to do clonotype diversity analyses in immunarch? Or no way currently to link the barcoded files with the pseudobulk files? I'll give scRepertoire a shot today as well. Thanks!

mourisl commented 1 year ago
  1. Instead of splitting the results from the barcode_airr.tsv file, you can split them from the barcode_report.tsv file, then you will get 5 sample_barcode_report.tsv files.
  2. Run perl trust-simplerep.pl XXX_cdr3.out --barcodeCnt --filterBarcoderep sample_barcode_report.tsv > sample_report.tsv. This is the report file where each clonotype is the number of barcodes bearing this. This file I think is already compatible immunarch.
  3. If you need the psuedo-bulk AIRR file for one sample, you can run perl trust-airr.pl sample_report.tsv origin_annot.fa --airr-align origin_airr_align.tsv > sample_airr.tsv , where the origin is the files from the original TRUST4 run.

You can do some filters at the sample_barcode_report.tsv or original_barcode_report.tsv to remove some cells, e.g. failed QC.

learning-MD commented 1 year ago

Thanks @mourisl - after lots of banging my head against the desk, I think I found a solution. I just need to import 2 sets of files into immunarch - the original *_barcode_airr.tsv which worked for some functions already. Next, I took your suggestion of #2 in the post above and modified it (unless this is what you were already recommending and I just misread):

  1. Used the larger, hashed file's XXX_cdr3.out (containing data for all 5 samples).
  2. I have individual sample barcode_report.tsv files that I already generated. I'm running each one individually. So, running that code 5 times to generate 5 sample_report.tsv files that correspond to the single larger TRUST4 output (from the original run).
  3. Then, I'm importing the latter files as a separate immunarch dataset with the same metadata, so I can do all the clonotype analyses there. I think that should work - will give it a try and update you tomorrow. Thanks for all the help!