liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

The Clonetype analysis with the Hashtag Oligomer Barcodes in it #196

Open CaiguanxiDeng opened 1 year ago

CaiguanxiDeng commented 1 year ago

Hello, I am currently trying to analyse the clonotype information from a pooled BCR sample. DNA barcodes were used to mark each cell's sample of origin within a single 5' VDJ scRNA-seq capture (we used the 10x BCR kit), which yields a GEL, a CST and a BCR library. Is it possible to differentiate the sample origins by DNA barcode after running the data through TRUST4?

mourisl commented 1 year ago

I'm not familiar with the data format. Is the cell barcode sufficient to infer the sample ID, or is there any other field in the read corresponding to the sample id? Thank you.

CaiguanxiDeng commented 1 year ago

Thanks for the answer. The GEL library carries the gene expression (GEX) data, BCR is the VDJ library, and CSP (not CST) is the feature barcode library, which carries both the cell feature barcodes and the Hashtag barcodes. The CSP library is therefore used specifically for demultiplexing, i.e. differentiating the sample IDs by Hashtag barcode. The cell barcode cannot be used to classify the sample ID.

mourisl commented 1 year ago

TRUST4 currently does not support this type of sample demultiplexing. I think the best way is to split the data into sample-wise files.
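A minimal sketch of such a sample-wise split, assuming the hashtag demultiplexing has already been done with a separate tool and produced a cell-barcode-to-sample table, and assuming the cell barcode is the first 16 bases of read 1 (check this against your library chemistry). The names `parse_fastq` and `split_by_sample` are hypothetical and not part of TRUST4.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def split_by_sample(r1_lines, r2_lines, barcode_to_sample, barcode_len=16):
    """Group read pairs by sample, using the cell barcode at the start of read 1.

    barcode_to_sample is assumed to come from an external hashtag
    demultiplexing step; read pairs with an unknown barcode are dropped.
    """
    per_sample = defaultdict(list)
    for rec1, rec2 in zip(parse_fastq(r1_lines), parse_fastq(r2_lines)):
        barcode = rec1[1][:barcode_len]  # assumed barcode position
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            per_sample[sample].append((rec1, rec2))
    return dict(per_sample)
```

Each group can then be written to its own pair of FASTQ files and passed to TRUST4 as a separate run.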

An alternative approach is to treat the cell barcode plus the hashtag barcode as the real cell barcode (--readFormat bc:x:x,bc:y:y, assuming they are on the same read). In the final result, you can then use part of the barcode to tell the sample ID. However, with this approach you won't be able to conduct barcode error correction with a whitelist. Hope this helps.
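As a rough illustration of reading the sample ID back from a combined barcode, here is a small sketch. It assumes the reported barcode is the cell barcode followed directly by the hashtag barcode, that the cell barcode is 16 bases long, and that you have your own hashtag-to-sample table. The function names and all of these values are hypothetical and should be checked against your actual output.

```python
def split_combined_barcode(combined, cell_len=16):
    """Split a combined barcode into (cell barcode, hashtag barcode).

    Assumes the cell barcode comes first and has a fixed length.
    """
    return combined[:cell_len], combined[cell_len:]


def sample_of(combined, hashtag_to_sample, cell_len=16):
    """Look up the sample ID for a combined barcode, or 'unassigned'."""
    _, hashtag = split_combined_barcode(combined, cell_len)
    return hashtag_to_sample.get(hashtag, "unassigned")


# Hypothetical hashtag sequences and sample names, for illustration only.
hashtags = {"ACGTACGT": "sample1", "TTGGCCAA": "sample2"}
print(sample_of("AAAACCCCGGGGTTTT" + "ACGTACGT", hashtags))
```

Because no whitelist correction is applied, a hashtag with a sequencing error falls into "unassigned" here instead of being rescued.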

CaiguanxiDeng commented 1 year ago

Thank you very much indeed. I have another question. Because our raw sequencing data is massive, it is split into 4 files (i.e. GEX_S1_L001_R1_001.fq, R2.fq; GEX_S1_L002_R1_001.fq, R2.fq, etc.). Can I run the programme as follows:

./run-trust4 -1 GEX_S1_L001_R1_001.fastq.gz -2 Sample_S1_L001_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 Sample_FB_S1_L001_R1_001.fastq.gz -2 Sample_FB_S1_L001_R2_001.fastq.gz -1 GEX_S1_L002_R1_001.fastq.gz -2 Sample_S1_L002_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 Sample_FB_S1_L002_R1_001.fastq.gz -2 Sample_FB_S1_L002_R2_001.fastq.gz (repeating) -o output_folder

or does the input need to be specified differently?

Many thanks and good day.

mourisl commented 1 year ago

Yes, you can use multiple -1, -2 to specify the files.

You can simplify the command if the file names follow a known pattern, such as the lane names. The file name can contain wildcards, so the input option can be something like -1 GEX_S1_L00*_R1_001.fastq.gz -2 GEX_S1_L00*_R2_001.fastq.gz
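If you would rather list the lane files explicitly, the repeated -1/-2 pairs can also be generated by a small script. The sketch below only builds the argument list and uses only the options that already appear in this thread (-1, -2, -f, --ref, -o). The helper name `build_trust4_args` is hypothetical, and it assumes mates differ only by `_R1_` versus `_R2_` in the file name.

```python
def build_trust4_args(filenames, ref_fa, ref_annot, out_prefix):
    """Build a run-trust4 argument list with one -1/-2 pair per lane."""
    r1 = sorted(f for f in filenames if "_R1_" in f)
    r2 = sorted(f for f in filenames if "_R2_" in f)
    if len(r1) != len(r2):
        raise ValueError("unequal number of R1 and R2 files")
    for first, second in zip(r1, r2):
        # Each R1 file must have a matching R2 file from the same lane.
        if first.replace("_R1_", "_R2_") != second:
            raise ValueError(f"unpaired files: {first} / {second}")

    args = ["./run-trust4", "-f", ref_fa, "--ref", ref_annot]
    for first, second in zip(r1, r2):
        args += ["-1", first, "-2", second]
    args += ["-o", out_prefix]
    return args


files = [
    "GEX_S1_L001_R1_001.fastq.gz", "GEX_S1_L001_R2_001.fastq.gz",
    "GEX_S1_L002_R1_001.fastq.gz", "GEX_S1_L002_R2_001.fastq.gz",
]
print(" ".join(build_trust4_args(files, "hg38_bcrtcr.fa",
                                 "human_IMGT+C.fa", "output_folder")))
```

The resulting list can be passed to `subprocess.run` as is. Checking the R1/R2 pairing up front guards against mismatched mates from different lanes.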

CaiguanxiDeng commented 1 year ago

I appreciate the answer, thank you very much.

CaiguanxiDeng commented 1 year ago

Another small question: how long does the programme usually take to run on about 30 GB of data?

mourisl commented 1 year ago

It depends on the number of reads from the VDJ region. I would guess a few hours should be sufficient. If it is too slow, you can add the option "--repseq", which may lose some sensitivity but runs much faster.