Open CaiguanxiDeng opened 1 year ago
I'm not familiar with the data format. Is the cell barcode sufficient to infer the sample ID, or is there any other field in the read corresponding to the sample id? Thank you.
Thanks for the answer. The GEL library carries the gene expression (GEX), BCR is the VDJ library, and CSP (not CST) is the feature barcode library, which carries both the cell feature barcodes and the hashtag barcodes. So the CSP library is used specifically for demultiplexing, i.e. differentiating the sample IDs by hashtag barcode. The cell barcode alone cannot be used for sample ID classification.
TRUST4 currently does not support this type of sample demultiplexing. I think the best way is to split the data into sample-wise files.
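A minimal sketch of the sample-wise split suggested above, not a TRUST4 feature. It assumes that a cell-barcode-to-sample table (`cell_to_sample`, hypothetical) has already been obtained by demultiplexing the hashtag library with some other tool, and that the cell barcode is the first 16 bases of R1. Both are assumptions to check against your own chemistry. The records are held in memory here for illustration only; real data would be streamed to per-sample files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator line, qualities.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group an iterable of FASTQ lines into 4-line records."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times yields consecutive 4-tuples.
    yield from zip(stripped, stripped, stripped, stripped)


def split_by_sample(
    r1_records: Iterable[Record],
    r2_records: Iterable[Record],
    cell_to_sample: Dict[str, str],  # hypothetical table from hashtag demultiplexing
    cell_bc_len: int = 16,           # assumption: barcode = first 16 bases of R1
) -> Dict[str, List[Tuple[Record, Record]]]:
    """Assign each read pair to a sample via the cell barcode on R1."""
    per_sample: Dict[str, List[Tuple[Record, Record]]] = defaultdict(list)
    for r1, r2 in zip(r1_records, r2_records):
        cell_bc = r1[1][:cell_bc_len]
        sample = cell_to_sample.get(cell_bc)
        if sample is not None:  # pairs with an unassigned barcode are dropped
            per_sample[sample].append((r1, r2))
    return dict(per_sample)
```

Each per-sample group can then be written to its own pair of FASTQ files and passed to TRUST4 separately.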
An alternative approach is to treat the cell barcode plus the hashtag barcode as the real cell barcode (--readFormat bc:x:x,bc:y:y, assuming they are on the same read). In the final result, you can then use part of the barcode to tell the sample ID. Note that with this approach you won't be able to do whitelist-based barcode error correction. Hope this helps.
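As a rough illustration of "use part of the barcode to tell the sample ID", the sketch below splits a combined barcode string back into its two parts and looks up the sample. It assumes the combined barcode is a plain concatenation of a 16-base cell barcode followed by the hashtag sequence. The length and the `hashtag_to_sample` table are hypothetical and must match your own design. Because no whitelist correction is applied, only exact hashtag matches are recognised.

```python
from typing import Dict, Optional, Tuple

CELL_BC_LEN = 16  # assumption: length of the cell barcode part


def split_combined_barcode(
    combined: str, cell_bc_len: int = CELL_BC_LEN
) -> Tuple[str, str]:
    """Split 'cell barcode + hashtag' into (cell_barcode, hashtag)."""
    return combined[:cell_bc_len], combined[cell_bc_len:]


def sample_for_barcode(
    combined: str,
    hashtag_to_sample: Dict[str, str],  # hypothetical hashtag -> sample table
    cell_bc_len: int = CELL_BC_LEN,
) -> Optional[str]:
    """Return the sample ID for a combined barcode, or None if the hashtag
    part does not match any known hashtag exactly."""
    _, hashtag = split_combined_barcode(combined, cell_bc_len)
    return hashtag_to_sample.get(hashtag)
```

Applied to the barcode field of each reported cell, this gives a per-cell sample label. Barcodes whose hashtag part contains a sequencing error come back as None.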
Thank you very much indeed. I have another question. Since our raw sequencing data is massive, it is split across 4 files (i.e. GEX_S1_L001_R1_001.fq, R2.fq; GEX_S1_L002_R1_001.fq, R2.fq, etc.). Can I run the program as follows: ./run-trust4 -1 GEX_S1_L001_R1_001.fastq.gz -2 Sample_S1_L001_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 Sample_FB_S1_L001_R1_001.fastq.gz -2 Sample_FB_S1_L001_R2_001.fastq.gz -1 GEX_S1_L002_R1_001.fastq.gz -2 Sample_S1_L002_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 Sample_FB_S1_L002_R1_001.fastq.gz -2 Sample_FB_S1_L002_R2_001.fastq.gz (repeating) -o output_folder, or is a different input layout required?
Many thanks and good day.
Yes, you can use multiple -1, -2 to specify the files.
You may simplify the command if you know the patterns of the file names, such as the lane names. The file name can contain wildcards, so the input option can be something like -1 GEX_S1_L00*_R1_001.fastq.gz -2 GEX_S1_L00*_R2_001.fastq.gz
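If the file names do not fit a single pattern, the repeated -1/-2 options can also be generated programmatically. The helper below is a sketch only: `build_trust4_args` is a hypothetical name, it assumes each R2 file has the same name as its R1 file with `_R1_` replaced by `_R2_`, and it uses only the options that appear in this thread (-1, -2, -f, --ref, -o).

```python
from typing import Iterable, List


def build_trust4_args(
    r1_files: Iterable[str],
    coord_fasta: str,  # value for -f, e.g. hg38_bcrtcr.fa
    ref_fasta: str,    # value for --ref, e.g. human_IMGT+C.fa
    output: str,       # value for -o
) -> List[str]:
    """Build a run-trust4 argument list with one -1/-2 pair per lane."""
    args = ["./run-trust4", "-f", coord_fasta, "--ref", ref_fasta]
    for r1 in sorted(r1_files):  # sort so lanes are listed in a stable order
        if "_R1_" not in r1:
            raise ValueError(f"cannot derive the R2 name from {r1!r}")
        r2 = r1.replace("_R1_", "_R2_")  # assumption: mates differ only here
        args += ["-1", r1, "-2", r2]
    args += ["-o", output]
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Passing a list rather than a shell string avoids quoting problems with characters such as `+` in `human_IMGT+C.fa`.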
I appreciate the answer, thank you very much.
One more small question: how long does it usually take to run the program on about 30 GB of data?
It depends on the number of reads from the VDJ region. I guess a few hours should be sufficient. If it is too slow, you can add the option "--repseq", which may lose some sensitivity but runs much faster.
Hello, I am currently trying to analyse the clonotype information from a pooled BCR sample. DNA barcodes were used to mark the sample of origin within a single 5' VDJ scRNA-seq capture (we used the 10x BCR kit), which contains a GEL, a CST and a BCR library. Is it possible to differentiate the sample origins by DNA barcode after running TRUST4?