liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Possibility of extracting subset of fq according to barcodeTranslate? #301

Open yuyuleung opened 2 months ago

yuyuleung commented 2 months ago

Hello Dr. Liu,

Thank you for creating such a great tool and keeping it updated. I am trying to assemble TCRs from my spatial transcriptomics data in single-cell mode. Because my data sets are always large, I have to split them into several parts and assemble the parts in parallel. However, splitting the fastq files is slow. I noticed that the read-extraction step depends on the barcodeTranslate file provided, but when I give it only a subset of the barcodeTranslate file together with the complete fq file, a barcode in the fastq file is reported as missing and an error occurs.

Therefore, I was wondering if it is possible to modify the FastqExtractor function to skip reads that are not in the barcodeTranslate file. This would allow me to skip the splitting step before assembly.

Thank you very much for your help!

Best wishes, Yuyu

mourisl commented 2 months ago

Could you please elaborate on why you provide a subset of the barcodeTranslate file? I can't recall your read layout. Is this a split barcode, so that there is no "full-barcode" whitelist?

yuyuleung commented 2 months ago

Dear Dr. Liu,

Thanks for your answer.

As I mentioned, my spatial data sets are always large; in other words, there are too many spots (cells) and too many reads per spot (cell). Assembling all spots at once therefore always takes too long. So I first split the reads, that is, I split the complete fastq file into many parts, so that many assemblies of relatively small data sets can run in parallel.

The time-consuming step is now splitting the fastq file into many parts. Therefore, I am wondering whether I can use the FastqExtractor function to do this without it taking too much time. In other words, could I just split my barcodeTranslate file into many parts and have FastqExtractor extract the corresponding reads from the raw fq file?
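The idea of splitting the barcodeTranslate file into parts can be sketched as below. This is only an illustration under assumptions that are not confirmed in this thread: the file is taken to be whitespace-separated with one barcode entry per line, and `spot_col` (a hypothetical parameter) names the column holding the spot/cell ID. The function name `split_translate_lines` is also hypothetical. Entries are grouped by spot so that all barcodes of one spot end up in the same part.

```python
from collections import OrderedDict
from typing import Dict, List


def split_translate_lines(lines: List[str], n_parts: int, spot_col: int = 1) -> List[List[str]]:
    """Split translation-table lines into n_parts, keeping each spot together.

    Assumption (not from TRUST4's documentation): columns are separated by
    whitespace and column `spot_col` identifies the spot/cell.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")

    # Group lines by spot ID, preserving first-seen order of the spots.
    groups: Dict[str, List[str]] = OrderedDict()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) <= spot_col:
            raise ValueError(f"line has no column {spot_col}: {line!r}")
        groups.setdefault(fields[spot_col], []).append(line)

    # Assign whole spots to parts in round-robin order.
    parts: List[List[str]] = [[] for _ in range(n_parts)]
    for i, group in enumerate(groups.values()):
        parts[i % n_parts].extend(group)
    return parts
```

Each returned part could then be written to its own file and used for one extraction run; whether that is faster than splitting the fastq directly depends on how long each pass over the full fastq takes.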

Btw, I am not providing a whitelist to correct the barcodes; instead, I just provide a barcodeTranslate file that includes all possible barcodes of each spot/cell.

I am not sure whether I have explained this well enough.

Thank you for your attention.

Best wishes, Yuyu

mourisl commented 2 months ago

I just pushed a new update to the "dev" branch. It adds a new option, "--skipBarcodeErrorRead", to the fastq-extractor program. With this option, fastq-extractor skips reads whose barcode has an uncorrectable error or is not in the translation table. Is this what you need? If it works fine on your data, I will merge it into the master branch. Thank you!
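The skipping behaviour described above can be illustrated with a small sketch. It is not the fastq-extractor implementation and does no barcode error correction. It assumes, purely for illustration, that the barcode occupies the first `barcode_len` bases of each read; the names `iter_fastq` and `keep_known_barcodes` are hypothetical.

```python
import io
from typing import Iterable, Iterator, Set, TextIO, Tuple

# (header, sequence, separator, quality)
Record = Tuple[str, str, str, str]


def iter_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline()
        sep = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               sep.rstrip("\n"), qual.rstrip("\n"))


def keep_known_barcodes(records: Iterable[Record], known: Set[str],
                        barcode_len: int) -> Iterator[Record]:
    """Skip, rather than fail on, reads whose barcode is not in `known`.

    Assumption: the barcode is the first `barcode_len` bases of the read.
    """
    for rec in records:
        if rec[1][:barcode_len] in known:
            yield rec


if __name__ == "__main__":
    fastq_text = "@r1\nAAAACGT\n+\nIIIIIII\n@r2\nCCCCCGT\n+\nIIIIIII\n"
    for rec in keep_known_barcodes(iter_fastq(io.StringIO(fastq_text)), {"AAAA"}, 4):
        print(rec[0])
```

Because unknown barcodes are dropped silently, a subset of the translation table selects a subset of the reads in a single pass over the fastq.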

yuyuleung commented 2 months ago

Hello Dr. Liu,

Thank you so much for your help. I will test it later and give you feedback as soon as possible.

Best wishes, Yuyu

yuyuleung commented 2 months ago

Hello Dr. Liu,

Should I add the "--skipBarcodeErrorRead" option directly to run-trust4?

Thanks. Yuyu

mourisl commented 2 months ago

That option is for fastq-extractor only; it is not in the run-trust4 wrapper.

yuyuleung commented 2 months ago

Hello Dr. Liu,

Thank you so much for your effort. I have tested it on my data (434M reads in total), extracting the reads of two spots (35 reads). The extraction step took around 2 hours, which I think is still a little slow. I will keep trying different data sets and check the efficiency. Do you also have any suggestions on how I can split my data set efficiently into several parts in order to speed up the assembly?

Thank you so much!

Best wishes, Yuyu

mourisl commented 2 months ago

Is this TCR-targeted sequencing data, or is it gene expression data?

yuyuleung commented 2 months ago

It is TCR-targeted sequencing data.

I have many different TCR/BCR-targeted sequencing data sets. Depending on the enrichment efficiency, I get around 50M-200M TCR/BCR reads.

What I tested before (434M reads) was the raw data (including adapters and reads from other genes). I have just tested the clean reads (around 50M), again extracting 35 reads. It took only 10 minutes. So I think extraction is more efficient with smaller data?

I am still testing with larger data (e.g. 200M TCR-targeted reads).

Thanks a lot again! Yuyu

mourisl commented 2 months ago

That makes sense: 50M is about 1/9 of 434M, so if 50M takes about 10 minutes, 434M would take about 1.5 hours. I feel that for 50M reads, TRUST4 without splitting might be fast enough?
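The estimate above is a linear extrapolation, which can be written out as a short calculation. It assumes extraction time is proportional to the number of reads scanned; the function name `estimate_minutes` is hypothetical, and the reference numbers are the ones reported in this thread.

```python
def estimate_minutes(n_reads: int, ref_reads: int = 50_000_000,
                     ref_minutes: float = 10.0) -> float:
    """Scale a measured extraction time linearly with the read count.

    Assumption: run time is proportional to the number of reads scanned.
    """
    return ref_minutes * n_reads / ref_reads


minutes = estimate_minutes(434_000_000)
print(f"{minutes:.1f} min (~{minutes / 60:.2f} h)")  # 86.8 min (~1.45 h)
```

The measured time of around 2 hours for the raw 434M-read file is in the same range as this estimate.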

yuyuleung commented 2 months ago

Yes, it makes sense that extracting two barcodes from 50M is faster.

No, for 50M reads it takes an extremely long time to assemble everything together in sc-mode. I have tested several times, and even 10M reads still take a long time (maybe there are too many barcodes?).

Or would it be possible for me to share a test data set (around 50M reads) with you, so that you can test how to make it faster other than by splitting the data set?

Thank you so much!

Best wishes, Yuyu
