Running TRUST4 with non-10X single-cell data, barcodes are in the RG:Z headers of fastq

liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

MIT License

287 stars, 50 forks

Running TRUST4 with non-10X single-cell data, barcodes are in the RG:Z headers of fastq #277

Open LukaP-BB opened 6 months ago

LukaP-BB commented 6 months ago

I have fastq files for scDNA with barcodes extracted in the headers, in the RG:Z field.

@A01789:135:HLKCJDMXY:1:1101:1027:1047 RG:Z:CGTGCCTATTCGGACAGT
TTAAATTGGTATCAGAAGAAACCAGGGAAAGCCCCTAAGCTCCTGATCTACGATGCATCCAATCCGGAAACAGGGGTCCCATCAAGGTTCAGTGGAA
+
FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFF:FFFFFFFFFF:,FFFFF:FFFFFFF,F

Is there a way to use this information, similar to specifying the barcode field when the input is a BAM file? I couldn't find one.

If there is currently no way to do it, what would be your recommended way to specify barcodes?

I tried extracting the raw barcodes from the headers into a text file, but that doesn't seem to be the right solution.

mourisl commented 6 months ago

Currently, we don't support parsing the barcode from the header. You can extract the raw barcodes into a separate FASTA file, like this:

>A01789:135:HLKCJDMXY:1:1101:1027:1047
CGTGCCTATTCGGACAGT

I will add the feature to parse barcodes from the header in one of the next two releases.
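For readers in the same situation, the extraction described above could be sketched in Python roughly as follows. This is a minimal sketch, not part of TRUST4: the function names `read_barcodes` and `write_barcode_fasta` are hypothetical, and it assumes plain 4-line FASTQ records whose header carries the barcode as a whitespace-separated `RG:Z:` field, as in the example read from this issue. It writes the barcode records in the same order as the reads, on the assumption that the barcode file should stay in step with the read file.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_barcodes(fastq: TextIO, tag: str = "RG:Z:") -> Iterator[Tuple[str, str]]:
    """Yield (read_id, barcode) for each 4-line FASTQ record.

    Assumes the header looks like '@<read_id> RG:Z:<barcode>'.
    """
    for i, line in enumerate(fastq):
        if i % 4 != 0:  # only the first line of each record is a header
            continue
        fields = line.split()
        if not fields:  # tolerate a trailing blank line
            continue
        read_id = fields[0][1:]  # drop the leading '@'
        barcode = next(
            (f[len(tag):] for f in fields[1:] if f.startswith(tag)), None
        )
        if barcode is None:
            raise ValueError(f"no {tag} field in header: {line.strip()!r}")
        yield read_id, barcode


def write_barcode_fasta(fastq: TextIO, fasta: TextIO, tag: str = "RG:Z:") -> int:
    """Write one FASTA record per read, in read order; return the record count."""
    count = 0
    for read_id, barcode in read_barcodes(fastq, tag):
        fasta.write(f">{read_id}\n{barcode}\n")
        count += 1
    return count


# Small demonstration using the header and barcode from this issue
# (sequence and quality strings are shortened for the demo).
demo_fastq = io.StringIO(
    "@A01789:135:HLKCJDMXY:1:1101:1027:1047 RG:Z:CGTGCCTATTCGGACAGT\n"
    "TTAAATTGGT\n"
    "+\n"
    "FFFFFFFFFF\n"
)
demo_fasta = io.StringIO()
n_records = write_barcode_fasta(demo_fastq, demo_fasta)
print(demo_fasta.getvalue(), end="")
```

For real data, the `io.StringIO` objects would be replaced by opened files (for example `gzip.open(path, "rt")` for a `.fastq.gz` input). For paired-end data, extracting from R1 alone should suffice if R1 and R2 list reads in the same order.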

LukaP-BB commented 6 months ago

Thanks for your swift reply. The solution seems to work, as TRUST4 is now running.

This is a tangent to the original issue, but do you have a recommendation for the number of threads to use? I launched a test run on 1 thread, but it is taking more than 24 hours to complete on my data. Does speed scale linearly with the number of threads?

mourisl commented 6 months ago

I usually use 8 threads. I think the gain probably plateaus after 16 threads. Which step do you find TRUST4 stuck on? Which version of TRUST4 are you using?

LukaP-BB commented 5 months ago

Hi, I'm running TRUST4 v1.0.5.1 according to conda. I tried again with 20 threads, just to be sure to overshoot, and it got quite slow at the same step, where the log displays: [Sat Jun 8 08:55:39 2024] Processed 32600000 reads (30149746 are used for assembly). The job then timed out after 2 days.

My data is probably not appropriate as it is, since the R1 and R2 fastq.gz files are ~27 GB each, and most of the reads in them will not be IGH reads. If I align beforehand and provide BAM files to TRUST4, I guess it will be able to focus on the IG regions more efficiently? I originally wanted to avoid doing the alignment myself, since most of the workflow is outsourced.

mourisl commented 5 months ago

Is it possible to upgrade to the latest version, v1.1.1? Speed on barcode-based data has improved considerably since v1.1.0.

LukaP-BB commented 5 months ago

I'll try it and get back to you once I've tested it. I naively assumed that conda had installed the latest version. Thanks for your help! :heart: