Merge FASTQs of same sample before TRUST4

liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

MIT License

Merge FASTQs of same sample before TRUST4 #246

Closed: wvictor14 closed this issue 8 months ago

wvictor14 commented 8 months ago

Hi, thank you for the wonderful package; I expect it to be highly useful for my project. I am using TRUST4 to analyse a public 10x 5' scRNA-seq dataset whose data are on SRA.

Each experimental sample (SRA experiment) has multiple SRA runs. For example, there are 4 runs for this SRA experiment:

Each run has two FASTQs: R1 (barcode + UMI) and R2.

For calling TCRs, would you suggest merging the runs of a given experiment (all R1 files together and all R2 files together) before running TRUST4, or is it OK to run TRUST4 on each run individually?

My thinking is that if reads for the same barcode/cell are spread across different runs, it would make sense to merge them first so that they can be assembled together.

mourisl commented 8 months ago

You don't need to manually merge them. You can pass a space-separated list of files to the -1/-2, -u, and --barcode options.

wvictor14 commented 8 months ago

That's pretty convenient!

Like this?

run-trust4 \
  --barcode A_fastq_1 B_fastq_1 \
  --readFormat bc:0:15 \
  -u A_fastq_2 B_fastq_2

mourisl commented 8 months ago

Right.
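For several SRA runs of one sample, the confirmed command could be assembled along these lines. This is only a dry-run sketch: the run accessions and the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming are hypothetical placeholders, and the `-f` / `--ref` reference file names are assumed to follow the TRUST4 README examples. The script prints the command rather than executing it.

```shell
# Hypothetical run accessions for one SRA experiment (placeholders, not real data).
runs="SRR0000001 SRR0000002"

r1_files=""   # R1: barcode (+ UMI) reads
r2_files=""   # R2: cDNA reads
for run in $runs; do
  # Append each run's files; both lists are built in the same run order.
  r1_files="${r1_files:+$r1_files }${run}_1.fastq.gz"
  r2_files="${r2_files:+$r2_files }${run}_2.fastq.gz"
done

# Reference file names are assumptions; adjust to your own paths.
cmd="run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa"
cmd="$cmd --barcode $r1_files --readFormat bc:0:15"
cmd="$cmd -u $r2_files"

# Print the command for inspection instead of running it.
echo "$cmd"
```

Building both lists in the same loop keeps each barcode file in the same position as its matching read file, which seems the safest ordering when several files are passed to one option.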

wvictor14 commented 8 months ago

Thank you for the prompt replies.

One last question: if I don't specify the UMI range in --readFormat, is it automatically taken to be the remaining sequence after the barcode (in this case, 16:-1)?

I saw this recommended code in other issues, but just wanted to make sure.
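To make the ranges in this question concrete, here is a small illustration of what `bc:0:15` and `16:-1` would select from a single R1 read. It assumes the positions are 0-based and inclusive, with `-1` meaning "to the end of the read", which is what a 16-base barcode at `0:15` suggests. The read sequence is a made-up example, and `cut` merely mimics the slicing; it is not how TRUST4 is implemented.

```shell
# Made-up 26-base R1 read: 16-base barcode followed by a 10-base UMI.
read_seq="AAACCTGAGAAACCATTTGCGTACGT"

# bc:0:15 -> 0-based positions 0..15, i.e. characters 1-16 for cut (1-based).
barcode=$(printf '%s\n' "$read_seq" | cut -c1-16)

# 16:-1 -> 0-based position 16 to the end, i.e. characters 17 onward for cut.
umi=$(printf '%s\n' "$read_seq" | cut -c17-)

echo "barcode=$barcode umi=$umi"
```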

mourisl commented 8 months ago

If you need UMI counts, you should also use --UMI to specify the UMI files (--UMI A_fastq_1 B_fastq_1), which are the same as the barcode files in your case. You also need to add um:16:-1 to the --readFormat option.
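Put together, the earlier command with UMI counting might look like the sketch below. The file names are the ones used in this thread, and the comma-separated `--readFormat` value is assumed to follow the form shown in the TRUST4 README. As in the snippet above, the reference options (`-f`, `--ref`) are left out, and the script only prints the command.

```shell
# The same R1 files are passed as both barcode and UMI input.
r1_files="A_fastq_1 B_fastq_1"
r2_files="A_fastq_2 B_fastq_2"

# Barcode at positions 0-15, UMI from position 16 to the end of the read.
read_format="bc:0:15,um:16:-1"

trust4_cmd="run-trust4 --barcode $r1_files --UMI $r1_files"
trust4_cmd="$trust4_cmd --readFormat $read_format -u $r2_files"

# Print the command for inspection instead of running it.
echo "$trust4_cmd"
```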

wvictor14 commented 8 months ago

Would you mind explaining in what cases one would want to specify the UMI? The example in the README, for instance, does not specify a UMI field for 10x data. And this issue https://github.com/liulab-dfci/TRUST4/issues/217 concluded without a UMI specification in the solution.

mourisl commented 8 months ago

In my experience, for regular 5' scRNA-seq data, using UMI counts or read counts to estimate each chain's expression level yields very similar results. I think UMIs are more crucial for amplified data, e.g. TCR-seq or V(D)J-seq, to reduce PCR bias.

wvictor14 commented 8 months ago

I see, thank you for answering my questions, this has been very helpful!