liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

How to use fastq-extractor to deal with paired-UMI sequencing data in trust4 suite #255

Open yqyuhao opened 7 months ago

yqyuhao commented 7 months ago

Dear editor, how can I use fastq-extractor to handle paired-UMI sequencing data in the TRUST4 suite? My library's read structure is 3M3S+T,3M3S+T.

mourisl commented 7 months ago

Sorry, I don't quite get your question. What is paired-UMI sequencing? What does 3M3S+T mean?

yqyuhao commented 7 months ago

Yes. I used the KAPA Universal UMI Adapter to prepare the library. It is a universal UMI adapter for ligation-based library construction prior to sample barcoding in the KAPA HyperCap workflow and the KAPA HyperPETE workflows (Somatic Tissue DNA and Somatic Plasma cfDNA). We usually process the raw sequencing data with fgbio, where the read structure is defined as 3M3S+T: the first 3 bases of the read are extracted as the UMI (3M); the next 3 bases are skipped (3S), since they constitute a punctuation sequence that increases sequence diversity to ensure optimal sequencing performance; and the remaining sequence is kept as the insert read (+T). The UMIs extracted from read 1 and read 2 are stored in the RX tag of the unmapped BAM file as UMI1-UMI2.

mourisl commented 7 months ago

Do you mean the UMI is 6bp, where the first 3 bp are from the beginning of read1 and the other 3bp come from the beginning of read2?

yqyuhao commented 7 months ago

Yes, I don't know how to use fastq-extractor to deal with this data.

mourisl commented 7 months ago

The current implementation cannot handle barcode/UMI across two files. You may need to implement your own script to reformat the files. I will think about how to implement it with the current framework in the future.

yqyuhao commented 7 months ago

Given the current situation, how can I reformat the files to meet the requirements of the fastq-extractor tool?

mourisl commented 7 months ago

You can extract the 3bp from each read and concatenate them to create another file, such as "XXX_barcode.fq". You should also remove those UMI sequences from the read sequences. With those files, you can run TRUST4 through the wrapper "run-trust4" with the extra options "--barcode XXX_barcode.fq --barcodeLevel molecule".
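For illustration, the reformatting described above could be sketched in Python as follows. This is only a minimal sketch, not part of TRUST4: it assumes the fgbio read structure 3M3S+T on both reads (3 UMI bases, then 3 skipped spacer bases, then the insert), that both FASTQ files list the read pairs in the same order, and that discarding the spacer along with the UMI is what you want. The function names and file names are hypothetical.

```python
import gzip
from itertools import islice

UMI_LEN = 3   # 3M: UMI bases at the start of each read (assumption from 3M3S+T)
SKIP_LEN = 3  # 3S: spacer bases after the UMI, assumed to be discarded


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file based on its extension."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def fastq_records(handle):
    """Yield FASTQ records as [header, sequence, plus, quality] lists."""
    while True:
        rec = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(rec) < 4:
            return
        yield rec


def split_pair(rec1, rec2):
    """Return (barcode_record, trimmed_read1, trimmed_read2) for one read pair."""
    cut = UMI_LEN + SKIP_LEN
    (h1, s1, _, q1), (h2, s2, _, q2) = rec1, rec2
    # Concatenate the UMI halves from read 1 and read 2 into one 6bp barcode.
    barcode = [h1, s1[:UMI_LEN] + s2[:UMI_LEN], "+", q1[:UMI_LEN] + q2[:UMI_LEN]]
    # Remove UMI and spacer from both reads, keeping only the insert.
    return barcode, [h1, s1[cut:], "+", q1[cut:]], [h2, s2[cut:], "+", q2[cut:]]


def reformat(r1_in, r2_in, r1_out, r2_out, barcode_out):
    """Write trimmed read files and a barcode file with one record per pair."""
    with open_text(r1_in) as f1, open_text(r2_in) as f2, \
            open_text(r1_out, "wt") as o1, open_text(r2_out, "wt") as o2, \
            open_text(barcode_out, "wt") as ob:
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            bc, t1, t2 = split_pair(rec1, rec2)
            for out, rec in ((ob, bc), (o1, t1), (o2, t2)):
                out.write("\n".join(rec) + "\n")
```

A call such as `reformat("XXX_R1.fq", "XXX_R2.fq", "XXX_R1.trim.fq", "XXX_R2.trim.fq", "XXX_barcode.fq")` would then produce the barcode file to pass with "--barcode XXX_barcode.fq --barcodeLevel molecule", together with the trimmed read files. The barcode records reuse the read 1 header and keep the input order, which is a guess at what the wrapper expects rather than something confirmed in this thread.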

yqyuhao commented 7 months ago

Thank you for providing the information. In the scenario you described, using the UMI from read1 as the barcode sequence and the UMI from read2 as the UMI sequence, while considering only the UMI from read2 as a group during the assembly process, indeed neglects the role of the UMI from read1. This approach may not be applicable in all scenarios as UMIs (Unique Molecular Identifiers) are typically used to mark read pairs that originated from the same original molecule, facilitating their distinction in subsequent analyses.

In a standard UMI processing workflow, the UMIs from read1 and read2 (or additional read pairs, if applicable) should be consistent, allowing them to be used to group all reads originating from the same molecule. This grouping is crucial for accurate data processing in subsequent steps such as deduplication and error correction.

Ignoring the UMI from read1 and only considering the UMI from read2 can lead to the loss of important information, compromising the accuracy of data processing. For instance, if you deduplicate based on the UMI from read2 alone, read pairs from distinct molecules that happen to share that UMI could mistakenly be treated as coming from the same molecule.

Therefore, when designing and implementing a UMI analysis workflow, it is essential to ensure that both the UMI from read1 and read2 (and additional read pairs, if applicable) are properly utilized and considered. This ensures data integrity and accuracy, leading to more reliable analysis results.

mourisl commented 7 months ago

Sorry for the confusion. I mean you need to concatenate the UMI portion from the two reads into one, and then dump them into one fasta file. This file will essentially be the UMI file.

The --UMI option in TRUST4 is only for abundance estimation, for data like 10x Genomics. In your case, the UMI serves as a molecule-level barcode. Therefore, the appropriate approach is to treat this file as a barcode file and use the option "--barcodeLevel molecule" to specify that it is a true UMI.

Just curious: with a 6bp UMI, it is very easy to have UMI conflicts (two molecules using the same UMI). Is that a concern in your data?
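To put a rough number on that concern, here is a small back-of-the-envelope sketch. It assumes UMIs are drawn uniformly at random from all 4^6 = 4096 possible 6bp sequences, which may not hold for a kit that uses a fixed, smaller set of UMI sequences; in that case conflicts would be even more likely. The function name is hypothetical.

```python
def collision_probability(n_molecules, n_umis=4 ** 6):
    """Probability that at least two of n_molecules share a UMI,
    assuming each UMI is drawn uniformly from n_umis possibilities."""
    if n_molecules > n_umis:
        return 1.0  # more molecules than UMIs: a conflict is unavoidable
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= (n_umis - i) / n_umis
    return 1.0 - p_all_distinct


for n in (10, 50, 76, 200):
    print(n, round(collision_probability(n), 3))
```

Under this assumption, the chance of at least one conflict already passes roughly one half at around 76 molecules sharing the same context, so how much of a problem this is depends on whether other information, such as the read sequence itself, also separates molecules.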