milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

How to set up 'assembly-clonotypes-by' argument? #1788

Closed: omegahh closed this issue 2 months ago

omegahh commented 2 months ago

My command for alignment is:

mixcr align -f -t 24 -Xmx160g -b default -p generic-amplicon-with-umi --species hsa --rna --tag-pattern "^(UMI:NNNNtNNNNtNNNN)tN{7:8}(R1:*)\^N{17}(R2:*)" --tag-parse-unstranded --rigid-left-alignment-boundary --floating-right-alignment-boundary C --assemble-clonotypes-by '[FR1,CDR1,CDR2_TO_FR4]' --json-report logs/Library.00_S1A08-UQ09-UT09.mixcr_align.json trim_demux/S1A08-UQ09-UT09_R1.fastq.gz trim_demux/S1A08-UQ09-UT09_R2.fastq.gz tmp/S1A08-UQ09-UT09.vdjca 

My library was produced with a RACE-with-UMI protocol that I designed myself and was sequenced with a PE300 strategy. In theory, it should therefore cover the full VTranscriptWithP sequence, including 5UTR, L1/L2, and VDJRegion. However, there is a coverage gap around FR2, as shown below:

Screenshot 2024-09-12 09 21 13

If I use "VDJRegion", many clones are discarded even though they still have valid sequences in CDR2_TO_FR4. However, if I use CDR2_TO_FR4, the exportClones output lacks sequence information for FR1/CDR1.

I tried a combined option like "[FR1,CDR1,CDR2_TO_FR4]" during alignment, but the software reported that the order was incorrect, and it appears to scramble the order of the features itself.

Screenshot 2024-09-12 09 30 54

I also tried parameters like "{FR1Begin:CDR1End}+{CDR2Begin:FR4End}", but I could not make sense of the resulting error messages.

Screenshot 2024-09-12 09 37 50

How should I handle this situation? I want to make the most of the gene regions that the sequencing data can cover, such as FR1+CDR1+CDR2_TO_FR4.

Additionally, does MiXCR support a flexible strategy? If a clone covers a large area, the assembled region should be wide enough to include more gene features; if the covered area is short, fewer gene features should be used for assembly. For example, I could list possible gene features from complete to partial, like [VDJRegion, FR1+CDR2_TO_FR4, CDR2_TO_FR4], and the software would try them one by one, which could help preserve as many clones and gene features as possible.
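To make the idea concrete, here is a minimal Python sketch of the fallback I have in mind. It is only an illustration of the proposed strategy, not something MiXCR does: the function name `pick_assembling_feature`, the candidate list, and the way coverage is represented (a plain list of covered feature names per clone) are all hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical candidates, ordered from most complete to most partial.
# Each entry maps a label to the gene features that must all be covered.
CANDIDATES: Sequence[Tuple[str, Tuple[str, ...]]] = (
    ("VDJRegion", ("FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4")),
    ("FR1+CDR1+CDR2_TO_FR4", ("FR1", "CDR1", "CDR2", "FR3", "CDR3", "FR4")),
    ("CDR2_TO_FR4", ("CDR2", "FR3", "CDR3", "FR4")),
)


def pick_assembling_feature(
    covered: Iterable[str],
    candidates: Sequence[Tuple[str, Tuple[str, ...]]] = CANDIDATES,
) -> Optional[str]:
    """Return the first (widest) candidate fully covered by the clone.

    `covered` is the set of gene features the clone's reads span.
    Returns None when not even the narrowest candidate is covered.
    """
    covered_set = set(covered)
    for label, required in candidates:
        if covered_set.issuperset(required):
            return label
    return None


# A clone with the FR2 gap falls back to the second candidate.
print(pick_assembling_feature(["FR1", "CDR1", "CDR2", "FR3", "CDR3", "FR4"]))
```

Ordering the candidates from widest to narrowest means each clone keeps as many gene features as its coverage allows, and only clones that miss even CDR2_TO_FR4 would be dropped.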

Looking forward to your reply :)

mizraelson commented 2 months ago

Hi,

^(UMI:NNNNtNNNNtNNNN)tN{7:8}atgggct(R1:*)\^N{17}(R2:*)

or simply:

^(UMI:N{14})N{7}(R1:*)\^N{17}(R2:*)

This one is used without --tag-parse-unstranded, but in that case you have to be sure the UMI is always in R1.
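To show what the simplified pattern is expected to capture, here is a rough Python approximation using two ordinary regular expressions. It is a sketch under stated assumptions, not MiXCR's actual parser: it assumes `N` matches any base, `^` anchors the start of a read, `\` separates the R1 part from the R2 part, and `*` captures the rest of the read. The function name `parse_pair` is hypothetical.

```python
import re
from typing import Dict, Optional

# Approximation of ^(UMI:N{14})N{7}(R1:*)\^N{17}(R2:*) as two regexes,
# one per mate. Assumption: N is any of A, C, G, T or N.
R1_RE = re.compile(r"^(?P<UMI>[ACGTN]{14})[ACGTN]{7}(?P<R1>[ACGTN]*)$")
R2_RE = re.compile(r"^[ACGTN]{17}(?P<R2>[ACGTN]*)$")


def parse_pair(read1: str, read2: str) -> Optional[Dict[str, str]]:
    """Split a read pair into UMI, R1 payload and R2 payload.

    Returns None if either mate is too short to match. The UMI is only
    looked for in read1, which mirrors the stranded case (no
    --tag-parse-unstranded) described above.
    """
    m1 = R1_RE.match(read1.upper())
    m2 = R2_RE.match(read2.upper())
    if m1 is None or m2 is None:
        return None
    return {"UMI": m1.group("UMI"), "R1": m1.group("R1"), "R2": m2.group("R2")}


# Synthetic reads: 14-base UMI + 7 skipped bases + payload in read 1,
# 17 skipped bases + payload in read 2.
example = parse_pair("A" * 14 + "C" * 7 + "GGGTTT", "T" * 17 + "ACGT")
print(example)
```

Because the sketch only looks for the UMI in the first mate, a pair whose mates are swapped would still match on length but would return the wrong 14 bases as the UMI, which is why the UMI has to be in R1 every time.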