How to set up 'assembly-clonotypes-by' argument?

My command for alignment is:

mixcr align -f -t 24 -Xmx160g -b default -p generic-amplicon-with-umi --species hsa --rna --tag-pattern "^(UMI:NNNNtNNNNtNNNN)tN{7:8}(R1:*)\^N{17}(R2:*)" --tag-parse-unstranded --rigid-left-alignment-boundary --floating-right-alignment-boundary C --assemble-clonotypes-by '[FR1,CDR1,CDR2_TO_FR4]' --json-report logs/Library.00_S1A08-UQ09-UT09.mixcr_align.json trim_demux/S1A08-UQ09-UT09_R1.fastq.gz trim_demux/S1A08-UQ09-UT09_R2.fastq.gz tmp/S1A08-UQ09-UT09.vdjca

My library was produced with a RACE-with-UMI protocol that I designed myself and was sequenced with a PE300 strategy. In theory, it should therefore cover the full VTranscriptWithP sequence, including 5UTR, L1/L2, and VDJRegion. However, there is a coverage gap around FR2, as shown in the following:

If I use "VDJRegion", many clones are discarded even though they still have valid sequences in CDR2_TO_FR4. However, if I use CDR2_TO_FR4, the exportClones output lacks sequence information for FR1/CDR1.

I tried passing a combined feature list such as "[FR1,CDR1,CDR2_TO_FR4]" during alignment, but the software reported that the order was incorrect, and it seems that the software scrambles the order of the features.

I also tried a parameter like "{FR1Begin:CDR1End}+{CDR2Begin:FR4End}", but I could not understand the error messages the software produced.

How should I handle this situation? I want to make the most of the gene regions that are covered by the sequencing data, such as FR1+CDR1+CDR2_TO_FR4.

Additionally, does MiXCR support a flexible strategy? If a clone covers a large area, the assembled region should be wide enough to cover more gene features, but if the covered area is short, fewer gene features should be used for assembly. For example, I could list candidate gene features from complete to partial, like [VDJRegion, FR1+CDR2_TO_FR4, CDR2_TO_FR4], and the software would try them one by one, which could help preserve as many clones and gene features as possible.

Looking forward to your reply :)

Hi,

Is there a reason you don’t use the mixcr analyze command?
You can set the feature to: --assemble-clonotypes-by [{FR1Begin:CDR1End},{CDR2Begin:FR4End}].
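
For illustration, here is a rough sketch of how that could look as a single mixcr analyze call. It reuses the preset, species, boundary options, and input paths from the align command in the question; the simplified tag pattern (UMI always in R1) and the output prefix tmp/S1A08-UQ09-UT09 are assumptions, so adjust them to your data. The script only builds and prints the command (a dry run) and does not call MiXCR itself.

```shell
#!/usr/bin/env bash
# Dry-run sketch: builds the command and prints it instead of running MiXCR.
sample="S1A08-UQ09-UT09"

cmd=(
  mixcr analyze generic-amplicon-with-umi
  --species hsa
  --rna
  # Assumed pattern: only valid if the UMI is always in R1.
  --tag-pattern '^(UMI:N{14})N{7}(R1:*)\^N{17}(R2:*)'
  --rigid-left-alignment-boundary
  --floating-right-alignment-boundary C
  # Single quotes keep the shell from touching the brackets and braces.
  --assemble-clonotypes-by '[{FR1Begin:CDR1End},{CDR2Begin:FR4End}]'
  "trim_demux/${sample}_R1.fastq.gz"
  "trim_demux/${sample}_R2.fastq.gz"
  "tmp/${sample}"   # assumed output prefix
)

# Print one argument per line for inspection.
printf '%s\n' "${cmd[@]}"

# To run it for real, replace the printf line with: "${cmd[@]}"
```

Keeping the arguments in an array makes it easy to check the quoting of the feature expression before anything is executed.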
The combination of parameters --tag-parse-unstranded --tag-pattern "^(UMI:NNNNtNNNNtNNNN)tN{7:8}(R1:*)\^N{17}(R2:*)" doesn’t quite make sense because there are no anchor points to determine whether 7 or 8 nucleotides should be skipped. The same applies to --tag-parse-unstranded, as there is no sequence to determine in which read the UMI is located. It should be, for example:

^(UMI:NNNNtNNNNtNNNN)tN{7:8}atgggct(R1:*)\^N{17}(R2:*)

or simply:

^(UMI:N{14})N{7}(R1:*)\^N{17}(R2:*) without --tag-parse-unstranded, but in that case you have to be sure the UMI is always in R1.
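
To make the anchor-point issue concrete, here is a small sketch that uses ordinary bash regular expressions as a stand-in for the tag pattern. This is not MiXCR's pattern engine, and the read sequence is made up (UMI block, then T, an 8-nt spacer, the example anchor ATGGGCT, and a short insert). Without the anchor, skipping 7 and skipping 8 bases both match, so the start of R1 is arbitrary; with the anchor, only one spacer length fits.

```shell
#!/usr/bin/env bash
# Made-up read: 14-nt UMI block + T + 8-nt spacer + ATGGGCT anchor + insert.
read_seq="ACGATCCGATGGCATCATCATCAATGGGCTGAGGTGCAGC"
umi='[ACGT]{4}T[ACGT]{4}T[ACGT]{4}'   # regex stand-in for NNNNtNNNNtNNNN

unanchored_hits=0
anchored_hits=0
anchored_r1=""

for skip in 7 8; do
  # No anchor: any 7 or 8 bases can be skipped, so both lengths match.
  re_plain="^(${umi})T[ACGT]{${skip}}(.*)$"
  if [[ $read_seq =~ $re_plain ]]; then
    unanchored_hits=$((unanchored_hits + 1))
    echo "no anchor,   skip=${skip}: R1=${BASH_REMATCH[2]}"
  fi

  # With anchor: the fixed sequence after the spacer pins down its length.
  re_anchor="^(${umi})T[ACGT]{${skip}}ATGGGCT(.*)$"
  if [[ $read_seq =~ $re_anchor ]]; then
    anchored_hits=$((anchored_hits + 1))
    anchored_r1="${BASH_REMATCH[2]}"
    echo "with anchor, skip=${skip}: R1=${anchored_r1}"
  fi
done

echo "matches without anchor: ${unanchored_hits}, with anchor: ${anchored_hits}"
```

The same reasoning applies to --tag-parse-unstranded: without a fixed sequence to look for, nothing distinguishes the read that carries the UMI from the one that does not.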

Regarding the “flexible strategy,” you can assemble clones by CDR3 during the assemble step preserving the alignments with -Massemble.clnaOutput=true and then use mixcr assembleContigs to extend the sequence as much as possible.
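
As a rough sketch, the two steps could be chained as below. Passing -Massemble.clnaOutput=true to mixcr analyze, the .clna file name derived from the output prefix, and the .contigs.clns output name are assumptions on my side, as is the simplified tag pattern; please check them against your MiXCR version. The script only prints the two commands (a dry run).

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running MiXCR.
sample="S1A08-UQ09-UT09"
prefix="tmp/${sample}"   # assumed output prefix

# Step 1: assemble clones by CDR3 and keep the alignments (.clna output).
assemble_cmd=(
  mixcr analyze generic-amplicon-with-umi
  --species hsa
  --rna
  # Assumed pattern: only valid if the UMI is always in R1.
  --tag-pattern '^(UMI:N{14})N{7}(R1:*)\^N{17}(R2:*)'
  --rigid-left-alignment-boundary
  --floating-right-alignment-boundary C
  --assemble-clonotypes-by CDR3
  -Massemble.clnaOutput=true
  "trim_demux/${sample}_R1.fastq.gz"
  "trim_demux/${sample}_R2.fastq.gz"
  "${prefix}"
)

# Step 2: extend each clone as far as its reads allow.
# Input and output file names here are assumptions.
contigs_cmd=(mixcr assembleContigs "${prefix}.clna" "${prefix}.contigs.clns")

echo "${assemble_cmd[*]}"
echo "${contigs_cmd[*]}"

# To run for real: "${assemble_cmd[@]}" && "${contigs_cmd[@]}"
```

With this approach no clone is dropped for missing FR1 or FR2 coverage, since only CDR3 is required at the assemble step, and the extension step recovers whatever additional sequence each clone supports.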

milaboratory / mixcr

How to set up 'assembly-clonotypes-by' argument? #1788