milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

UMI list options #696

Closed arnavaz closed 1 year ago

arnavaz commented 2 years ago

Hello, Thanks for the update on the inclusion of UMIs and error correction in mixcr. This is a really fantastic addition to mixcr. However, in our lab we use a customized list of UMIs commercialized by IDT, and we were wondering whether it could be included in the built-in options for this version of mixcr. Thank you very much in advance. Arna

mizraelson commented 2 years ago

Hi Arna, MiXCR supports any custom UMI pattern. Briefly, as an example:

mixcr align \
-s mmu \
--tag-pattern "^(UMI:N{15})(R1:*)\^(R2:*)" \
input_R1.fastq input_R2.fastq output.vdjca

This pattern means that the first 15 nucleotides of every read in input_R1.fastq will be extracted as a UMI tag. All reads from input_R2.fastq will be taken as is.
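
To make the effect of the R1 part of the pattern concrete, here is a minimal Python sketch of what `^(UMI:N{15})(R1:*)` does to a single R1 sequence. It only mimics the behaviour described above with a regular expression; it is not how MiXCR implements tag parsing, and `split_umi` and the example read are hypothetical.

```python
import re

# Rough regex equivalent of ^(UMI:N{15})(R1:*): 15 arbitrary bases, then the rest.
UMI_PATTERN = re.compile(r"^(?P<UMI>[ACGTN]{15})(?P<R1>.*)$")


def split_umi(read_seq):
    """Return (umi, remainder) for one R1 sequence, or None if it does not match."""
    match = UMI_PATTERN.match(read_seq)
    if match is None:
        return None
    return match.group("UMI"), match.group("R1")


# Made-up read: a 15 nt UMI followed by 12 nt of payload.
example = "ACGTACGTACGTACG" + "TTGGCCAATTGG"
print(split_umi(example))
```

A read shorter than 15 nucleotides yields `None` in this sketch, which corresponds to a read that does not match the pattern.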

We are currently working on detailed documentation, which will be available soon. Meanwhile, I would be happy to help you with the pattern syntax if you can share the sequence of the UMI adapter that you use.

Mark

arnavaz commented 2 years ago

Sure, here is the list of UMIs we use in our data. We truly appreciate your help and support. Regards, Arna

IDT_barcodes.txt

mizraelson commented 2 years ago

I see. Soon we will release an update where it will be possible to filter barcodes based on a given list. For now, you can try the following pattern for that barcode: (UMI:N{5}). If you can tell me the context of your UMI, I can help with the exact pattern to use. For example:

"^(UMI:N{4})ATGCCCGTAA(R1:*) | ^(UMI:N{5})ATGCCCGTAA(R1:*) \^(R2:*)"

This pattern means that the UMI group is located at the beginning (^) of R1. The UMI can be 4 or 5 nucleotides long and is followed by the known sequence ATGCCCGTAA. The UMI and this sequence will be cut from R1 after alignment. R2 will be used as is.
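
The two alternatives for R1 can be illustrated with a short Python sketch. It is only a regex approximation of the behaviour described above, not MiXCR's own matching logic; `parse_r1` is a hypothetical helper and the test reads are made up.

```python
import re

ADAPTER = "ATGCCCGTAA"  # known sequence from the example pattern above

# One regex per alternative of the pattern: a 4 nt UMI or a 5 nt UMI.
ALTERNATIVES = [
    re.compile(r"^([ACGTN]{4})" + ADAPTER + r"(.*)$"),
    re.compile(r"^([ACGTN]{5})" + ADAPTER + r"(.*)$"),
]


def parse_r1(read_seq):
    """Return (umi, trimmed_r1) for the first alternative that matches, else None."""
    for pattern in ALTERNATIVES:
        match = pattern.match(read_seq)
        if match:
            return match.group(1), match.group(2)
    return None


print(parse_r1("ACGT" + ADAPTER + "GGGTTT"))   # 4 nt UMI
print(parse_r1("ACGTC" + ADAPTER + "GGGTTT"))  # 5 nt UMI
```

Because the known sequence must follow the UMI immediately, a given read can satisfy at most one of the two alternatives here, so the order in which they are tried does not change the result.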

It is worth noting that 4-5 nt UMIs (also restricted to a particular short list of variants) will most likely not be enough for proper error correction. The diversity of these UMIs (64 different sequences) is several orders of magnitude lower than the diversity of immune receptor cDNA/DNA molecules, unless you work with a very narrow population of cells.
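
A back-of-the-envelope calculation illustrates the point. Assuming, as a simplification, that each molecule receives one of k UMIs uniformly at random, the chance that a given molecule's UMI is shared with none of the other n - 1 molecules is (1 - 1/k)^(n - 1). The molecule counts and the 12 nt comparison below are arbitrary illustrative values, not measurements.

```python
def fraction_unique(num_umis, num_molecules):
    """Probability that a molecule's UMI is not shared by any other molecule,
    assuming UMIs are assigned uniformly at random."""
    return (1 - 1 / num_umis) ** (num_molecules - 1)


# 64 = size of the UMI list discussed above; 4**5 = all possible 5 nt UMIs;
# 4**12 = all possible 12 nt UMIs (hypothetical longer UMI for comparison).
for k in (64, 4 ** 5, 4 ** 12):
    for n in (100, 10_000):
        print(f"UMIs={k:>10,}  molecules={n:>6,}  unique fraction={fraction_unique(k, n):.3f}")
```

Under this simple model, almost every molecule shares its UMI with another one once the number of molecules clearly exceeds the number of available UMIs, so the UMI alone cannot separate molecules.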

arnavaz commented 2 years ago

Thanks a lot for the detailed explanation. I totally understand your concern regarding the diversity of the UMIs. The reason we use this list in our DNA profiling is that our in-house tool for UMI processing and error correction uses both the UMI information and the genome location of the DNA molecule. Using these two pieces of information for each read, our in-house pipeline corrects for duplicate reads and builds consensus reads. When I discussed the mixcr error correction tool with my supervisor, he wondered whether the genome location factor could be incorporated into your error correction pipeline. That way our short list of UMIs could also be helpful. I am attaching a link to the tool we use for error correction for genome, exome etc.: https://academic.oup.com/nar/article/47/15/e87/5498633?login=true Appreciate your support. Regards. Arna
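
For readers unfamiliar with this style of deduplication, the idea can be sketched roughly as follows. This is only an illustration of grouping reads by (UMI, genome position) and taking a per-base majority vote; it is not the implementation from the linked paper. The read tuples and the `consensus_by_umi_and_position` name are made up, and all reads within a group are assumed to have the same length.

```python
from collections import Counter, defaultdict


def consensus_by_umi_and_position(reads):
    """Group reads by (umi, position) and build a majority-vote consensus.

    `reads` is an iterable of (umi, position, sequence) tuples; sequences in
    the same group are assumed to have equal length.
    """
    groups = defaultdict(list)
    for umi, position, sequence in reads:
        groups[(umi, position)].append(sequence)

    consensus = {}
    for key, sequences in groups.items():
        # zip(*sequences) yields one column of bases per read position.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


reads = [
    ("ACGTA", 1000, "GATTACA"),
    ("ACGTA", 1000, "GATTACA"),
    ("ACGTA", 1000, "GATTTCA"),  # one read with a sequencing error
    ("ACGTA", 2500, "CCCGGGA"),  # same UMI, different location: separate group
]
print(consensus_by_umi_and_position(reads))
```

The second key component (position) is what keeps the two molecules with the same short UMI apart in this sketch.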

mizraelson commented 2 years ago

This method does work for low-frequency mutations because the diversity of a given region with its variants is relatively low compared to TCR/BCR diversity. Thus I don't see how such short UMIs can be used for immune repertoire error correction. What these UMIs might help with is determining clone counts more precisely, and we will add this feature in the upcoming update.

zznx commented 2 years ago

Hi, can the --tag-pattern parameter be used for 3'RACE?

mizraelson commented 2 years ago

I don't see why not; any arbitrary pattern can be set. Does 3'RACE mean that your data contains the 3' end of the C gene?

zznx commented 2 years ago

"Does 3'RACE mean that your data contains 3' end of the C gene?" You can think of it this way. In other words, I want to get rid of the adapter on the right side of the reads. How do I do that? Thank you for your reply.

mizraelson commented 2 years ago

If I understand correctly, you have an adapter sequence at the beginning of R2. If that is the case, then e.g.:

^(R1:*) \ ^N{18}(R2:*)

This simple pattern will trim the first 18 nucleotides from R2.

If you can describe your library structure in more detail, I can help you write the tag pattern.

Also, please do check our documentation page on tag pattern syntax.

zznx commented 2 years ago

Hi, let me make it clear that this is SE data:

mixcr align \
-s human \
--tag-pattern "^GTAAAA(R1:*) \ ^GTGAGTCGTATTA(R2:*)" \
-t 16 -f \
-OminSumScore=200 \
-OsaveOriginalReads=true \
input.fastq output_align.vdjca

But it doesn't work.

mizraelson commented 2 years ago

So your SE read has GTAAAA at the 5' end and GTGAGTCGTATTA at the 3' end?

Then it should be:

^GTAAAA(R1:*)GTGAGTCGTATTA

If the GTGAGTCGTATTA sequence is present in your actual reads in reverse-complement form, replace it with its reverse complement.
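
If it helps, the reverse complement of an adapter can be computed with a few lines of Python. This is a generic sketch that is independent of MiXCR; `reverse_complement` is just an illustrative helper name.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


adapter = "GTGAGTCGTATTA"
print(reverse_complement(adapter))
```

The printed sequence is what would take the place of GTGAGTCGTATTA in the pattern if the reads carry the adapter in reverse-complement orientation.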

zznx commented 2 years ago

Thank you very much. It's working now