mhalushka / miRge3.0

Comprehensive analysis of small RNA sequencing data
MIT License

UMI #86

Open hamzaaitabbou96 opened 11 months ago

hamzaaitabbou96 commented 11 months ago

Hi Arun,

Can you explain how to remove PCR duplicates and UMIs? Could you also share the libraries for this and how to use them? I only want to do this part: "remove PCR duplicates and UMI".

arunhpatil commented 11 months ago

Hi @hamzaaitabbou96,

Yes. The UMIs I have come across are of two types, Illumina and Qiagen. Illumina-based (4N) UMIs have four random nucleotides on either side of the template (miRNA), followed by the 3' adapter sequence. I have explained how miRge3.0 removes PCR duplicates, with an example, here.

We also have a FAQ about UMIs, which can be found here.

In miRge3.0, UMI handling is integrated into the cutadapt adapter-removal step. If you follow the function code here, you can get an idea of how it works.

Here is a short example for Illumina 4N:

AGTGTGAGGTAGTAGGTTGTATAGTTCTACADAPTERSEQ
AGTGTGAGGTAGTAGGTTGTATAGTTCTACADAPTERSEQ
AGTGTGAGGTAGTAGGTTGTATAGTTCTACADAPTERSEQ
AGGGTGAGGTAGTAGGTTGTATAGTTCATAADAPTERSEQ

In the above example, there are four reads, each containing the same miRNA sequence (TGAGGTAGTAGGTTGTATAGTT). First, trim the adapter sequence, as it is not required. You are then left with a 4N UMI on either end of the sequence.

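If you look closely, three of the reads are identical, so they are PCR duplicates. Join the two 4N UMIs of each read and record each UMI and miRNA pair with its count:

AGTGCTAC,TGAGGTAGTAGGTTGTATAGTT - 3
AGGGCATA,TGAGGTAGTAGGTTGTATAGTT - 1

Since reads with an identical UMI and miRNA pair are treated as PCR duplicates, the total read count for this miRNA in this example is 2, not 4.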
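The counting for the 4N example can be sketched in a few lines of Python. This is only an illustration of the idea, not the miRge3.0 code: the function name `split_4n_read` is made up here, and `ADAPTERSEQ` is the placeholder adapter from the example reads rather than a real adapter sequence.

```python
from collections import Counter

ADAPTER = "ADAPTERSEQ"  # placeholder adapter from the example, not a real sequence
UMI_LEN = 4             # 4N: four random nucleotides on each side of the insert


def split_4n_read(read, adapter=ADAPTER, umi_len=UMI_LEN):
    """Return (umi, insert) for one read, or None if the read cannot be split.

    Hypothetical helper for illustration; the adapter is located by exact match.
    """
    idx = read.find(adapter)
    if idx == -1:
        return None  # adapter not found
    trimmed = read[:idx]  # drop the adapter and everything after it
    if len(trimmed) <= 2 * umi_len:
        return None  # nothing left between the two UMIs
    umi = trimmed[:umi_len] + trimmed[-umi_len:]  # join 5' and 3' 4N UMIs
    insert = trimmed[umi_len:-umi_len]            # the miRNA sequence
    return umi, insert


reads = [
    "AGTGTGAGGTAGTAGGTTGTATAGTTCTACADAPTERSEQ",
    "AGTGTGAGGTAGTAGGTTGTATAGTTCTACADAPTERSEQ",
    "AGTGTGAGGTAGTAGGTTGTATAGTTCTACADAPTERSEQ",
    "AGGGTGAGGTAGTAGGTTGTATAGTTCATAADAPTERSEQ",
]

# Raw count of every (UMI, miRNA) pair, duplicates included.
pair_counts = Counter(p for p in map(split_4n_read, reads) if p is not None)

# Deduplicated count: each distinct (UMI, miRNA) pair contributes one read.
dedup_counts = Counter(insert for (_umi, insert) in pair_counts)

for (umi, insert), n in pair_counts.items():
    print(f"{umi},{insert} - {n}")
print(dict(dedup_counts))
```

Keeping the raw pair counts next to the deduplicated counts makes it easy to see how many reads each UMI collapsed.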

The approach is similar for Qiagen, but the UMI is 12 nucleotides long, which is much better than the 4N method because a longer UMI allows far more distinct tags (4^12 versus 4^8 for the combined 4N pair). In Qiagen reads, the UMI is sandwiched between two adapters at the 3' end of the miRNA:

TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATAGTGCTACCATAADAPTERSEQ

In this case we know the length of the Qiagen adapter, which is followed by the 12N UMI, so we trim the adapter, fetch the UMI, and attach it back to the miRNA as shown below:

TGAGGTAGTAGGTTGTATAGTT AGTGCTACCATA

Now we count duplicates, that is, reads with the same miRNA sequence and UMI pair. If you follow supplementary Fig. 3, you will get a clear picture.
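A rough Python sketch of the Qiagen case follows, again as an illustration and not the miRge3.0 implementation. The adapter string is inferred from the example read by removing the miRNA and the 12 nt UMI, so treat it as an assumption. The name `split_qiagen_read` is hypothetical, and the second and third reads are invented to show the counting.

```python
from collections import Counter

# Assumption: adapter inferred from the example read (read minus miRNA and UMI).
QIAGEN_ADAPTER = "AACTGTAGGCACCATCAAT"
UMI_LEN = 12  # 12N UMI that follows the adapter


def split_qiagen_read(read, adapter=QIAGEN_ADAPTER, umi_len=UMI_LEN):
    """Return (mirna, umi) for one read, or None if the read cannot be split.

    Hypothetical helper for illustration; the adapter is located by exact match.
    """
    idx = read.find(adapter)
    if idx <= 0:
        return None  # adapter missing, or no insert in front of it
    umi_start = idx + len(adapter)
    umi = read[umi_start:umi_start + umi_len]
    if len(umi) < umi_len:
        return None  # read ends before a full UMI
    return read[:idx], umi


reads = [
    # Read from the example above.
    "TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATAGTGCTACCATAADAPTERSEQ",
    # Invented for illustration: an exact copy (PCR duplicate).
    "TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATAGTGCTACCATAADAPTERSEQ",
    # Invented for illustration: same miRNA with a different UMI.
    "TGAGGTAGTAGGTTGTATAGTTAACTGTAGGCACCATCAATCCGTTAGGACTAADAPTERSEQ",
]

# Raw count of every (miRNA, UMI) pair, duplicates included.
pair_counts = Counter(p for p in map(split_qiagen_read, reads) if p is not None)

# Deduplicated count: each distinct (miRNA, UMI) pair contributes one read.
dedup_counts = Counter(mirna for (mirna, _umi) in pair_counts)

print(dict(dedup_counts))
```

With these three reads, the first two collapse into one, so the miRNA is counted twice.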

Thank you, Arun.