hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

Confusion about mapping reads to repetitive elements. #36

Closed: broochpawN closed this issue 4 years ago

broochpawN commented 4 years ago

Hi,

Thank you for your efforts to create and maintain this good package, as well as the wiki documents about RepBase. May I ask a question about it?

Based on my understanding, the repetitive elements are taken into account during the process. But I found that the hg.fa file in the 'scripts' directory contains, besides Homo sapiens, entries labeled with other taxa such as Mammalia, Eutheria, Primates, etc. Why are these also included? I would expect that only the repetitive elements in Homo sapiens are considered.

I guess this could lead to an overestimate of the reads from repetitive elements, because some of them may be mapped to elements from other species. Does this matter? For example, I searched for AGGCGGGCGGATCACGAGG in the hg.fa file and found many matches in Primates entries; cgtagtggcgggcgcctgtagtcctagctacttgggaggctgaggcaggagaatggcgtgaacccgggag maps only to ">AluYa8 SINE1/7SL Primates" and not to any Homo sapiens entry.
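The kind of search described above can be reproduced with a short script. This is only a minimal sketch: it assumes the FASTA headers follow the "name class taxon" layout seen in ">AluYa8 SINE1/7SL Primates" (tab- or space-separated), it checks the forward strand only, and the toy records and helper names (`read_fasta`, `taxon_of`, `count_motif_by_taxon`) are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def taxon_of(header: str) -> str:
    """Return the third header field, ASSUMED to be the taxon label."""
    fields = header.split("\t") if "\t" in header else header.split(None, 2)
    return fields[2].strip() if len(fields) >= 3 else "unknown"


def count_motif_by_taxon(lines: Iterable[str], motif: str) -> Counter:
    """Count entries containing the motif (forward strand only), per taxon."""
    motif = motif.upper()
    counts = Counter()
    for header, seq in read_fasta(lines):
        if motif in seq.upper():
            counts[taxon_of(header)] += 1
    return counts


# Toy records (invented, not taken from RepBase) to show the call.
toy = """>toyA SINE1/7SL Primates
ggccAGGCGGGCGGATCACGAGGtcagg
>toyB L1 Homo sapiens
ttttAGGCGGGCGGATCACGAGGaaaa
>toyC SINE2/tRNA Mammalia
acgtacgtacgt
""".splitlines()

print(count_motif_by_taxon(toy, "AGGCGGGCGGATCACGAGG"))
```

To run it on a real file, pass an open file handle instead of `toy`, for example `count_motif_by_taxon(open("hg.fa"), "AGGCGGGCGGATCACGAGG")`.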

Thus, I have a worry about this problem.

Looking forward to your reply.

Thanks! Have a nice day!

hyunhwan-jeong commented 4 years ago

Correct me if I am wrong, but I believe some repeat elements were inherited from ancestral lineages (such as Primates). In particular, AluYa8, which you mentioned in your issue, is present in the human genome. Here is an example showing the repeat in the human genome: https://genome-euro.ucsc.edu/cgi-bin/hgc?c=chr7&l=65847413&r=65847483&o=65847313&t=65847623&g=rmsk&i=AluYa8

In other words, the human genome has some Primates repeats, and I would say it is fine to include them. You can omit some of them, but if you want to include only Homo sapiens and need my help, please let me know. I will be glad to help you.

Best Regards,

Hyun-Hwan Jeong

broochpawN commented 4 years ago

Hi Hyun-Hwan, sorry for the late reply. I appreciate your reply and your kind help.

I agree with you that the human genome has some repeats similar to those in primate or other mammalian genomes. Actually, I just have one concern:

I am using STAR to map the reads to the repetitive elements. The parameter --outFilterMultimapNmax (the STAR default is 10; 20 is the value in STAR's ENCODE options) sets the maximum number of loci a read may map to. If a read exceeds it, no alignments are reported and the read is effectively treated as unmapped. Now imagine a read from a repetitive region and two repeat references: (1) contains only 10 Homo sapiens repeat sequences, and (2) contains those 10 plus 30 from other, similar taxa. The read may have 5 alignments to the first reference and 25 to the second. With --outFilterMultimapNmax 20, the read is counted as coming from repetitive elements with the first reference, but not with the second (because it exceeds the limit and is treated as unmapped).

So, to avoid reads being treated as unmapped because they have too many alignments, I am thinking of increasing this parameter when I map my reads to a repeat reference (like yours) that contains many similar elements. This is why I asked whether the repetitive elements from other species matter.

Besides, I can extract the subset of Homo sapiens repetitive elements and build the reference myself. Thank you! You are so kind! If possible, could I use the hg.fa file from your GitHub repository? I mean, if I want to consider only the Homo sapiens repetitive elements, I would take a subset of your file.
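The arithmetic in this example can be written out as a small sketch. It is not STAR code. It only mimics the threshold rule described above (a read is kept when it maps to no more loci than the limit), and the function name `passes_multimap_filter` and the scenario labels are hypothetical.

```python
def passes_multimap_filter(n_loci: int, multimap_nmax: int) -> bool:
    """Return True if a read with n_loci alignments is kept under the limit."""
    return 1 <= n_loci <= multimap_nmax


# Alignment counts from the example: 5 against the human-only repeat set,
# 25 against the set that also contains repeats from other taxa.
scenarios = {
    "reference 1 (human-only repeats)": 5,
    "reference 2 (human + other taxa)": 25,
}

for limit in (20, 30):
    for name, n_loci in scenarios.items():
        status = "kept" if passes_multimap_filter(n_loci, limit) else "treated as unmapped"
        print(f"limit={limit:>2}  {name}: {n_loci} loci -> {status}")
```

With a limit of 20 only the first scenario passes; raising the limit above 25 lets both pass, which is the adjustment being considered here.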
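Taking such a subset could look like the sketch below. It assumes the taxon is the third field of each FASTA header (as in ">AluYa8 SINE1/7SL Primates"); the helper names and the toy records are invented for illustration. Note that, as pointed out earlier in the thread, elements labeled Primates (AluYa8, for example) do occur in the human genome, so a strict "Homo sapiens" filter would drop them.

```python
import io
from typing import Iterable, TextIO


def taxon_of(header_line: str) -> str:
    """Return the third header field, ASSUMED to be the taxon label."""
    header = header_line.lstrip(">").strip()
    fields = header.split("\t") if "\t" in header else header.split(None, 2)
    return fields[2].strip() if len(fields) >= 3 else "unknown"


def subset_fasta_by_taxon(lines: Iterable[str], keep_taxon: str, out: TextIO) -> int:
    """Copy records whose taxon equals keep_taxon to out; return how many were kept."""
    keep = False
    kept = 0
    for line in lines:
        if line.startswith(">"):
            keep = taxon_of(line).lower() == keep_taxon.lower()
            if keep:
                kept += 1
        if keep:
            out.write(line if line.endswith("\n") else line + "\n")
    return kept


# Toy input (invented records) to demonstrate the call.
TOY_FASTA = (
    ">toyA SINE1/7SL Primates\nACGT\n"
    ">toyB L1 Homo sapiens\nACGTACGT\n"
    ">toyC SINE2/tRNA Mammalia\nTTTT\n"
)

result = io.StringIO()
n_kept = subset_fasta_by_taxon(io.StringIO(TOY_FASTA), "Homo sapiens", result)
print(n_kept)
print(result.getvalue(), end="")

# With real files (the output name is hypothetical):
# with open("hs.fa") as src, open("hs_only.fa", "w") as dst:
#     subset_fasta_by_taxon(src, "Homo sapiens", dst)
```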

Best Regards, Sihan

hyunhwan-jeong commented 4 years ago

@broochpawN,

I guess this could be a different story. What are your reference sequences? Are they a genome or a set of repeats?

You are free to use the hs.fa file on GitHub, but please be aware that it originally comes from RepBase.

Thank you,

Hyun-Hwan Jeong

broochpawN commented 4 years ago

@hyunhwaj

Hi! My reference sequences are just repeats, because I want to remove the reads from repetitive elements. I plan to map the reads to the repeats first, remove the mapped ones, and then use the unmapped ones for the downstream analysis.

Thanks for kindly sharing the data!
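A minimal sketch of that plan, assuming a STAR index has already been built from the repeat FASTA (with `STAR --runMode genomeGenerate`) and single-end reads. The directory, file names, and the helper `build_star_repeat_filter_cmd` are hypothetical; the STAR options used are `--genomeDir`, `--readFilesIn`, `--outFilterMultimapNmax`, `--outReadsUnmapped Fastx`, and `--outFileNamePrefix`. The script only builds and prints the command, it does not run STAR.

```python
import shlex
from typing import List


def build_star_repeat_filter_cmd(
    genome_dir: str,
    reads_fastq: str,
    out_prefix: str,
    multimap_nmax: int = 20,
    threads: int = 4,
) -> List[str]:
    """Build a STAR command that maps reads to a repeat index and writes unmapped reads."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,           # index built from the repeat FASTA (assumed)
        "--readFilesIn", reads_fastq,
        "--outFilterMultimapNmax", str(multimap_nmax),
        "--outReadsUnmapped", "Fastx",       # unmapped reads go to <prefix>Unmapped.out.mate1
        "--outFileNamePrefix", out_prefix,
    ]


# Hypothetical paths, for illustration only.
cmd = build_star_repeat_filter_cmd("repeat_index", "sample.fastq", "sample_repeats_")
print(shlex.join(cmd))
```

STAR reports reads that exceed the multimapping limit under "mapped to too many loci" in Log.final.out, so it is worth checking that number before treating the unmapped output as repeat-free.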

Best, Sihan