Repetitive elements coordinates

nerettilab / RepEnrich2

RepEnrich2 is an updated method to estimate repetitive element enrichment using high-throughput sequencing data.

Repetitive elements coordinates #9

Closed iramai closed 5 years ago

iramai commented 5 years ago

Hi again! I would like to know if it is possible to add the genomic coordinates to each repetitive element obtained after running the RepEnrich2 protocol, perhaps by adding an option after the mapping step in RepEnrich.py. When I have the final file after the protocol, ready for the DE analysis with edgeR, I only have the names of the repetitive elements with the family and the group names, but not the coordinates to know which specific repetitive element it is. I wanted this information to view the landscape in UCSC. Thanks

nskvir commented 5 years ago

Hi there, Because of the way that RepEnrich works, finding precise genomic coordinates for where repeats map is not really an option (unless you are interested only in reads that uniquely map, in which case you can just use a normal aligner and filter for this).

RepEnrich creates a list of 'pseudogenomes' composed of concatenated sequences for each type of repeat obtained from annotation from RepeatMasker. Counts are then generated based on whether reads map at least once to any of the pseudogenomes - so we can tell whether a read belongs to a family or class, but not specifically where it mapped on the genome (as it will map to multiple locations for each repeat). It is possible to refer to the RepeatMasker annotation (used to run the setup step) for the genomic locations of each instance of repeats from the classes or families of interest, but determining which particular genomic coordinate was mapped to would not be reliable.
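As a rough sketch of that annotation lookup (not part of RepEnrich2 itself), the Python below groups records by repeat name. It assumes the annotation is in the standard whitespace-separated RepeatMasker .out layout, with the query sequence, begin and end in columns 5 to 7, the strand in column 9, and the repeat name and class/family in columns 10 and 11. The function name `repeat_instances` and the example records are made up for illustration.

```python
from collections import defaultdict

# Made-up records laid out like RepeatMasker .out lines (illustration only).
EXAMPLE_OUT = """\
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin  end   (left)  repeat    class/family  begin  end  (left)  ID

 1000   10.0  0.0  0.0  chr1      3001   3500  (1000)  +  L1Md_A  LINE/L1    1  500  (0)  1
  800   12.0  0.0  0.0  chr1      9001   9100  (500)   C  L1Md_A  LINE/L1    (0)  100  1  2
  600    8.0  0.0  0.0  chr2       101    200  (900)   +  B1_Mm   SINE/Alu   1  100  (0)  3
"""


def repeat_instances(lines):
    """Group RepeatMasker .out records by repeat name (hypothetical helper)."""
    instances = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines: data lines start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom, begin, end = fields[4], int(fields[5]), int(fields[6])
        # RepeatMasker marks the reverse strand with "C".
        strand = "-" if fields[8] == "C" else "+"
        name, class_family = fields[9], fields[10]
        instances[name].append((chrom, begin, end, strand, class_family))
    return dict(instances)


by_name = repeat_instances(EXAMPLE_OUT.splitlines())
for chrom, begin, end, strand, class_family in by_name["L1Md_A"]:
    print(chrom, begin, end, strand, class_family)
```

This only lists where each annotated instance of a repeat sits on the genome; it says nothing about which instance a given read came from.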

Best, Nick

iramai commented 5 years ago

Ok, I understand. My question was more about finding a way to visualize the landscapes of the differentially expressed repetitive elements in UCSC. Do you know a way that I can do that? What about computing the coverage from the produced .bam file or something like that? I do not know if it sounds crazy, or maybe it could be a way to do what I have in mind. Thanks for your fast answers,

Iraia

nskvir commented 5 years ago

Unfortunately, visualizing a landscape in UCSC or another genome browser requires a track (such as a BED file) with specific coordinates - so if a repetitive element has multiple instances we would once again have the issue of not knowing which location a specific read aligns to. If you have the sample_fraction_counts.txt files and you know which repetitive elements are differentially expressed, you could use the annotation for each of the elements to visualize the location of each instance together (and potentially set the same expression value for all instances), but I'm not sure how informative this would be. BAM files can be used with genome browsers as well, so this would be a little more doable with regard to what you're looking for... however, the BAM file that is produced by RepEnrich2 lists only uniquely mapping reads, so you would likely be excluding a significant portion of the data from your visualization.
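A minimal sketch of the "same value for all instances" idea, assuming you already have the coordinates of each instance (from the RepeatMasker annotation) and one value per differentially expressed repeat, for example a log fold change from edgeR. It writes bedGraph lines, a track format that genome browsers such as UCSC accept. The function name `bedgraph_lines`, the track name and the example values are hypothetical.

```python
def bedgraph_lines(instances, values, track_name="DE_repeats"):
    """Yield bedGraph lines giving every instance of a repeat the same value.

    instances: iterable of (chrom, begin, end, repeat_name), with 1-based
               inclusive coordinates as in RepeatMasker output.
    values:    dict mapping repeat_name -> value to display (assumed input).
    """
    yield f'track type=bedGraph name="{track_name}"'
    rows = []
    for chrom, begin, end, repeat_name in instances:
        if repeat_name in values:
            # Convert 1-based inclusive to bedGraph's 0-based half-open intervals.
            rows.append((chrom, begin - 1, end, values[repeat_name]))
    # Sort by chromosome name and start position.
    for chrom, start, end, value in sorted(rows):
        yield f"{chrom}\t{start}\t{end}\t{value}"


# Made-up example: two instances of one DE repeat, plus a repeat that is not DE.
example_instances = [
    ("chr1", 9001, 9100, "L1Md_A"),
    ("chr2", 101, 200, "B1_Mm"),
    ("chr1", 3001, 3500, "L1Md_A"),
]
example_values = {"L1Md_A": 2.5}

track = list(bedgraph_lines(example_instances, example_values))
print("\n".join(track))
```

Every instance of a repeat gets the same value by construction, so the track shows where a differentially expressed family sits on the genome, not which copies are actually expressed. bedGraph intervals are also expected not to overlap, so any overlapping annotations would need to be resolved before loading the track.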

Best, Nick

iramai commented 5 years ago

Thanks again! I will try to go further with the bam files and see what happens.

I have another question (I think it is the last one :P) about the reference genomes. I ran your protocol some weeks ago and thought I had it very clear, but today I was reviewing my work and had a doubt about the reference genomes used. I think I have gotten a little confused. In the second step you run the setup for RepEnrich. This setup folder, or reference, is only used in the last step of the protocol, right? I mean when running RepEnrich.py, together with the RepeatMasker file as the annotation file. I got confused about the annotation file used when doing the mapping in Bowtie. The reference file used in Bowtie is the reference genome annotation of the studied organism, right? (in my case mouse mm10). Can you please clarify this last question? Thanks again!

Iraia

nskvir commented 5 years ago

Yes, you are correct in both statements - the setup folder is used by RepEnrich.py when running the actual analysis. Bowtie alignment should also be done normally using the annotation for your studied organism.

It's important to be sure that you ran the setup step using annotation from the same studied organism as well so that all the annotation matches up. If you need to run RepEnrich on data from a different organism you would need to perform the setup step and create a second setup folder for the new organism.

Best, Nick

iramai commented 5 years ago

Thanks!!!!!