vlothec / TRASH


Information related to the output file #9

Open amit4mchiba opened 1 year ago

amit4mchiba commented 1 year ago

Hi,

I am writing here to seek your help in understanding the output files from the TRASH run.

For the Summary.of.repetitive.regions file, what do these terms refer to?

  1. ave.score -- Is this the measure of similarity percentage mentioned in the manuscript?

  2. most.freq.value.N -- Is this the length of the repeat unit proposed by TRASH?

  3. consensus.primary and consensus.secondary -- What is the difference between them? I think this is a frame shift, right? Does this mean that when we define a repeat, we should use consensus.primary?

  4. consensus.count -- Is this the number of repeats, or rather the number of times the consensus occurred?

I had another question about how to make use of TRASH results to identify new/novel HORs. I do understand that one could look for HORs using TRASH, but that is when you already have a monomer and want to see how this repeat is represented within the genome. I was wondering how I should choose the consensus repeat and then run the HOR analysis using it as a class through the "--seqt" option. One suggestion was: "plotting histograms of the trash repeats and try and identify what the main size classes are. Usually repeats within a discrete size class belong to the same family. You can subset those repeats, define a consensus then provide that to HOR analysis in trash".

I am not sure how to construct the consensus, as the repeats are distinct and there are no classes as such. I mean, in the summary table we have the consensus sequences, but these are distinct, and we also have the width information. So what should we plot here, the width?

I also wanted to ask whether there is some means by which we can identify repeats that are conserved and present across all the chromosomes. Running TRASH also gives us the simple plot, and in my case I can see clear regions where the repeats are enriched, probably suggesting the centromere region. Is it possible to use the outputs to identify, say, the top 10 repeats and which parts of the chromosomes they are localized to?

I am so sorry for asking so many questions and would appreciate your help and advice.

With best regards, Amit

vlothec commented 1 year ago

Hi, from the top:

  1. This score is a fraction of duplicated kmers in each n-kb window
  2. Yes, it is the measured periodicity of the region
  3. Yes, shifting the frame is one of the differences; there is also further refinement based on a second mapping. consensus.secondary is derived from the repeats that end up in the all.repeats* file.
  4. Yes, the number of repeats found during the run. I have to say that, due to the nature of some tandem arrays, I had to implement a method that splits a region when two distinct arrays are interspersed within it. Regions that result from such a split might have incorrect ave.score and consensus.count values because of that. Cleaning this up is on my TODO list.
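
For example, to get a quick look at those columns in R (just a sketch, not part of the TRASH documentation; replace the file name pattern with your actual Summary.of.repetitive.regions* csv):

summary.regions = read.csv("Summary.of.repetitive.regions*") # placeholder path, use the real file name
top.regions = summary.regions[order(summary.regions$consensus.count, decreasing = TRUE), ] # largest arrays first
head(top.regions[, c("most.freq.value.N", "ave.score", "consensus.count", "consensus.primary")], 10) # inspect the columns discussed above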

Re classifying repeats for HOR identification: technically, TRASH can use all repeats to identify HORs; the issue with this approach is that the multiple sequence alignment will not be very accurate, as many unrelated sequences will be aligned. Additionally, it will increase the analysis time exponentially. If you have trouble identifying the main families of repeats present in your species, you could, for example, use the table() function in R to summarise the unique sequences found in the all.repeats* file. Tandem arrays usually contain a large number of identical repeats, so you will find the common repeats at the top of that table. It will also be easier to handle the data this way.

repeats = read.csv("all.repeats*") # replace with the actual path to your all.repeats* csv file
unique.repeats = sort(table(repeats$seq), decreasing = TRUE) # count identical repeat sequences, most common first
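
To see the main size classes mentioned in the histogram suggestion you quoted, you could also plot the repeat widths. This is a rough sketch; it assumes the all.repeats* table has a width column (otherwise end - start works), and the example bounds are placeholders to be read off your own histogram:

hist(repeats$width, breaks = 200, xlab = "repeat width (bp)", main = "TRASH repeat widths") # discrete peaks correspond to repeat families
size.class = subset(repeats, width > 150 & width < 190) # example bounds only, pick them from your histogram peaks

A consensus defined from such a subset is what you would then provide to the HOR analysis, e.g. via the --seqt option you mentioned.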

You can also use a kmer-based approach to identify common repeats.

all.repeats.sequence = strsplit(paste0(repeats$seq, collapse = ""), split = "") # concatenate all repeat sequences into one string and split into a single character vector
repeats.kmers = kmer::kcount(x = all.repeats.sequence, k = 10) # identify and count 10-mers (requires the kmer package)
most.common.kmer = colnames(repeats.kmers)[which.max(repeats.kmers)] # identify the most common kmer
grep(most.common.kmer, repeats$seq, ignore.case = TRUE) # which repeats contain that kmer (case-insensitive in case the seq column is lower case)

This is just the beginning of the analysis; you would need to adjust it to your data and use more than just the single most common kmer, but it should give you an idea of what the most common repeats are.
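
For instance, extending the snippet above to the top 10 kmers instead of just one (again only a sketch, and the cutoff of 10 is arbitrary):

top.kmers = colnames(repeats.kmers)[order(repeats.kmers[1, ], decreasing = TRUE)[1:10]] # 10 most common 10-mers
matches = sapply(top.kmers, function(km) grepl(km, repeats$seq, ignore.case = TRUE)) # repeats x kmers logical matrix
common.repeats = repeats[rowSums(matches) > 0, ] # repeats that contain at least one of the top kmers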