nclark-lab / RERconverge

Analysis of convergence between organismal traits and DNA/protein sequences
GNU General Public License v3.0

How to optimize parameters based on my current results? #89

Open zhxh233 opened 5 months ago

zhxh233 commented 5 months ago

Hi there,

Following up on the questions raised in #88, and thanks to your kind help, I now have some results. I have also run into some new problems that I would like to discuss here.

  1. When I compute the RERs, the tutorial shows well-corrected results. What level of correction is considered good; is there an absolute quantitative threshold? Or is it better to simply pick the relatively best option after adjusting the three parameters (transform type, scaled or not, weighted or not)? Here is a comparison of the three transform types. [Image 1]
  2. RERconverge uses genome-scale datasets to compute the RERs. This is no problem for genes, but if my dataset contains both coding and non-coding regions (such as TFs, TFBSs, UCEs), would that affect the calculation, since the mutation rate of non-coding regions should differ from that of coding regions?
  3. In my test using a whole-genome CDS dataset, I got 200+ genes with a p-value well below 0.05, while their adjusted p-values are far above 0.05 (> 0.8, in fact). Do you think it would be acceptable for me to export the raw results and try different correction methods?
  4. Following on from question 2: since my dataset is quite large, do you have any suggestions for speeding up the computations? For instance, could I split the dataset randomly into appropriately sized subsets to enable parallel processing, while still keeping the analysis genome-scale?

Best wishes : ) Xiaohang

nclark-lab commented 5 months ago

Hello,

  1. We always recommend sqrt transform, weighted and scaled.
  2. In theory it's fine to mix regions because RERs and downstream analyses only concern variation in rate and not absolute rate. However, with respect to question 4, it would probably help to split the regions based on their type.
  3. It is fine to use whichever correction you wish. It is also worth noting that even if adjusted p-values don't strongly implicate single genes, the ranking of the genes is still informative and can contain important information based on which functions are enriched at high scores. For gene set enrichments, we usually split the top hits based on the sign of their Rho.
  4. Yes, split randomly or based on region type.

Best of luck

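
For readers following along, the suggestions about splitting, correction, and ranking by the sign of Rho can be sketched in a few lines of Python. This is only an illustration under assumptions, not RERconverge code: the helper names (`random_chunks`, `bh_adjust`, `split_by_sign`) are hypothetical, the input values are invented toy numbers, and Benjamini-Hochberg is used here as just one of the possible correction methods. The sketch also assumes that per-chunk results are merged before any correction is applied, so the adjustment reflects the total number of tests.

```python
import random


def random_chunks(names, n_chunks, seed=0):
    """Shuffle names reproducibly and deal them into n_chunks near-equal lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_by_sign(results):
    """Split (name, rho, p) tuples by the sign of rho, each list sorted by p."""
    positive = sorted((r for r in results if r[1] > 0), key=lambda r: r[2])
    negative = sorted((r for r in results if r[1] < 0), key=lambda r: r[2])
    return positive, negative


# Toy values, invented purely for illustration: (name, Rho, raw p-value).
merged = [("geneA", 0.41, 0.01), ("geneB", -0.35, 0.04),
          ("geneC", 0.22, 0.03), ("geneD", -0.50, 0.005)]

# Split element names into two random subsets that could be processed in parallel.
chunks = random_chunks([name for name, _, _ in merged], n_chunks=2)

# Correct once over the merged results, then rank separately by sign of Rho.
adjusted = bh_adjust([p for _, _, p in merged])
positive, negative = split_by_sign(merged)
```

The two ranked lists can then be passed separately to whatever enrichment tool is in use.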
zhxh233 commented 5 months ago


Thank you for your prompt reply! In fact, my question 2 mainly arose from the tutorial and the 2016 MBE paper, which say that relative rates quantify how much faster or slower this gene changed on a given branch after factoring out the divergence on that branch resulting from parameters affecting all genes (e.g., the time since speciation, effective population size, mutation rate). However, after receiving your response and re-reading the papers and tutorials, I realized that I had probably confused the mutation rate mentioned there with the higher mutation rates of non-coding regions that result from a lack of selection pressure. All in all, thank you very much for your response and suggestions. : )
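
The point that the analysis depends on variation in rate across branches, not on absolute rate, can be illustrated with a toy calculation. The sketch below is a deliberate simplification and not RERconverge's actual procedure (which, per the recommendation above, uses transformed, weighted, and scaled values): it assumes relative rates are plain least-squares residuals of one element's branch lengths regressed through the origin on the genome-wide average branch lengths, and all numbers and function names are invented for illustration. Under those assumptions, an element that evolves uniformly three times faster on every branch gives residuals that are three times larger, so its correlation with the trait is unchanged.

```python
import math


def residuals_through_origin(element, average):
    """Residuals of element branch lengths regressed on average branch lengths."""
    slope = (sum(e * a for e, a in zip(element, average))
             / sum(a * a for a in average))
    return [e - slope * a for e, a in zip(element, average)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented toy data: one value per branch of a five-branch tree.
average = [0.10, 0.20, 0.05, 0.30, 0.15]   # genome-wide average branch lengths
gene = [0.12, 0.18, 0.09, 0.28, 0.20]      # branch lengths for one element
trait = [0, 0, 1, 0, 1]                    # binary trait per branch

# The same element, but evolving three times faster on every branch.
fast_region = [3.0 * g for g in gene]

rho_gene = pearson(residuals_through_origin(gene, average), trait)
rho_fast = pearson(residuals_through_origin(fast_region, average), trait)
```

Here `rho_gene` and `rho_fast` agree up to floating-point error, because a uniform rate multiplier rescales the residuals without changing their pattern across branches.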