Purpose of Certain Parameters?

YogiOnBioinformatics commented 2 years ago

Hello @guanjue,

I wanted to understand the purpose of certain parameters.

cap: I just set this to the highest number in my dataset. That seems a bit strange to do though since there may be some important usage of this parameter I don't understand.
train: Is it better or more accurate to set this at a higher number? Obviously, I would expect increasing this to increase computational time.
trainsz: What exactly are the units for it? What does the default parameter value 500000 mean?
norm: Is there any situation you think it's best to use standardized values? Any reason the default is not to standardize?
smooth: What does it mean to make "states more homogenous along the genome"?
burnin: Are higher numbers better overall? I realize this will increase computational time.

guanjue commented 2 years ago

Hi, Yogindra

For these parameters:

1. "cap" is the upper threshold of the signal. Any signal greater than the cap is set to the cap value. We usually set it to 16 because we convert our signal to -log10(p-value) based on a locally adjusted negative binomial model, so a cap of 16 corresponds to p-value = 1e-16.

2. A larger "train" number will output a more robust set of epigenetic states.

3. IDEAS randomly selects some 500000*bin-size regions (the default bin size is 200 bp) to initialize the model.

4. Because we already normalize our data before running IDEAS, we usually do not use IDEAS's internal "norm". One thing to be cautious about is that the scale function may inflate the noise of samples with weaker signals.

5. It means the signal is smoothed along the genome, so that state transitions between adjacent 200 bp bins along the genome will be less noisy.

6. I have only used the default value for this parameter, and I have not evaluated the effects of changing it.
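To make points 1, 3, and 4 concrete, here is a minimal Python sketch. It is not IDEAS code. The function names are hypothetical, the input values are made up, reading trainsz as a count of bins is an assumption based on the "500000*bin-size" wording, and plain z-scoring is only a stand-in for whatever the internal scale function actually does.

```python
import math
import statistics

CAP = 16.0  # value mentioned above; on the -log10 scale it matches p = 1e-16


def neg_log10_p(p_value):
    """Convert a p-value to the -log10(p-value) scale."""
    return -math.log10(p_value)


def apply_cap(signal, cap=CAP):
    """Set every value above `cap` to `cap`; smaller values are unchanged."""
    return [min(x, cap) for x in signal]


def training_span_bp(trainsz=500000, bin_size=200):
    """Base pairs covered IF trainsz counts bins (an assumption)."""
    return trainsz * bin_size


def zscore(values):
    """Plain z-score standardization (stand-in for the 'scale' function)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Point 1: values above the cap are flattened to the cap.
raw_signal = [0.5, 3.2, 16.0, 42.7]  # made-up -log10(p-value) signal
capped = apply_cap(raw_signal)

# Point 3: 500000 bins of 200 bp each, under the assumption above.
span = training_span_bp()

# Point 4: a weak, nearly flat sample is stretched to unit variance,
# so its small fluctuations end up as large as a strong sample's peaks.
weak_sample = [0.1, 0.2, 0.1, 0.3]  # made-up low-signal track
weak_scaled = zscore(weak_sample)
range_before = max(weak_sample) - min(weak_sample)
range_after = max(weak_scaled) - min(weak_scaled)

print(capped, span, range_before, range_after)
```

Under these assumptions the training regions would add up to 100,000,000 bp, and the weak sample's value range grows by roughly an order of magnitude after scaling, which is one way to picture the noise inflation described in point 4.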

Best wishes. Guanjue

YogiOnBioinformatics commented 2 years ago

Thanks so much for this! As always, I really do appreciate it.

Is there any reason why smooth isn't set to 1 by default?

As well, is there any place that explains what the following output files mean?

*.cluster
*.para
*.profile
*.state

YogiOnBioinformatics commented 2 years ago

Hello @guanjue,

Apologies, don't mean to bother. Just wanted to follow up on this. 🙂

guanjue commented 2 years ago

Sorry for the late reply. Here are some descriptions of these output files:

*.state: The epigenetic states and position classes for the genome across all input cell types. The first 4 columns are index, chr, position_st, and position_ed (position_ed will be the same as position_st if only one position per window is provided in the input). The next X columns are epigenetic states, where X = total number of cell types, including replicates. The last column is the position class label in IDEAS local clustering.

*.para: The first column is the frequency of the state. The next N columns are the sum and variance parameters for each epigenetic state.

*.cluster: Local cell type clustering result, one row for each cell type.

*.profile: This file can be ignored.
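The *.state column layout described above can be sketched as a small Python parser. This is only an illustration: the function name is hypothetical, the sample rows are made up, and whitespace-separated fields, integer state labels, and "#"-prefixed header lines are assumptions that should be checked against a real file.

```python
from collections import Counter


def parse_state_line(line):
    """Split one *.state row into the columns described above.

    Assumes whitespace-separated fields and integer state labels.
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected 4 position columns, >=1 state, 1 class")
    return {
        "index": fields[0],
        "chr": fields[1],
        "position_st": int(fields[2]),
        "position_ed": int(fields[3]),
        # One state per cell type (replicates included).
        "states": [int(s) for s in fields[4:-1]],
        # Last column: position class label from local clustering.
        "position_class": fields[-1],
    }


# Made-up rows for three cell types; not real IDEAS output.
sample = """\
1 chr1 0 200 3 3 7 2
2 chr1 200 400 3 5 7 2
3 chr1 400 600 0 0 0 1
"""

rows = [
    parse_state_line(line)
    for line in sample.splitlines()
    if line.strip() and not line.startswith("#")
]

# How often each state label occurs across all bins and cell types.
state_counts = Counter(s for row in rows for s in row["states"])
print(rows[0]["states"], state_counts)
```

Slicing `fields[4:-1]` keeps the parser independent of the number of cell types, since X is only known from the input.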

Best wishes. Guanjue

YogiOnBioinformatics commented 2 years ago

Thanks so much for the reply! Here's my understanding of how things work.

Once IDEAS is done running, I need to create the heatmap so I can understand which state number corresponds to which biological event. Is that correct?

As well, I had a question about this: "The last column is the position class label in IDEAS local clustering."

What exactly does that mean in layman's terms?

guanjue commented 2 years ago

That column shows the cell-type clusters of each 200 bp bin (IDEAS tries to group all 200 bp bins into clusters based on their epigenetic profiles across multiple cell types; details can be found in the IDEAS-2016 paper). For the purpose of generating the epigenetic state tracks, it can be ignored.

YogiOnBioinformatics commented 2 years ago

Closing this issue.

guanjue / IDEAS_2018

Purpose of Certain Parameters? #14