hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Reproducibility of Segway results #154

Closed corinsexton closed 2 years ago

corinsexton commented 3 years ago

Hello,

I'm currently running Segway with 4 histone marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3). I've completed 3 parameter configurations with 2 replicates each (6 runs in total), with slightly different parameters as listed below:

run 1 and run 2: minibatch fraction (0.01), gaussian (1), 5 instances
run 3 and run 4: minibatch fraction (0.01), gaussian (3), 5 instances
run 5 and run 6: minibatch fraction (0.05), gaussian (3), 5 instances

The problem I'm encountering is that even between the technical replicates (exact same parameters) I'm having trouble getting any sort of reproducibility of results.

I know this heatmap is a lot to look at (all runs, showing the signal track distribution for each of the labels), but I expected better clustering of the labels between the runs, especially between the technical replicates. So my question is: is this level of variability just to be expected, or is there a way to better train the model for consistency of labelling?

[Image: all_just_runs, heatmap of the signal track distribution for each label across all runs]

EricR86 commented 2 years ago

The most important parameter for reproducibility in a technical replicate with Segway is SEGWAY_RAND_SEED. Set it to a value of your choosing; if it is kept the same across replicates, and all other parameters are indeed exactly the same, Segway will produce identical results every time. We consistently run regression tests on the same seed to ensure consistency.

Setting the seed effectively fixes the initial random parameters for the model and, in the minibatch case, picks the same "randomly" selected regions.
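For anyone scripting their runs, here is a minimal sketch of pinning the seed from Python. It assumes SEGWAY_RAND_SEED is read from the process environment. The helper names `seeded_env` and `run_with_seed` are hypothetical and not part of Segway, and `command` stands in for whatever Segway invocation you already use.

```python
import os
import subprocess


def seeded_env(seed, base_env=None):
    """Return a copy of the environment with SEGWAY_RAND_SEED set.

    Hypothetical helper: assumes Segway reads the seed from the
    SEGWAY_RAND_SEED environment variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SEGWAY_RAND_SEED"] = str(seed)  # environment values must be strings
    return env


def run_with_seed(command, seed):
    """Run `command` (a list of arguments) with the seed pinned.

    `command` is the Segway invocation you already use; no flags are
    assumed here. Raises CalledProcessError if the command fails.
    """
    return subprocess.run(command, env=seeded_env(seed), check=True)
```

Calling `run_with_seed` with the same seed and the same command for each technical replicate is what should make the replicates comparable. Using a different seed per replicate reproduces the situation described in the original question.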

As you've noted, since the parameters are randomly selected, even if the same Gaussian parameters are effectively learned across replicates, there's no guarantee on label ordering. You might be able to roughly match up equivalent labels if you really needed to, but setting the SEGWAY_RAND_SEED variable is definitely the way to go.
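If you do want to roughly match labels between two unseeded runs, one possible approach (an assumption on my part, not something Segway provides) is to compare the mean signal of each label across tracks and greedily pair the closest labels. The function name `match_labels` and the toy numbers below are made up purely for illustration. Greedy pairing is not guaranteed to find the best overall assignment.

```python
import math


def match_labels(means_a, means_b):
    """Greedily pair labels from run A with labels from run B.

    means_a, means_b: dict mapping label -> list of mean signal values,
    one per track, with tracks in the same order in both runs.
    Returns a dict mapping each run-A label to its matched run-B label.
    """
    # Every (distance, label_a, label_b) combination, closest first.
    pairs = sorted(
        (math.dist(va, vb), la, lb)
        for la, va in means_a.items()
        for lb, vb in means_b.items()
    )
    mapping, used_b = {}, set()
    for _, la, lb in pairs:
        # Accept a pair only if neither label has been matched yet.
        if la not in mapping and lb not in used_b:
            mapping[la] = lb
            used_b.add(lb)
    return mapping


# Toy, made-up per-track means (4 tracks) for 3 labels in each run.
run_a = {0: [0.1, 2.0, 1.5, 0.0], 1: [1.8, 0.2, 0.9, 0.1], 2: [0.0, 0.1, 0.0, 2.2]}
run_b = {0: [1.7, 0.3, 1.0, 0.0], 1: [0.1, 0.0, 0.1, 2.0], 2: [0.2, 1.9, 1.4, 0.1]}
print(match_labels(run_a, run_b))
```

This only relabels runs after the fact. It cannot make two runs agree if they converged to genuinely different models.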

If you're still experiencing the issue after setting the seed, however, please let me know!