sba1 / mgsa-bioc

Mirror of the mgsa Bioconductor package with full history in the master branch. Content of https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/mgsa in the bioconductor branch.
5 stars 5 forks source link

Default p prior values depend on input class which reduces reproducibility #10

Open gokceneraslan opened 7 years ago

gokceneraslan commented 7 years ago

As I also mentioned in https://github.com/sba1/mgsa-bioc/pull/9, default value of the p parameter has two different definitions.

If we call mgsa with

mgsa(c("A", "B"), list(set1 = LETTERS[1:3], set2 = LETTERS[2:4]))

then the default p parameter is seq(1 ,min(20,floor(length(sets)/3)), length.out=10)/length(sets), which results in

0.50000000 0.44444444 0.38888889 0.33333333 0.27777778 0.22222222 0.16666667 0.11111111 0.05555556 0.00000000

However if we call mgsa with same sets but using an MgsaSets object:

mgsa(c("A", "B"), new("MgsaSets", sets=list(set1 = LETTERS[1:3], set2 = LETTERS[2:4]))

then default p is seq(min(0.1, 1/length(sets)), min(0.3, 20/length(sets)), length.out=10) which corresponds to

0.1000000 0.1222222 0.1444444 0.1666667 0.1888889 0.2111111 0.2333333 0.2555556 0.2777778 0.3000000

Therefore, two runs of mgsa may produce different results depending on the class of sets parameter used (excluding the usual stochasticity of mgsa).

gokceneraslan commented 7 years ago

(you need to apply https://github.com/sba1/mgsa-bioc/pull/9 fix in order to be able to reproduce this)

sba1 commented 7 years ago

Am 2017-04-03 17:57, schrieb Gökçen Eraslan:

As I also mentioned in #9 [1], default value of the p parameter has two different definitions.

If we call mgsa with

mgsa(c("A", "B"), list(set1 = LETTERS[1:3], set2 = LETTERS[2:4]))

then the default p parameter is seq(1 ,min(20,floor(length(sets)/3)), length.out=10)/length(sets), which results in

0.50000000 0.44444444 0.38888889 0.33333333 0.27777778 0.22222222 0.16666667 0.11111111 0.05555556 0.00000000

However if we call mgsa with same sets but using an MgsaSets object:

mgsa(c("A", "B"), new("MgsaSets", sets=list(set1 = LETTERS[1:3], set2 = LETTERS[2:4]))

then default p is seq(min(0.1, 1/length(sets)), min(0.3, 20/length(sets)), length.out=10) which corresponds to

0.1000000 0.1222222 0.1444444 0.1666667 0.1888889 0.2111111 0.2333333 0.2555556 0.2777778 0.3000000

Therefore, two runs of mgsa may produce different results depending on the class of sets parameter used (excluding the usual stochasticity of mgsa).

Yes, the default should be the same so this should be changed. Patches are welcome! :) For the benchmarks in the original paper we used

1/length(sets) ... 20/length(sets)

with length(sets) being in the thousands. I guess the first variant will try to accomplish this (note that this looks weird in the example as the number of sets is unrealistically small).

Note that I always planned to add a variant to this code in which these parameters are integrated out using analytical means.

gokceneraslan commented 7 years ago

Thanks for the fast response. I can send a PR for having only one default (namely seq(1,min(20,floor(length(sets)/3)), length.out=10)/length(sets)), no problem.