statisticalbiotechnology / diffacto


Run-to-run reproducibility issue #4

Closed markmipt closed 6 years ago

markmipt commented 6 years ago

First of all, thank you for developing such great software!

I've noticed that the results change slightly from run to run. For example, with my workflow applied to the iPRG2015 data, in half of the diffacto runs I get only 6 true positive protein identifications passing the p-value threshold with Bonferroni correction; in the other half, I get the 6 true positives plus 1 false positive protein. I do not change the input files or parameters for diffacto, I just start the script again. The MC simulation is turned off. Would it be possible (and correct?) to add a fixed random seed somewhere in the code, or as an optional parameter?
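For what it's worth, fixing the seed would make a stochastic step deterministic, assuming (without having checked the diffacto source) that the randomness comes from NumPy's RNG. A minimal, self-contained sketch with a toy stand-in for a stochastic step, not diffacto's actual code:

```python
import numpy as np

def noisy_em_step(data, seed=None):
    # Toy stand-in for a stochastic EM iteration: adds a small random
    # noise term to the matrix, mimicking where run-to-run variation
    # could enter. This is NOT diffacto's actual code, just a sketch.
    rng = np.random.RandomState(seed)
    return data + rng.normal(scale=1e-6, size=data.shape)

data = np.ones((3, 3))
a = noisy_em_step(data, seed=42)
b = noisy_em_step(data, seed=42)
print((a == b).all())  # fixed seed gives bit-identical results
```

Exposing the seed as a command-line parameter (defaulting to None for the current behavior) would keep existing workflows unchanged while allowing reproducible runs.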

I can share the input files and parameters, but the issue seems to be reproducible on any dataset.

P.S. Should proteins with a negative S/N ratio be excluded from the results? On the standard with known protein concentrations, it seems that the true positives always have a positive S/N ratio and the false positives have a negative one.

Regards, Mark

userbz commented 6 years ago

Dear Mark,

Thank you very much for the comment and your performance test. My guess is that the randomness comes from the EM step, when a minimum squared_noise is added to the matrix.

As described in the paper, "The signal-to-noise ratio (S/N) is then estimated for every group of peptides attributed to a single protein, to determine whether this group is informative, or too contradictory to reliably quantify." We applied a cutoff of -20 dB to exclude unreliable quantifications. Excluding all negative S/N might be too strict. However, this cutoff should be adjusted when changing the settings of mu or alpha. Simply plotting a histogram may help you find the boundary between informative and uninformative quantifications. Proteins with only two or three peptides are more likely to fail the S/N filter.
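To illustrate the -20 dB cutoff, here is a minimal sketch of filtering a protein table; the column names here are assumptions for illustration, not the actual diffacto output headers:

```python
import pandas as pd

# Toy protein table; real diffacto output column names may differ.
df = pd.DataFrame({
    "Protein": ["P1", "P2", "P3"],
    "S/N": [5.2, -25.0, -3.1],  # S/N in dB
})

# Keep only peptide groups above the -20 dB cutoff.
informative = df[df["S/N"] >= -20.0]
print(informative["Protein"].tolist())  # -> ['P1', 'P3']

# A histogram of the S/N column, e.g. df["S/N"].hist(bins=50),
# can help locate the boundary between informative and
# uninformative quantifications.
```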

I will look into this reproducibility issue and try to come up with a solution.

Thank you and best regards, Bo

vnaum commented 6 years ago

Not sure if it's the same issue or not, but for me, re-running the same command multiple times can change some values significantly; for example, AATC 17 4 3.530688328587969 can turn into AATC 17 14 -2.7282185275367112

That's straight from the "examples" folder, running these two commands:

python ../run_diffacto.py -i iPRG.novo.pep.csv -samples iPRG.samples.lst -out iPRG.denovo.protein1.txt -mc_out iPRG.denovo.protein.FDR -min_samples 4 -impute_threshold 0.9 -use_unique True -log2 False
python ../run_diffacto.py -i iPRG.novo.pep.csv -samples iPRG.samples.lst -out iPRG.denovo.protein2.txt -mc_out iPRG.denovo.protein.FDR -min_samples 4 -impute_threshold 0.9 -use_unique True -log2 False

That's what the vimdiff looks like: [screenshot: difference]
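One way to quantify that drift between two runs programmatically (a sketch; the table layout and column names are assumptions based on the example values above):

```python
import pandas as pd

# Toy stand-ins for two diffacto runs; in practice, read the real
# outputs iPRG.denovo.protein1.txt and iPRG.denovo.protein2.txt.
run1 = pd.DataFrame({"Protein": ["AATC", "OTHER"], "S/N": [3.5307, 1.0]})
run2 = pd.DataFrame({"Protein": ["AATC", "OTHER"], "S/N": [-2.7282, 1.0]})

# Join the two tables on the protein ID and compare values.
merged = run1.merge(run2, on="Protein", suffixes=("_run1", "_run2"))
merged["delta"] = (merged["S/N_run1"] - merged["S/N_run2"]).abs()
changed = merged[merged["delta"] > 1e-6]
print(changed["Protein"].tolist())  # -> ['AATC']
```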

userbz commented 6 years ago

Hi Vladislav, yes, there is an issue with reproducibility, presumably because of the EM steps. However, as mentioned in my reply to Mark, the S/N threshold is also important for controlling these unreliable results. As you know, the iPRG data have a common background proteome; in that sense, none of the proteins except the six spike-in markers should be significantly differential.

/Bo