htm-community / htm.core

Actively developed Hierarchical Temporal Memory (HTM) community fork (continuation) of NuPIC. Implementation for C++ and Python
http://numenta.org
GNU Affero General Public License v3.0
147 stars 74 forks source link

Deterministic Algorithms #194

Open ctrl-z-9000-times opened 5 years ago

ctrl-z-9000-times commented 5 years ago

NuPIC should be deterministic, and should yield precisely the same results across all supported platforms. When changes to the source code cause the output to change in any way, we should know about it.

Currently the unit tests do not systemically check this. The solution to this problem should contain hard coded results which include details about the inner workings of the algorithms. The solution should also be easy to update when NuPIC's output is intentionally changed.

EDIT: Progress:

I'd like to add 3 more of these tests.

ctrl-z-9000-times commented 5 years ago

SpatialPooler test started in PR #164

ctrl-z-9000-times commented 5 years ago

I'd like to add 3 more of these tests.

dkeeney commented 5 years ago

I totally agree with this. Actually we could use better unit test coverage as well on nearly all modules. I think is it not too hard to break something and our unit tests do not tell us.

breznak commented 5 years ago

For TM, sdr classifier, and scalar encoder. Id like to make one big test for test coverage of all of these components.

can this be part of the Hotgym, MNIST tests? As those use all of the said components, so it's just an issue of recording "gold standard" at some iteration and comparing that.

I totally agree with this. Actually we could use better unit test coverage as well on nearly all modules. I think is it not too hard to break something and our unit tests do not tell us.

Better unittests, def yes. But careful with the gold standard, there will be PRs that legitimely break the "gold" (improved code, even new param, ...). But it'll be nice to add a flag "breaks_gold" and have all of these recorded thanks to the checks.

breznak commented 5 years ago

TM output test failing on (and only) Windows CI, see https://github.com/htm-community/nupic.cpp/pull/397#issuecomment-485846951

breznak commented 5 years ago

Btw, @dkeeney , when you get back, do you have idea why this would be failing on Windows? https://github.com/htm-community/nupic.cpp/pull/397#issuecomment-485846951

dkeeney commented 5 years ago

@breznak I will be out again for this weekend but I have not forgotten that I need to check into this problem.

dkeeney commented 5 years ago

"Deterministic output of TM failed!"

The test is looking for a TM output of {26, 75} and it is getting {26, 75, 525} on Windows. Debug mode and Release mode on Windows both produce the same output if given the same iteration count.

I displayed the last 100 outputs using Linux (Ubuntu) and Windows (MSVC). When you put them in separate files and diff them, there is a significant difference in the outputs. I don't know what is causing it yet but I suspect there is some difference in floating point bit resolution or how something is truncated. Hard to tell at this point.

I need to reduce the test size to isolate the point where the two results diverge.

============================================= Windows: Last 100 outputs:

e=4899 out: SDR( 2048 ) 102, 701, 1044, 1116, 1453, 1469 e=4900 out: SDR( 2048 ) 668, 701, 1044, 1116 e=4901 out: SDR( 2048 ) 102, 668, 1458 e=4902 out: SDR( 2048 ) 450, 1453, 1469 e=4903 out: SDR( 2048 ) 450, 1044, 1128, 1447, 1453, 1469 e=4904 out: SDR( 2048 ) 450, 1453 e=4905 out: SDR( 2048 ) 254, 450, 728, 1044, 1447, 1453, 1458, 1469, 1477, 1870 e=4906 out: SDR( 2048 ) 450, 1447, 1453 e=4907 out: SDR( 2048 ) 15, 254, 1069, 1447, 1870 e=4908 out: SDR( 2048 ) 15, 254, 1069 e=4909 out: SDR( 2048 ) 15, 254, 1069, 1447, 1477 e=4910 out: SDR( 2048 ) e=4911 out: SDR( 2048 ) 15, 1710, 1720, 2039 e=4912 out: SDR( 2048 ) 595, 1720, 1953 e=4913 out: SDR( 2048 ) 15, 1199, 1710, 1720, 2001, 2039 e=4914 out: SDR( 2048 ) 1635, 1953 e=4915 out: SDR( 2048 ) 327, 421, 1199, 1935, 2001 e=4916 out: SDR( 2048 ) 321, 456 e=4917 out: SDR( 2048 ) e=4918 out: SDR( 2048 ) 418, 490, 1069 e=4919 out: SDR( 2048 ) 418, 490, 1044, 1069, 1766 e=4920 out: SDR( 2048 ) e=4921 out: SDR( 2048 ) 39, 490, 1038, 1044, 1058, 1069 e=4922 out: SDR( 2048 ) 701 e=4923 out: SDR( 2048 ) e=4924 out: SDR( 2048 ) 194, 236, 701, 1710, 1724, 1940, 1960, 1980 e=4925 out: SDR( 2048 ) 1960 e=4926 out: SDR( 2048 ) 236, 701, 1710, 1940, 1960 e=4927 out: SDR( 2048 ) 1960 e=4928 out: SDR( 2048 ) 913, 923 e=4929 out: SDR( 2048 ) 335, 532, 795, 923, 926, 1239, 1670 e=4930 out: SDR( 2048 ) e=4931 out: SDR( 2048 ) 335, 532, 795, 1239 e=4932 out: SDR( 2048 ) 1038, 1044, 1048, 1064 e=4933 out: SDR( 2048 ) e=4934 out: SDR( 2048 ) 1038, 1044, 1048, 1064 e=4935 out: SDR( 2048 ) e=4936 out: SDR( 2048 ) 912, 1038, 1044, 1048, 1064 e=4937 out: SDR( 2048 ) 912, 1038 e=4938 out: SDR( 2048 ) e=4939 out: SDR( 2048 ) 1038, 1044, 1048, 1670, 1805, 1812, 1813, 1815, 1922 e=4940 out: SDR( 2048 ) 701 e=4941 out: SDR( 2048 ) e=4942 out: SDR( 2048 ) 194, 216, 1980 e=4943 out: SDR( 2048 ) 174, 561, 1980 e=4944 out: SDR( 2048 ) 1772, 1773, 2026 e=4945 out: SDR( 2048 ) 1772, 1773, 2026 e=4946 out: SDR( 2048 ) 7, 13, 1315, 1796 e=4947 out: SDR( 2048 ) 7, 13, 15, 17, 72, 1252, 1259, 1315, 1796, 1877 e=4948 out: SDR( 2048 ) 13, 17, 39, 1252, 1315, 1877 e=4949 out: SDR( 2048 ) e=4950 out: SDR( 2048 ) e=4951 out: SDR( 2048 ) e=4952 out: SDR( 2048 ) 160, 790, 859, 1867 e=4953 out: SDR( 2048 ) 160, 859, 982 e=4954 out: SDR( 2048 ) e=4955 out: SDR( 2048 ) 982, 1700, 1738, 1856 e=4956 out: SDR( 2048 ) 1700 e=4957 out: SDR( 2048 ) 961, 1700, 1738 e=4958 out: SDR( 2048 ) 230, 1107, 1685 e=4959 out: SDR( 2048 ) 124, 1107, 1946, 1976 e=4960 out: SDR( 2048 ) e=4961 out: SDR( 2048 ) 468 e=4962 out: SDR( 2048 ) 468, 923, 925, 1199, 1442, 1647, 1648, 1670 e=4963 out: SDR( 2048 ) e=4964 out: SDR( 2048 ) 879, 923, 1199, 1418, 1442 e=4965 out: SDR( 2048 ) 879, 1418, 1442 e=4966 out: SDR( 2048 ) e=4967 out: SDR( 2048 ) 348, 370, 1172, 1201, 1232, 1418 e=4968 out: SDR( 2048 ) e=4969 out: SDR( 2048 ) e=4970 out: SDR( 2048 ) 361, 733, 1257 e=4971 out: SDR( 2048 ) 361 e=4972 out: SDR( 2048 ) e=4973 out: SDR( 2048 ) e=4974 out: SDR( 2048 ) e=4975 out: SDR( 2048 ) 532, 560, 1471, 1486, 1786 e=4976 out: SDR( 2048 ) e=4977 out: SDR( 2048 ) 532, 560, 1366, 1486, 1786 e=4978 out: SDR( 2048 ) e=4979 out: SDR( 2048 ) 1085, 1366, 1591 e=4980 out: SDR( 2048 ) e=4981 out: SDR( 2048 ) e=4982 out: SDR( 2048 ) 1619, 1678 e=4983 out: SDR( 2048 ) e=4984 out: SDR( 2048 ) 963, 967, 1619, 2018 e=4985 out: SDR( 2048 ) 148, 167, 176 e=4986 out: SDR( 2048 ) e=4987 out: SDR( 2048 ) 144, 148, 167, 168, 174, 176, 197, 913, 1795 e=4988 out: SDR( 2048 ) e=4989 out: SDR( 2048 ) 1001 e=4990 out: SDR( 2048 ) 980 e=4991 out: SDR( 2048 ) 975, 980 e=4992 out: SDR( 2048 ) 975, 1813, 1815 e=4993 out: SDR( 2048 ) 975, 980 e=4994 out: SDR( 2048 ) e=4995 out: SDR( 2048 ) e=4996 out: SDR( 2048 ) e=4997 out: SDR( 2048 ) 75, 128 e=4998 out: SDR( 2048 ) 75, 128, 912 e=4999 out: SDR( 2048 ) 26, 75, 525

============================================= Linux: Last 200 outputs:

e=4899 out: SDR( 2048 ) 102, 450, 1044, 1116, 1453, 1458, 1493, 1824 e=4900 out: SDR( 2048 ) e=4901 out: SDR( 2048 ) 102, 668, 701, 1044, 1116, 1453, 1469, 1824 e=4902 out: SDR( 2048 ) 1453 e=4903 out: SDR( 2048 ) 450, 1044, 1128, 1453, 1469 e=4904 out: SDR( 2048 ) 1453, 1469 e=4905 out: SDR( 2048 ) 254, 450, 728, 1044, 1116, 1447, 1453, 1458, 1469, 1477, 1493, 1870 e=4906 out: SDR( 2048 ) e=4907 out: SDR( 2048 ) 254, 1069, 1447, 1870 e=4908 out: SDR( 2048 ) 15, 254, 1069, 1447 e=4909 out: SDR( 2048 ) 254, 1069, 1447 e=4910 out: SDR( 2048 ) 1710, 1720 e=4911 out: SDR( 2048 ) e=4912 out: SDR( 2048 ) 15, 1710, 1720, 2013, 2039 e=4913 out: SDR( 2048 ) 1199, 2001 e=4914 out: SDR( 2048 ) 327, 1635, 1953 e=4915 out: SDR( 2048 ) e=4916 out: SDR( 2048 ) 321, 327 e=4917 out: SDR( 2048 ) e=4918 out: SDR( 2048 ) 418, 490, 1058, 1069 e=4919 out: SDR( 2048 ) 1766 e=4920 out: SDR( 2048 ) 39, 418, 490, 1049, 1058, 1068, 1069 e=4921 out: SDR( 2048 ) 1038 e=4922 out: SDR( 2048 ) 701 e=4923 out: SDR( 2048 ) 1710 e=4924 out: SDR( 2048 ) 194, 236, 701, 1710, 1724, 1940 e=4925 out: SDR( 2048 ) 1751, 1946, 1960 e=4926 out: SDR( 2048 ) 701, 1710, 1940, 1960 e=4927 out: SDR( 2048 ) 926 e=4928 out: SDR( 2048 ) 913, 923 e=4929 out: SDR( 2048 ) 335, 532, 795, 923, 926, 1239, 1670 e=4930 out: SDR( 2048 ) e=4931 out: SDR( 2048 ) 335, 532, 795, 921, 1239 e=4932 out: SDR( 2048 ) 1038, 1048, 1064 e=4933 out: SDR( 2048 ) e=4934 out: SDR( 2048 ) 1038, 1048, 1064, 1068, 1745 e=4935 out: SDR( 2048 ) e=4936 out: SDR( 2048 ) 1038, 1044, 1048 e=4937 out: SDR( 2048 ) 912, 1044 e=4938 out: SDR( 2048 ) e=4939 out: SDR( 2048 ) 1038, 1044, 1048, 1670, 1796, 1805, 1812, 1813, 1922 e=4940 out: SDR( 2048 ) 701 e=4941 out: SDR( 2048 ) e=4942 out: SDR( 2048 ) 194, 216, 1980 e=4943 out: SDR( 2048 ) 561, 1980 e=4944 out: SDR( 2048 ) 1772, 1773, 1796, 2026 e=4945 out: SDR( 2048 ) 1796, 2026 e=4946 out: SDR( 2048 ) 7, 13, 17, 1315, 1796 e=4947 out: SDR( 2048 ) 7, 13, 15, 17, 72, 1252, 1259, 1315, 1796, 1877 e=4948 out: SDR( 2048 ) 13, 17, 39, 1252, 1877 e=4949 out: SDR( 2048 ) e=4950 out: SDR( 2048 ) e=4951 out: SDR( 2048 ) e=4952 out: SDR( 2048 ) 160, 859, 1867 e=4953 out: SDR( 2048 ) 160, 859, 982, 1867 e=4954 out: SDR( 2048 ) e=4955 out: SDR( 2048 ) 982, 1700, 1738 e=4956 out: SDR( 2048 ) 961, 1700 e=4957 out: SDR( 2048 ) 961 e=4958 out: SDR( 2048 ) 230, 1107, 1685 e=4959 out: SDR( 2048 ) 1107, 1976 e=4960 out: SDR( 2048 ) 124, 1976 e=4961 out: SDR( 2048 ) 1648 e=4962 out: SDR( 2048 ) 468, 923, 925, 1199, 1442, 1670 e=4963 out: SDR( 2048 ) e=4964 out: SDR( 2048 ) 879, 923, 1199, 1418, 1442, 1458 e=4965 out: SDR( 2048 ) 879, 1442, 1458 e=4966 out: SDR( 2048 ) e=4967 out: SDR( 2048 ) 348, 370, 1201, 1232 e=4968 out: SDR( 2048 ) e=4969 out: SDR( 2048 ) e=4970 out: SDR( 2048 ) 361, 733, 1257 e=4971 out: SDR( 2048 ) 361 e=4972 out: SDR( 2048 ) e=4973 out: SDR( 2048 ) 42 e=4974 out: SDR( 2048 ) e=4975 out: SDR( 2048 ) 532, 1467, 1471, 1486, 1786 e=4976 out: SDR( 2048 ) e=4977 out: SDR( 2048 ) 532, 560, 1366, 1486, 1786 e=4978 out: SDR( 2048 ) e=4979 out: SDR( 2048 ) 1085, 1366, 1591 e=4980 out: SDR( 2048 ) e=4981 out: SDR( 2048 ) e=4982 out: SDR( 2048 ) 1619, 1670, 1678, 2018 e=4983 out: SDR( 2048 ) e=4984 out: SDR( 2048 ) 963, 967, 1619, 1678, 2018 e=4985 out: SDR( 2048 ) 148, 167 e=4986 out: SDR( 2048 ) e=4987 out: SDR( 2048 ) 144, 148, 157, 164, 167, 168, 176, 197, 913, 1795 e=4988 out: SDR( 2048 ) e=4989 out: SDR( 2048 ) 1001 e=4990 out: SDR( 2048 ) 975, 980 e=4991 out: SDR( 2048 ) e=4992 out: SDR( 2048 ) 80, 975, 1805, 1813 e=4993 out: SDR( 2048 ) e=4994 out: SDR( 2048 ) e=4995 out: SDR( 2048 ) e=4996 out: SDR( 2048 ) 42 e=4997 out: SDR( 2048 ) 75, 128 e=4998 out: SDR( 2048 ) 75, 305, 912 e=4999 out: SDR( 2048 ) 26, 75

breznak commented 5 years ago

I don't know what is causing it yet but I suspect there is some difference in floating point bit resolution or how something is truncated. Hard to tell at this point.

yes, I'd be suspecting: Random, float rounding, tie-breakers in TM?

dkeeney commented 5 years ago

I confirmed that all SP outputs are the same. So it is someplace in TM. The TM outputs are not different until iteration 2424.

So now I have to go inside the algorithm and see what is different at iteration 2424.

dkeeney commented 5 years ago

I think I found the problem --- or at least a problem. TemporalMemory.cpp, line 284 in growSynapses( ) .

  const size_t overrun = (connections.numSynapses(segment) + nActual - maxSynapsesPerSegment);
  if (overrun > 0) {
    destroyMinPermanenceSynapses(connections, rng, segment, static_cast<Int>(overrun), prevWinnerCells);
  }

overrun is unsigned long so the subtraction assigns a negative number which is interpreted as greater than 0. destroyMinPermanencesSynapses( ) is always called. But because it is cast to int, inside the function it is a negative number and nothing actually gets deleted. But its doing a lot of work that it does not need to do.

I changed overrun to be type Int. That stopped the unnecessary calls to destroyMinPermanencesSynapses( ) but we still diverge in the same place.

During iteration 2423 the second call to growSynapses( ); the random numbers are still matching but the candidates returned have different values. See for loop at line 293.

breznak commented 5 years ago

@dkeeney

I changed overrun to be type Int. That stopped the unnecessary calls to destroyMinPermanencesSynapses( ) but we still diverge in the same place.

can you please submit a small PR with said fix?

dkeeney commented 5 years ago

I am still hunting this down. Windows and Linux deviate at the end of iteration 2422 While processing column 908, In ActivateCells( ) The Linux code calls BurstColumn( ) three times. Windows code calls BurstColumn( ) four times. That last time finds cell 7269 as the winner and it creates an extra segment.

I have not figured out what causes it....

dkeeney commented 5 years ago

Some progress. In SpatialPooler.cpp, line 875 in inhibitColumnsGlobal_( )

  // Sort the columns by the amount of overlap.  First make a list of all of the
  // column indexes.
  activeColumns.reserve(numColumns_);
  for(UInt i = 0; i < numColumns_; i++)
    activeColumns.push_back(i);
  // Compare the column indexes by their overlap.
  auto compare = [&overlaps_](const UInt &a, const UInt &b) -> bool
    {return overlaps_[a] > overlaps_[b];};
  // Do a partial sort to divide the winners from the losers.  This sort is
  // faster than a regular sort because it stops after it partitions the
  // elements about the Nth element, with all elements on their correct side of
  // the Nth element.
  std::nth_element(
    activeColumns.begin(),
    activeColumns.begin() + numDesired,
    activeColumns.end(),
    compare);
  // Remove the columns which lost the competition.
  activeColumns.resize(numDesired);
  // Finish sorting the winner columns by their overlap.
  std::sort(activeColumns.begin(), activeColumns.end(), compare);
  // Remove sub-threshold winners
  while( !activeColumns.empty() &&
         overlaps[activeColumns.back()] < stimulusThreshold_)
      activeColumns.pop_back();

The output of this section on EPOCH 2422, Linux: activeColumns out: 1038 1064 912 34 1095 1044 1449 40 1332 1051 Windows: activeColumns out: 1038 1064 912 34 1095 1044 1449 40 1332 908 Note the last element.

Just prior to this, the output of the nth_element( ) partial sort has the elements partitioned correctly but the exact order between Linux and Windows is slightly different. So when we truncate we just happen to get a different value for the 10th element. That difference propagates and eventually everything is different between Linux and Windows.

The Problem: So, I think the problem is that the compare function does not uniquely specify the order. Items that compare equal can be in any order.

The Fix:

  // Compare the column indexes by their overlap.
  auto compare = [&overlaps_](const UInt &a, const UInt &b) -> bool
    {return overlaps_[a] > overlaps_[b];};

becomes

  // Compare the column indexes by their overlap.
  auto compare = [&overlaps_](const UInt &a, const UInt &b) -> bool
    {return (overlaps_[a] == overlaps_[b]) ? a > b : overlaps_[a] > overlaps_[b];};
breznak commented 5 years ago

Big KUDOs @dkeeney ! This has been really a huge and lengthy investigation. Great job! :100:

breznak commented 5 years ago

If we add Classifier (+Predictor) to hotgym, would you consider this issue resolved? With hotgym being the single place to run tests of all (most) components and check that the outputs are deterministic.

ctrl-z-9000-times commented 5 years ago

If you add determinism tests for the predictor then yes, i would close this issue.

breznak commented 4 years ago

ARM platform on muslC is not "deterministic. TODO: