opencog / atomspace

The OpenCog (hyper-)graph database and graph rewriting system
https://wiki.opencog.org/w/AtomSpace

Should we use Atoms inside DistributionalValues #1884

Open rTreutlein opened 5 years ago

rTreutlein commented 5 years ago

I have been working on #833 (the values proposed there are now simply called DistributionalValues), and there is one big issue at the moment that needs to be decided: namely, whether we should use Atoms to define the intervals of a DV.

Why would we want this? Suppose we have a distribution that describes the heights of a population of 10,000 people, like this:

ExecutionOutput <{[0cm,100cm):2000,
                  [100cm,160cm):3000,
                  [160cm,240cm):5000}>
  Schema "in-cm"
  Concept "height"

one way to define an interval, e.g. [0cm,100cm), would be to use Atoms like this:

QuantityLink
  LeftClosedRightOpenIntervalLink
    NumberNode 0
    NumberNode 100
  UnitNode "cm"

Note: QuantityLink and UnitNode are currently undefined types and used for expository purposes.

But if we try doing it like this, we run into a problem: we can only create DVs as long as we have a reference to the AtomSpace. Creating Atoms that are not in any AtomSpace is problematic. First of all, it just seems wrong, and second, there is code in the guile interface that explicitly prohibits passing Atoms that are not in any AtomSpace. So we need a reference to the AtomSpace, and unfortunately not all classes that might want to create DVs have one. And personally, that makes sense to me: you should be able to create Values of whatever kind without needing an AtomSpace.

The alternative is to store all the data directly in the DV and just add extra fields for the unit and maybe the range. This should also be a more efficient way to store the data. On the downside, this solution might be less flexible, though at the moment I can't come up with an example that is only possible when using Atoms for the intervals.

Overall I am leaning towards the second solution. @ngeiswei, on the other hand, would prefer his original solution.

linas commented 5 years ago

Several knee-jerk comments:

1) Indeed, atoms should not be used outside of atomspaces; it's deeply problematic when they are. But then, hey, Values were explicitly designed to be "atoms that do not live in an atomspace".

2) Do you really, really need to store the units so closely to the pairs? Do you really have to tag each pair of numbers with the unit? Wouldn't it be better to just tag the marginal? That is, in your example, each row is tagged with "cm-height", and each column is tagged with "people-count".

3) Rather than inventing an explicit, low-level, hard-coded representation, maybe it would be better to implement an API to access the values? So, object-oriented design principles: "here is the method that gives me the left side of the interval", "here is the method that gives me the right side of the interval", "here is the method that gives me the count in that interval", "here is the method that returns the units used for the intervals", etc. Then, once you've defined this API, the user can put their data wherever they want to, wherever it is convenient for them. Meanwhile, you, the tool developer, always know where to find the left and right ends of the interval, the count, and the units, no matter what. And you write tools assuming those methods return the correct values. Meanwhile, the user promises to return correct values.

For example, where you wrote:

 <{[0cm,100cm):2000,
   [100cm,160cm):3000,
   [160cm,240cm):5000}>

Maybe those numbers are not yet known at run-time. So, your tool calls the OO API, and says, "give me the count for the first interval", and that method goes off and performs some neural-net calculation (or looks it up on wikipedia), and returns "2000". Now you call the method to say "hey, give me the count for the second interval", and it eventually returns "3000", and then "5000".

This allows the user to store those numbers wherever they feel like (maybe they're in some triplestore); this allows those numbers to be computed at run-time. However, your tool, that actually does stuff with those values, that tool will work, no matter where those values are stored.

As mentioned elsewhere, this is exactly how Gnu R works; we too can meet that standard. It's also how the "matrix" code works.
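
To make the OO idea concrete, here is a minimal closure-based sketch in Guile. The constructor and method names are made up for illustration; this is not an existing AtomSpace API, just the shape of the thing:

; Hypothetical accessor object: the tool only ever calls these four
; methods; the user decides how each one is actually implemented.
(define (make-interval-source left-fn right-fn count-fn unit-fn)
  (lambda (method . args)
    (case method
      ((get-left)  (apply left-fn args))
      ((get-right) (apply right-fn args))
      ((get-count) (apply count-fn args))
      ((get-unit)  (apply unit-fn args))
      (else (error "unknown method" method)))))

; A toy backing store: a plain list.  A real implementation might query
; the atomspace, a triplestore, or run a neural net at call time.
(define height-data '((0 100 2000) (100 160 3000) (160 240 5000)))
(define height-source
  (make-interval-source
    (lambda (i) (car   (list-ref height-data i)))
    (lambda (i) (cadr  (list-ref height-data i)))
    (lambda (i) (caddr (list-ref height-data i)))
    (lambda (i) "cm")))

(display (height-source 'get-count 0)) (newline)   ; => 2000

The tool code never cares where the 2000 came from; swapping the backing store does not change the tool.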

In case it's not clear: Your example has rows and columns. It's not just a pure diagonal. Let me make this clear by altering it slightly:

 <{[10cm,120cm):2000,
   [80cm,180cm):3000,
   [140cm,240cm):5000}>

which is what a real-world sieve would generate (e.g. sorting gravel, sorting salmon/crabs, sorting eggs, etc. -- stuff that actually gets sorted in real-life factories). That's because real-world sieves cannot precisely guarantee the lower and upper bounds.

linas commented 5 years ago

One more highly theoretical comment, about whether the values in the example need to be non-overlapping:

<{[10cm,120cm):2000,
  [80cm,180cm):3000,
  [140cm,240cm):5000}>

The axioms of probability theory are identical to the axioms of measure theory: the two theories are the same theory, just using different nomenclature. Understanding measure theory is nice, because probability theory can be confusing sometimes, so it's useful to see how things work in terms of measures. Anyway ...

Measure theory is built on the foundations of topology (more precisely, on sigma-algebras). A topology (sigma-algebra) is a collection of open sets (clopen sets) for which intersection and union are always defined (sigma-algebras also define/allow set-complement, but are otherwise "the same thing").

A measure (probability measure) or distribution is just a (non-negative) real number associated with each open/clopen set, having a few basic properties under set intersection and set union (which I won't list here, but they are the "obvious" ones, like the measure of intersected sets is always smaller; the measure of a union of disjoint sets is always a sum, etc. ).
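
(For reference, the "obvious" properties, stated in their usual textbook form -- nothing here is specific to the atomspace:

\mu(\emptyset) = 0, \qquad
A \subseteq B \;\Rightarrow\; \mu(A) \le \mu(B), \qquad
\mu\Bigl(\bigcup_i A_i\Bigr) = \sum_i \mu(A_i) \ \text{for pairwise-disjoint } A_i,

and a probability measure additionally satisfies \mu(\Omega) = 1 for the whole space \Omega.)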

Absolutely NOTHING in the definition of a topology or a sigma algebra states that the sets have to be non-intersecting. In fact, you do not even need to list all of the sets: it is enough to specify the "base of the topology", or even less: a "sub-base" is enough, because it will "generate" the entire topology. Nothing says that the elements of the sub-base have to be non-intersecting, disjoint.

Also, nothing states that things have to be one-dimensional. There are 2D distributions as well:

<{[10cm,20cm)x(10cm,20cm):2000,
  [10cm,20cm)x(50cm,80cm):3000,
  [140cm,240cm)x(50cm,80cm):5000}>

presumably describes some rectangular tiles of some 2D surface, and the number of items in each tile.

Of course, tiles do not have to be rectangular... the general structure for your distribution is

<{ (some-set, some-numeric-value),
    (another-set, another-numeric-value),
   ...
  }>

which should now be recognizable as a vector. This vector is a probability when the sets can be intersected and unioned, and the measure of the union of all of them is 1.0.

The object-oriented API that I am suggesting has two basic methods, one to get the set, one to get the numeric value. As long as the user implements these two methods, the rest of your GDTV code-base can do whatever it is that it needs to do. The "set" will just be some atom, whatever the user wants, and the numeric value will just be a single float.

Based on practical experience with the matrix code, it's useful to provide marginals (so, e.g., if the number is a count, then total all the counts and divide by the total to get the frequency p. Next, log p is useful. And so is p log p, etc., which is why there's a lot of stuff there.)
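
To make that bookkeeping concrete, here is the arithmetic in plain Guile -- just the count-to-frequency-to-entropy step, not the actual matrix API or its method names:

; Sketch of the marginal bookkeeping: counts -> frequencies -> entropy.
(define counts '(2000 3000 5000))
(define total (apply + counts))
(define freqs (map (lambda (n) (/ n total)) counts))      ; the p's
(define (log2 x) (/ (log x) (log 2)))
(define entropy                                           ; - sum_i p_i log2 p_i
  (- (apply + (map (lambda (p) (* p (log2 p))) freqs))))

(display freqs) (newline)      ; => (1/5 3/10 1/2)
(display entropy) (newline)    ; => about 1.485 bits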

Based on practical experience, I have only partly fiddled with set-intersection and set-union. For ordinary measure-theory/probability-theory, intersection and union are "almost trivial", but for general topoi they are not, i.e. when "some numeric value" is replaced by "some function". In this more general case, probability theory is insufficient, and the more general axioms of algebraic topology are required. These axioms are more complex, but they are required to describe language/biology/nature (contrary to the delusions of MIRI and AIXI-believers, who think that Bayesian probability is enough for AGI. They're just deluded and wrong. Less wrong maybe, but still wrong.)

Anyway, there is a way of mapping algebraic topology into the atomspace. This is explained in the wildly-misunderstood "sheaves" paper. The sheaves are really an attempt to not only do GDTV's, but to show how GDTV's "stitch together" into sections in a consistent way, viz. obeying the axioms of algebraic topology (with the axioms of probability theory being a "trivial" special case). That is why the "sheaves" directory exists, and why the "matrix" directory is what it is. The "matrix" is a single slice through the network -- it's kind of like the linearized version, or the tangent-space version; the sheaves stitch together the tangent spaces. (The sheaves do the set-union and set-intersection for you, in the atomspace.)

I'm hoping that this gives a better insight into what is going on, in those directories, and why I'm saying "GDTV's have already been implemented, just not the way you are thinking about them". Of course, there's a lot more work to do; the current infrastructure is incomplete.

ngeiswei commented 5 years ago

The "set" will just be some atom, whatever the user wants, and the numeric value will just be a single float.

Yes, that's the plan, intervals are just a specialized case of that.

Absolutely NOTHING in the definition of a topology or a sigma algebra states that the sets have to be non-intersecting.

True, but defining a measure over a partition automatically defines it over the whole sigma-algebra; the converse isn't true. That said, it'd be cool indeed if the user wasn't required to provide non-intersecting sets, and maybe the gaps can be filled via independence assumptions or whatever the available knowledge of the domain provides. Yeah, I can see why we would want that, good point!

In this more general case, probability theory is insufficient; and the more general axioms of algebraic topology are required. These axioms are more complex, but they are required to describe language/biology/nature (contrary to the delusions of MIRI and AIXI-beleivers, who think that Bayesian probability is enough for AGI. They're just deluded and wrong. Less wrong maybe, but still wrong.)

Wow, I'm not sure I'm convinced, though I'd love to be. I would assume that probability theory is expressive enough, that is, you can always encode whatever weird prior or weird rule of inference in the measure (I think Solomonoff Induction clearly shows that), but maybe that's not as elegant or efficient as it could be. I kinda like the idea of going more abstract, I'm just not sure where to go from there.

ngeiswei commented 5 years ago

I kinda like the idea of going more abstract, I'm just not sure where to go from there.

I guess having a deeper look at sheaves could be a start. I actually tried, but soon enough I lost interest because I didn't see the end game.

linas commented 5 years ago

it'd be cool indeed if the user wasn't required to provide non-intersecting sets

?? How is that even possible? If I consider a (natural language) "word" to be a "set of multiple meanings", then, in general, given two words, there's some chance they overlap. There's no algorithm for selecting words in such a way that they don't overlap. Generically, in AI/AGI, everything overlaps/intersects, and it's impossible to provide a universe of non-intersecting sets.

Nil will recall "feature selection" in MOSES -- one of the goals of feature selection was to find some "non-intersecting", or "linearly independent", features that more-or-less covered the entire "universe". (The other goal was to find the smallest such set that covered the greatest territory.) It's hard, it's mostly shots in the dark. (The goal of sheaves is to make it less dark, by articulating how they overlap/connect.)

In the world of "linear things", the concept of matroid seems like a good way to talk about the overlapping-ness of things... https://en.wikipedia.org/wiki/Matroid

linas commented 5 years ago

it'd be cool indeed if the user wasn't required to provide non-intersecting sets

Hmm. I suppose you could argue that "one can collect observation counts for words, and then total that up, and divide each count by the total to get a probability measure for observing that particular word." And since words are discrete, that is then a valid probability measure on a set of discrete words. And that almost works, if you know nothing about linguistics. In real life, words have a Zipfian distribution: it's impossible to make a list of all possible words. No matter how long you observe, you will always have a finite chance of observing a new word you've never seen before (and worse -- that chance is, in a certain sense, the "maximally largest" chance possible of doing so). So it's impossible to even delimit the "entire universe".

Anyway, these are all theoretical objections. What is the actual goal of GDTV's? The original proposal #833 suggests that PLN will somehow make use of these. I guess PLN has a set of formulas for mashing together distributions in some way, during inference?

bgoertzel commented 5 years ago

I guess PLN has a set of formulas for mashing together distributions in some way, during inference?

Indeed it does ;)

rTreutlein commented 5 years ago

@linas I am working on understanding your method. The matrix library seems quite useful, but I can't quite figure out how to use it as a GDTV. Specifically, I am trying to implement the Rain / Sprinkler / Grass Wet example from https://en.wikipedia.org/wiki/Bayesian_network. I already have a working version that uses my DistributionalValue code. Could you give me some initial pointers on how you would try to do this?

linas commented 5 years ago

Rain / Sprinkler / Grass Wet

Hmm. Implement, with PLN? I'm not sure how to answer, because the matrix code was never intended for PLN (I'm using sheaves for inference, not PLN). The code in the matrix directory does two things: (a) it allows you to define some subset of the atomspace as being "linear" or "probabilistic"; (b) it provides a set of statistics-textbook utilities on that subset. One can make a strong case that part (b) should be thrown away and replaced by SciPy or R. So the interesting part, for you, is part (a): how to provide an API onto a subset of the atomspace, such that the subset looks like a vector (or matrix, or histogram).

A different issue is that the rain-sprinkler-wet example is non-sparse, and so the atomspace is the wrong technology for solving it. The atomspace is designed for sparse data, and for knowledge representation; that example has no "knowledge" in it, and it's not sparse. So using SciPy or R is a much better technology choice for solving the rain-sprinkler example. I'm sort of confused about why you'd even want to do that in the atomspace to begin with. But OK, onwards through the fog.

Say you have this:

(define rainy (Concept "is-raining"))
(define sunny (Concept "nice-sunny"))
(define sprinkle (Concept "The corn is wilting"))
(define conserve (Concept "Water costs money"))

; Initial data -- a non-sparse matrix.
(define pred (Predicate "rain-sprinkle-correlation"))
(Evaluation pred (List sunny sprinkle) (stv 40 1))
(Evaluation pred (List sunny conserve) (stv 60 1))
(Evaluation pred (List rainy sprinkle) (stv 1  1))
(Evaluation pred (List rainy conserve) (stv 99 1))

To slap the matrix API on this data, you'd cut-n-paste from the file "object-api.scm", replace (Predicate "foo") with (Predicate "rain-sprinkle-correlation"), replace WordNode with ConceptNode, and then define where the observation counts are located: (define (get-count PAIR) (cog-stv-strength PAIR)). Somewhat confusingly, the PAIR here ends up being the EvaluationLink, and since I stored the observation counts in the strength field of the EvaluationLink, that's where they have to be fetched from. That's it; everything else in there is boilerplate. Oh, well, change the name and id too, and give it a new name: say make-rain-sprinkler-api instead of make-ll-object-api-example.

So, after this cut-n-paste and light edit of boilerplate code, you now have an object-oriented-programming-style object layer on top of the atomspace that knows how to access a subset of the atomspace as if it were a matrix. From this point on, you can do all the part-(b) stuff, i.e. all the typical matrix-like things: compute observation frequencies (frequency == count/total), marginal probabilities, conditional probabilities, entropies, mutual information, cosine distance, filter/reject values, etc.

That is, the make-rain-sprinkler-api returns the "joint observation count" object N(S,R) where S = Sprinkler turned on (true/false) and R = Raining (true/false). Given N(S,R), use (add-pair-freq-api (make-rain-sprinkler-api)) to get the joint probability function P(S,R) -- because add-pair-freq-api knows how to compute grand-totals, and divide, converting counts into frequencies. The add-pair-freq-api also provides entropies, marginals and conditionals. Overkill for a 2x2 matrix, but whatever.
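
For concreteness, here is the arithmetic that such a frequency layer performs on the 2x2 example above, written out in plain Guile. This is just the math, not the actual add-pair-freq-api calls or their method names:

; Joint counts N(row, col), copied from the EvaluationLinks above.
(define N '(((sunny sprinkle) . 40) ((sunny conserve) . 60)
            ((rainy sprinkle) .  1) ((rainy conserve) . 99)))
(define total (apply + (map cdr N)))                ; 200

(define (joint-p row col)                           ; P(row, col)
  (/ (cdr (assoc (list row col) N)) total))

(define (marginal-p row)                            ; P(row)
  (/ (apply + (map cdr (filter (lambda (kv) (eq? (caar kv) row)) N)))
     total))

(define (conditional-p col row)                     ; P(col | row)
  (/ (joint-p row col) (marginal-p row)))

(display (conditional-p 'sprinkle 'rainy)) (newline)   ; => 1/100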

The wikipedia article has a joint probability function P(G,S,R) -- the overkill route to obtain that would be to use the sheaves code, and try to infer the grammatical dependence of wet grass on rain and sprinkling. The example in that article infers a hidden Markov model (HMM). Now, HMMs have an almost-trivial grammar (the grammar of an HMM is a "regular language", because the HMM itself is a "finite state machine", or actually a "finite state transducer", to be more correct) -- so using sheaves to infer regular grammars and state machines is overkill. The sheaves were designed for context-free grammars (and should work for context-sensitive grammars, but I'm confused about that).

By "grammar" I mean this: go to that wikipedia article, and look at the section " Definitions and concepts" -- "Factorization definition" and look at P(X_1=x_1,... ,X_n=x_n) Then assume that n=20 or 40, and assume that for each X, there are maybe 1K or 10K different values. So there are maybe 1K^20 == 10^60 different joint probabilities in the problem, or maybe 10K^40 == 10^160 joint probabilities, and that 99.99999999% of them are zero or are almost zero. The naive factorization completely fails: Bayesian networks, as naively formulated, are useless. Even if you assume that the network was Markovian (a finite state machine), you'd have 1K^2=10^6 or 10K^2=10^8 conditional probabilities, which is better, except that the grammar of a Markov process is always necessarily a regular grammar -- because a Markov process is a finite state machine. So the naive Bayes net approach to factorization is just .. well, wrong. You can kind-of do better, by using deep-learning instead, but that is a different conversation -- that is what the "skippy.pdf" paper is trying to explain.

So, the matrix infrastructure is designed for highly-sparse joint probabilities, say 1K by 1K matrices or even 100K by 100K, with almost all entries being zero, and the sheaf code is designed for factorization of the n=20 joint-probability function with 10^160 entries, with almost all of them zero. So that infrastructure is complete overkill for a 2x2x2 hidden Markov model.

What can I conclude? -- I'm using sheaves for inference, you're using PLN for inference. What do we have in common? I'm proposing that your PLN-GDTV should be an object-oriented API on top of the atomspace, that can pull counts and frequencies from wherever they actually happen to lie, instead of forcing the user to jam them into some pre-architected Values. The reason for the OO API is to avoid designing a straight-jacket for the actual structure and organization of the data.

ngeiswei commented 5 years ago

@linas said

Generically, in AI/AGI, everything overlaps/intersects, and it's impossible to provide a universe of non-intersecting sets.

I'm not so sure. In many cases things do not overlap. Take sensors, for instance: say a photon-detection sensor; it either registers a photon or it doesn't. Even if the sensor is faulty, the registration itself is still clearly partitioned. Maybe the fact that random variables are functions (i.e. right-unique) is an indication that partitioning isn't that crazy hard...

But I do agree that requiring all data to be perfectly partitioned is too much to ask.

ngeiswei commented 5 years ago

@linas said

I'm using sheaves for inference, you're using PLN for inference. What do we have in common? I'm proposing that your PLN-GDTV should be an object-oriented API on top of the atomspace, that can pull counts and frequencies from wherever they actually happen to lie, instead of forcing the user to jam them into some pre-architected Values.

I like the idea. Based on what you describe, it seems your types of inference are based more on direct evidence, while PLN incorporates indirect types as well (deduction, induction, instantiation, etc.), but that doesn't mean we can't borrow ideas or share code. We're gonna look into it...

linas commented 5 years ago

Inference: there are several kinds. For the natural-language Bayesian net, there are two related problems: 1) discover the rules of syntax, after observing many sentences; 2) given some of the words in a sentence, guess the rest of them.

Compare this to the problem Ben assigned: you see someone with a suitcase in hand, and you know that suitcases are used during travel, so you infer that they might be travelling. This is "common sense reasoning", and is more-or-less what PLN is designed for. But it's also like style-2) inference. (Just as with making a whole sentence from just a few words, there are many inferences one can make given "suitcase", some more likely than others, and all quite context-dependent.)

I'm focusing on 1), and Man Hin has started on 2) (for language), but of course they go hand-in-hand; task 2) requires you to generate a grammatically valid sentence, which you cannot do unless you already know the grammar, and the probabilities.

The issue with the suitcase->travel inference is: where did those common-sense rules come from? They need to come from somewhere, and we don't have any compelling story for obtaining them. So, for common-sense reasoning, it feels like PLN is a pump ready to do type-2 inferences, but has no data to work on.

I think the situation is better for MOZI and genomic/proteomic data: I'm convinced that the language-learning theory, i.e. the type-1) inference, will work well for genomics/proteomics. That will allow you to discern "what's connected to what", a syntax that says things like "gene A and gene B upregulate X unless Y", and then, with a pile of these automatically-learned, extracted relationships, you can start doing actual type-2) inference on them (e.g. using PLN).

Of course, this would be very hard, but no matter how hard it is, doing the bio stuff still seems a lot easier than extracting common sense from .. what ... real-life experience? watching youtube videos? Ooof.

linas commented 5 years ago

Re: The OO API -- I stumbled onto that only after I started cut-n-pasting large blocks of code to handle similar-but-different cases.

A perfectly valid way to proceed is to let @rTreutlein finish the current GDTV, and then start trying it out on different kinds of data. The temptation will be to cut-n-paste code and adapt it to each special case: but this will be the indicator for where the API should go, and what should be in it.

rTreutlein commented 5 years ago

The more I think about this the more I am agreeing with @linas that having such an API to access the data is clearly superior to having a fixed representation.

This means that the user can provide arbitrary sets with associated counts. Because there is no guarantee that the sets on a specific Concept from two DVs will match, the user will also have to provide a way to do set operations on their specific implementation. That in turn requires different DVs to have different types, as they won't all be compatible with each other.

Additionally, besides allowing access to the data, the DV API needs some functions to write the results of PLN inferences back into the AtomSpace.

The primary result of this API should then be an n-dimensional matrix, which can be used to represent the DV of a single concept, joint DVs over an arbitrary number of concepts (the DV of a single concept is actually just a special case of this), and conditional DVs.

To access those matrices, the API implementation for a given DV type will require an Atom to be provided that has a DV of the correct type attached.

I think the easiest approach would be to copy and modify the matrix API to provide the above functionality.

For now, I'll just simplify my current DV code to not use Atoms/ProtoAtoms to store the intervals. This is still sufficient for basic DTVs and will eventually be accessible via the DV API. This representation should also be more efficient, as only one value has to be accessed. More complex DVs would then live directly in the AtomSpace.

linas commented 5 years ago

Thank you @rTreutlein; keep me posted on how it goes. I'm still not entirely clear on how DVs will be used, but that does not have to be answered right away.

The current matrix code does not have any set/subset operations, because I haven't needed them (at that level). I am currently struggling with sets-of-words-having-the-same-meaning, which is a kind of related problem, but it's not clean or well-defined yet: the word "saw" can belong to different sets: seeing-verbs, cutting-verbs and cutting-tool-nouns (a subset of tool-nouns). I don't have any API for this, just assorted ad hoc atomspace structures. Unclear how it will work out.

rTreutlein commented 5 years ago

@linas Maybe this example will help a bit. Generally, we want PLN to use DVs for more robust/flexible inferences instead of the current STVs. So assume the system has learned (or been told) the DVs of "A implies B" and "B implies C". We would, of course, like to calculate the truth-value distribution of "A implies C". Using my fixed representation as an example, we could have:

DV for A implies B
{[0,0.5) : {[0,0.5) : 30 , [0.5,1) : 0} --If the truth for A is in [0,0.5) the same is true for B
,[0.5,1) : {[0,0.5) : 0 , [0.5,1) : 30}
}
DV for B implies C
{[0,0.33)     : {[0,0.5) : 15 , [0.5,1) : 0}
,[0.33,0.66)  : {[0,0.5) : 10 , [0.5,1) : 10}
,[0.66,1)     : {[0,0.5) : 0   , [0.5,1) : 15}
}

Here we can see that B has two different distributions. In "A implies B" it is split into two intervals, whereas the conditional part of "B implies C" has three intervals. So if we have A inside [0,0.5), we know the distribution of B. Each interval is assigned a weight based on its frequency. With [0,0.5) having a weight of 1, we check whether the conditional interval [0,0.33) is inside it, which is the case. The conditional interval [0.33,0.66) is only half inside, because the intersection of [0,0.5) and [0.33,0.66) is only half the size of [0.33,0.66), so it's weighted by that factor. Putting it together we have: {[0,0.5) : 15 , [0.5,1) : 0} * 1 + {[0,0.5) : 10 , [0.5,1) : 10} * 0.5 = {[0,0.5) : 20 , [0.5,1) : 5} as the distribution for C, given that A is in [0,0.5). We do the same for the case of A in [0.5,1), which results in:

DV for A implies C
{[0,0.5) : {[0,0.5) : 20 , [0.5,1) : 5} 
,[0.5,1) : {[0,0.5) : 5 , [0.5,1) : 20}
}

But as you said above, these intervals are just a special case and could theoretically be any kind of set, which is why the user needs to provide a way of doing intersections and of evaluating the size of these sets. I think that is quite a bit different from, and simpler than, your sets-of-words-having-the-same-meaning problem.
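
For concreteness, here is a toy Guile sketch of the overlap-weighting step from the example above. The data structures and helper names are made up (this is not the actual DV code), and 0.33 / 0.66 are treated as exact 1/3 and 2/3 so the arithmetic comes out exact:

; An interval is a pair (lo . hi); a DV row is (interval . list-of-counts).
(define (ivl-size i) (- (cdr i) (car i)))
(define (ivl-overlap a b)                     ; length of the intersection
  (max 0 (- (min (cdr a) (cdr b)) (max (car a) (car b)))))

; Conditional rows of the "B implies C" DV.
(define dv-b=>c
  '(((0 . 1/3)   . (15  0))
    ((1/3 . 2/3) . (10 10))
    ((2/3 . 1)   . ( 0 15))))

; One row's contribution to the distribution of C, given that B lies in
; b-ivl with weight w: scale the row's counts by the overlap fraction.
(define (row-contrib b-ivl w row)
  (let ((frac (* w (/ (ivl-overlap b-ivl (car row))
                      (ivl-size (car row))))))
    (map (lambda (c) (* frac c)) (cdr row))))

; Sum the contributions of all rows.
(define (chain b-ivl w)
  (apply map + (map (lambda (row) (row-contrib b-ivl w row)) dv-b=>c)))

; A in [0,0.5) puts B into [0,0.5) with weight 1:
(display (chain '(0 . 1/2) 1)) (newline)      ; => (20 5)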

linas commented 5 years ago

Perhaps I should not mention this, to avoid over-complicating things, but how about fuzzy values? That is, histogram bins that overlap.

You've defined your distributions as histograms: counts in bins of some fixed width, that are completely disjoint from one another. In real life, data usually doesn't want to fit into bins. My second or third job, as a young adult, was in a physics lab, where lead-glass Cherenkov counters generated gaussian distributions in response to a fixed-energy particle. As the energy went up and down, the center of the gaussian moved up and down. During bin-counting for the histogram, each disjoint bin accumulated counts. But clearly, each bin contained averages over gaussians from many different energies. Any fixed-energy hit might go into any of multiple neighboring bins. How would you know? The answer, if I recall, was a convolution of the detector gaussian with the rectangular bin shape. The convolution had to be pushed through the entire data-analysis pipeline. So, for example, you might have observed

[0,0.5) : 20 
[0.5,1) : 5

but you knew, a priori, that some of the 20 counts should have gone into the upper [0.5,1) bin, and that some of the 5 counts should have gone into the lower [0,0.5) bin. The trick was to track these uncertainties (kind of like "confidence" in PLN; we knew we could not be fully confident in the energy measurement). Oh, and yes, different detectors had different bin-widths, yet all were detecting the same event, so there had to be inference through all of this about what particle went where with which energy. It was a mess. I also blew out an oscilloscope that year. Still don't know how; it just blew. Whatever. Off-topic. I'm rambling and I love it.

Perhaps you could represent overlapping bins with a beta distribution. Or maybe just a truncated gaussian. Dunno. Or this might be over-complicating things at this stage.
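
For what it's worth, here is a toy Guile sketch of that smearing idea: split a single observation across neighboring bins in proportion to the Gaussian mass that lands in each. The bin edges and the sigma are made-up numbers, and the integration is a crude midpoint sum:

(define pi (* 4 (atan 1)))

; Gaussian pdf with mean mu and standard deviation sigma.
(define (gaussian-pdf mu sigma x)
  (/ (exp (/ (* -1 (- x mu) (- x mu)) (* 2 sigma sigma)))
     (* sigma (sqrt (* 2 pi)))))

; Approximate integral of the pdf over [lo, hi) with 1000 midpoint samples.
(define (bin-mass mu sigma lo hi)
  (define n 1000)
  (define dx (/ (- hi lo) n))
  (define (midpoint i) (+ lo (* (+ i 0.5) dx)))
  (* dx (apply + (map (lambda (i) (gaussian-pdf mu sigma (midpoint i)))
                      (iota n)))))

; An event whose true value is 0.45, smeared with sigma 0.1, lands
; mostly in [0,0.5) but partly in [0.5,1):
(display (bin-mass 0.45 0.1 0.0 0.5)) (newline)   ; => about 0.69
(display (bin-mass 0.45 0.1 0.5 1.0)) (newline)   ; => about 0.31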

ngeiswei commented 5 years ago

Interesting exchange.

Regarding imperfect sensors, issue #833 actually discusses that (in sections "Imperfect boolean sensor" and "Funny people at the party") and ends up with the same conclusion that it requires a convolution product.

However, my view on this has shifted. Basically, I think you can always treat a sensor as perfect as long as your memory of it is perfect (I know that's not strictly true, but it's very close to true, even closer if your computer has ECC RAM, etc.), and instead treat reality as imperfect -- or, more precisely, the relationship between a model of something occurring in reality and the sensor in question.

It's a subtle distinction, but it allows you to avoid making any assumptions about the imperfection of a sensor, and instead let the system discover it on its own.

For instance, let's say you have a photon sensor whose activity is represented as

Evaluation <TV>
  Predicate "S"
  Concept "I"

meaning that instance I has been observed to trigger sensor S with truth value TV.

Now, if you consider the sensor to be imperfect, you may have the truth value reflect that: for instance, if the rate of failure is 0.01, the TV will be (stv 1 c) such that c gives the associated second-order distribution a mean of 0.99. In the perfect-sensor case, however, you'd have the TV be (stv 1 1), and the imperfection would instead be represented in the relationship with the existence of photons:

Implication (stv 1 c)
  Predicate "S"
  Predicate "Photon" 
linas commented 5 years ago

OK, a few more meta-comments.

In the physics setting, things tend to operate at the noise limit: you can barely tell apart what's really there from random noise. For bin-counts, this meant the bins were extremely jagged: bins with high counts would be next to bins with really low counts. Naive smoothing was dangerous, because the total width was meaningful (the width is the reciprocal of the particle lifetime; you measure the lifetime by deducing it from the width.)

Regarding sensitivity: last I looked, the very best detectors could detect only one out of two; rates of 1 out of 10 or 100 are more "typical", but this is highly variable, depending on the application. At any rate, a sensitivity of 99 out of 100 is a proverbial wet dream. Different instruments respond differently by a dozen orders of magnitude. Designing pipelines to process the raw data is a big part of the work.

In network science, a.k.a. linguistics, I'm finding that there is no such thing as "the noise limit" in the classical engineering, Shannon-coding-theory sense. Instead, all distributions are Zipfian, always (or derivable from Zipfians -- there are several really nice, super-smooth bell curves in the language data! Which is kind of neat!) The problem with Zipfian distributions is that the more data you collect, the more likely you are to see rare events, and this never tapers off. There is no large-N limit, not even theoretically. Insofar as common-sense reasoning derives from language, I expect the same general behavior.

How do these meta-comments affect the design? No clue. But I figure you should know. In the matrix code, I have some very basic filters to knock out rows and columns, or knock out individual entries. And of course, the hardest part of the project is to figure out how to "take averages" so that "the noise cancels".