opencog / atomspace

The OpenCog (hyper-)graph database and graph rewriting system
https://wiki.opencog.org/w/AtomSpace

Histogram representation for DVs #1992

Open rTreutlein opened 5 years ago

rTreutlein commented 5 years ago

I am trying to decide between a couple of options for how to structure the Histograms used for DVs, and failing. So I'll explain all the options I see and why they don't fulfill the requirements.

First, how it worked until now. I have been using Histograms where the bins vary in size and are allowed to overlap. The advantage is a very flexible and accurate system that's easy to work with. For example, if you want to merge 2 Histograms, just combine the bins that are the same and keep all the others. I even started thinking about allowing bins that are not continuous uniform distributions but any kind of distribution, to be even more accurate. Which is where I started suspecting something had gone horribly wrong: using distributions to approximate other distributions would certainly be quite accurate, but that wouldn't be a histogram any longer. The reason to use a Histogram is to get a small representation of the distribution, so having an unlimited number of overlapping Bins also seemed wrong.
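To make that merge rule concrete, here is a minimal sketch in C++, assuming a hypothetical 1-D bin keyed by its bounds (none of these names are from the actual DV code): bins with identical bounds combine their counts, all others are simply kept.

```cpp
#include <map>
#include <utility>

// Hypothetical 1-D bin, keyed by its (lower, upper) bounds.
using Bounds = std::pair<double, double>;
using Histogram = std::map<Bounds, double>;  // bounds -> count

Histogram merge(const Histogram& a, const Histogram& b)
{
    Histogram out = a;
    for (const auto& [bounds, count] : b)
        out[bounds] += count;  // same bounds: add counts; new bounds: keep the bin
    return out;
}
```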

So let's not allow overlapping bins. We lose a bit of flexibility and accuracy, but that's okay; we never expected the Histograms to be super accurate, and there are some good algorithms out there for dynamically changing the bins of a Histogram so that it stays accurate without requiring too many bins. What about merging 2 Histograms together? Well, just split the bins wherever there is an overlap. That seems simple and works quite well for normal Histograms, but we also need n-dimensional Histograms to represent joint probability distributions. And there it gets tricky: the complexity jumps from O(n+m) to at least O(n*m), where n and m are the number of bins in each histogram. The resulting histogram would also have a lot more bins, so we might want to use some of those dynamic histogram algorithms to merge some of the bins, but of course in n dimensions that's not straightforward.
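A rough 1-D sketch of the split-at-every-overlap merge, assuming non-overlapping [lo, hi) bins with uniform density inside each bin. The union of all bin edges becomes the edge set of the result, which is exactly why the n-dimensional analogue blows up toward O(n*m) regions.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

struct Bin { double lo, hi, count; };

// Fraction of bin b's count falling in [lo, hi), assuming uniform density.
static double overlap(const Bin& b, double lo, double hi)
{
    double w = std::min(b.hi, hi) - std::max(b.lo, lo);
    return w > 0 ? b.count * w / (b.hi - b.lo) : 0.0;
}

std::vector<Bin> merge(const std::vector<Bin>& a, const std::vector<Bin>& b)
{
    std::set<double> cuts;  // union of all bin edges from both histograms
    for (const Bin& bin : a) { cuts.insert(bin.lo); cuts.insert(bin.hi); }
    for (const Bin& bin : b) { cuts.insert(bin.lo); cuts.insert(bin.hi); }
    if (cuts.size() < 2) return {};

    std::vector<Bin> out;
    for (auto it = cuts.begin(); std::next(it) != cuts.end(); ++it) {
        double lo = *it, hi = *std::next(it), c = 0.0;
        for (const Bin& bin : a) c += overlap(bin, lo, hi);
        for (const Bin& bin : b) c += overlap(bin, lo, hi);
        if (c > 0) out.push_back({lo, hi, c});
    }
    return out;  // one bin per gap between consecutive cut points
}
```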

The main issue here is that the bins have variable sizes and there could even be gaps between them. We don't really need to remove variable sizes completely, but they need to be structured more clearly. So we could use binary trees for 1-dimensional histograms, quadtrees for 2-dimensional ones, and so on. Now that the bins are more clearly structured, finding overlaps gets easier, and we don't need to create too many new Bins; instead we distribute the count over the Regions on a given level of the tree. This has the disadvantage of not allowing Intervals that contain only a single point, which would be required to represent boolean logic with DVs.
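In one dimension the tree idea might look roughly like this (illustrative names only): each node covers a fixed half of its parent's region and stores only a count, so two trees over the same root interval merge level by level without creating irregular new bins.

```cpp
#include <memory>

struct Node {
    double count = 0.0;
    std::unique_ptr<Node> left, right;  // fixed halves of this node's interval
};

// Merging distributes counts over matching regions; no bin splitting needed.
void merge(Node& into, const Node& from)
{
    into.count += from.count;
    if (from.left) {
        if (!into.left) into.left = std::make_unique<Node>();
        merge(*into.left, *from.left);
    }
    if (from.right) {
        if (!into.right) into.right = std::make_unique<Node>();
        merge(*into.right, *from.right);
    }
}
```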

So those are the options as far as I can see. I'm really not sure which way to go from here, but hopefully somebody else has a brilliant idea.

bgoertzel commented 5 years ago

To deal with n-dimensional histograms efficiently, couldn't you use cover trees to approximate n-dimensional probability distributions?

This paper explains how to represent probability distributions using kd-trees,

https://pdfs.semanticscholar.org/00b3/87db5659824ba80c2f83a66abaa0eb6c7500.pdf

and this paper explains how to merge (simplified) cover trees

http://proceedings.mlr.press/v37/izbicki15.pdf

(cover trees are more efficient representations than kd-trees, adapting their structure to the data in a nice way....)

This is somewhat complicated, however; as your email indicates, the simplest solutions get problematic in multiple dimensions...

-- Ben


ngeiswei commented 5 years ago

@rTreutlein, if I understand correctly, you are looking for a technique to "lossy compress" multidimensional histograms, like in the publication you mentioned for 1-d histograms?

I know nothing about the subject. But if you can't find any existing literature on that, I think @bgoertzel's suggestion is pretty good. As a bonus, it would work for any space, not just multidimensional intervals, as long as you can provide a distance between its elements. What a cover tree would offer is an efficient way to retrieve neighbors of, say, multidimensional intervals, and then you could agglomerate them according to similarities, counts, or such.

rTreutlein commented 5 years ago

@bgoertzel In that paper, they store all the samples in a kd-tree, which I don't think we want to do. We want to have some Intervals with an associated count that approximate the distribution. And kd-trees, and I assume also cover trees, are for point-like data, not Intervals. Now, we could use R-Trees; they can store Intervals without issue, but that would only be a performance improvement to my second option. The fundamental problem remains: merges generate lots of new intervals that are then hard to combine into a few intervals that still represent the distribution decently. The issue being that we don't want to just merge the R-Trees but also the Intervals contained in them.

rTreutlein commented 5 years ago

But it might be possible to build a kd-tree that only contains the count of the elements in each section. This would be similar to my third option, with the benefit of handling higher dimensions better. And I think this might even allow for Singleton Sets, or something very close to them, which was the main downside of my option 3. So I think I am going to try that. It seems promising, but we will only know for sure once I have a prototype working.
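A very rough sketch of what such a count-only kd-tree could look like (all names hypothetical, and the split selection is deliberately naive): each node stores a count for its whole region rather than the samples themselves, and a leaf whose region has shrunk far enough can stand in for a near-singleton set.

```cpp
#include <memory>
#include <vector>

struct KdNode {
    double count = 0.0;
    int axis = 0;        // dimension this node splits on
    double split = 0.0;  // splitting coordinate on that axis
    std::unique_ptr<KdNode> lo, hi;
};

void insert(KdNode& node, const std::vector<double>& point,
            double weight, int depth, int maxDepth)
{
    node.count += weight;                // region counts, not stored samples
    if (depth >= maxDepth) return;       // cap tree size, like a bin budget
    auto& child = point[node.axis] < node.split ? node.lo : node.hi;
    if (!child) {
        child = std::make_unique<KdNode>();
        child->axis = (node.axis + 1) % (int)point.size();
        child->split = point[child->axis];  // naive: split at first sample seen
    }
    insert(*child, point, weight, depth + 1, maxDepth);
}
```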

ngeiswei commented 5 years ago

Cover trees are data agnostic; all they need is a distance measure. Here's a Haskell implementation: https://github.com/mikeizbicki/hlearn. They even claim it's the fastest, though the project hasn't been updated since 2016.

rTreutlein commented 5 years ago

Well, the problem with cover trees is that they don't help us with merging intervals, which, as I said, gets very messy the more dimensions you have. For that, I wanted a data structure that could not only store the intervals well but also provide some structure that would help improve the merging process.

I hoped kd-trees would provide that, which they kind of do, but you can't really rebalance them. That is important if you don't know anything about the distribution beforehand, as it allows you to dynamically improve the histogram. You could use quad/octrees; they are even more regular and can be updated dynamically, but you can't really use them well unless you know a priori what the maximum and minimum values of the distribution are. We don't, and there might not actually be a max or a min. Additionally, of course, such trees tend to have a lot of empty nodes.

I think the problem is just that in higher dimensions, dealing with intervals comes with a lot of tradeoffs, and because the DistributionalValue should work for a lot of things, it's really hard to make that choice. To solve that, I think we need to give up on the idea of storing Intervals and instead just store points representing the "centers" of Intervals. Those we can store in a multi-dimensional variant of an AVL tree with a fixed size. This might cause a loss in accuracy when working with these values, because there are no longer clear boundaries/Intervals, though depending on how well those boundaries would have been chosen, it might end up being a wash. I have a prototype that works well so far, but I still have to try it and see what the results are for actual calculations.
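For illustration, the fixed-size weighted-centers idea could look something like the following, with a plain linear scan standing in for the tree index (the actual prototype's structure is not shown here): keep at most K weighted points and, when over capacity, fuse the two closest centers into their weighted mean.

```cpp
#include <limits>
#include <vector>

struct Center { std::vector<double> pos; double weight; };

static double dist2(const Center& a, const Center& b)
{
    double d = 0.0;
    for (size_t i = 0; i < a.pos.size(); ++i) {
        double t = a.pos[i] - b.pos[i];
        d += t * t;
    }
    return d;
}

// Assumes capacity >= 1; a spatial tree would replace the O(K^2) pair scan.
void add(std::vector<Center>& h, Center c, size_t capacity)
{
    h.push_back(std::move(c));
    if (h.size() <= capacity) return;

    // Find the closest pair of centers.
    size_t bi = 0, bj = 1;
    double best = std::numeric_limits<double>::max();
    for (size_t i = 0; i < h.size(); ++i)
        for (size_t j = i + 1; j < h.size(); ++j)
            if (double d = dist2(h[i], h[j]); d < best) { best = d; bi = i; bj = j; }

    // Fuse them into a single weighted mean, keeping the size fixed.
    double w = h[bi].weight + h[bj].weight;
    for (size_t k = 0; k < h[bi].pos.size(); ++k)
        h[bi].pos[k] = (h[bi].pos[k] * h[bi].weight + h[bj].pos[k] * h[bj].weight) / w;
    h[bi].weight = w;
    h.erase(h.begin() + bj);
}
```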

linas commented 1 year ago

FWIW, the FloatValue is a vector of floats. It would be straightforward to write a HistogramLink. For an example, cut-n-paste the code in AccumulateLink, which just sums up all of the floats in the vector. Hack the guts to make a histogram instead of a sum.
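A minimal sketch of the binning "guts" such a HistogramLink could have, assuming fixed-width bins over [lo, hi); all AtomSpace plumbing (Atom types, execution wiring) is omitted, and the function just works on the raw float vector.

```cpp
#include <vector>

std::vector<double> histogram(const std::vector<double>& values,
                              double lo, double hi, size_t nbins)
{
    std::vector<double> counts(nbins, 0.0);
    double width = (hi - lo) / nbins;
    for (double v : values) {
        if (v < lo || v >= hi) continue;       // drop out-of-range samples
        size_t i = (size_t)((v - lo) / width);
        if (i >= nbins) i = nbins - 1;         // guard against rounding at hi
        counts[i] += 1.0;
    }
    return counts;  // could be wrapped back into a FloatValue
}
```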