gkastrinis closed this issue 4 years ago
I've been reading some Souffle slides I found regarding the $ functor and now I believe I have a better understanding of the situation. My understanding now is that $ is a counter functor and not a random number generator. So the check that you don't get the same value of y for different values of x is not an actual check; it just follows from the fact that some counter is auto-incremented each time a new value for x is evaluated. Is that correct?
Also, when I opened the issue I had in mind more complex cases than the simple one of rule duplication. That said, I couldn't find a more convincing example. I tried a few things but I didn't see anything weird (the same value reappearing). So, I guess you could close the issue as invalid.
Just one question though. It is not clear to me whether the auto-increment on the counter that I mentioned above is global or per relation. I have the following code
.decl A(x:number, y:number)
.decl A2(x:number, y:number)
.decl B(x:number)
.output A, A2, B
A(x, $) :- B(x).
A2(x, $) :- B(x).
B(x) :- x = 1.
B(x) :- x = 1.
B(x) :- x = 3.
and I get
> cat A.csv
1 0
3 1
> cat A2.csv
1 2
3 3
so it seems like it is a global counter. But can I make that assumption? Or does it just happen to hold in my example?
The counter is implemented as a variable of the current RAM environment, and is incremented for each instance of it. So yes, you can think of it as global because the same program won't (I believe) ever be evaluated against multiple different RAM environments. Whether it must be global in the whole semantic model of the thing I really do not know.
The counter is a little weird in the whole semantics of Souffle and Datalog: it isn't like the functors, in that it can occur in a relation and takes no arguments, but it isn't like the variables or constants either, because it never has the same value it previously had, which seems a little anti-declarative. Here's how I made my peace with the semantics:
The constants and variables are really just zero-arity functors which give themselves (i.e. the identity function bound over their own name, if you like). The counter is another zero-arity functor which simply guarantees that its value will be numerically monotonically increasing and unique within relations -- the fact that it does this globally is an implementation detail that doesn't go against anything I've said. Its value is also irrelevant; the only guarantee is that it will be unique and monotonic. A zero-arity functor is the only thing that can occur within the tuple of a predicate. This isn't so much "the way things are" but "a way of thinking" that gives consistency. I hope this helps.
In my setup I care about uniqueness (not really about monotonicity) and the fact that this seems to be a global counter. If that assumption were to break I'd need to come up with a work-around.
Sorry I should have answered earlier. I have been travelling.
Indeed it is a counter, and every time the $ functor is evaluated a new counter value is returned. We needed this counter functor for introducing new contexts for points-to analysis.
This was before records existed. However, we had serious semantic issues with counters (e.g. we could not use them in recursive relations because they violated some required properties of semi-naive evaluation).
I am inclined to retire this feature entirely.
May I ask how you would like to use it in future?
I believe you can replace this feature using records.
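For illustration, here is a minimal sketch (my own, not from the slides; the Id type name is invented) of what replacing the counter with a record could look like for the example above: wrapping x in a record yields a value that is unique per x and, unlike $, is the same every time a rule derives a tuple for the same x.
.type Id = [x : number]
.decl A(x : number, y : Id)
.decl B(x : number)
.output A
B(1).
B(1).
B(3).
// the record [x] plays the role of the generated value: it is stable per x,
// so A ends up with exactly one y for x = 1 and one y for x = 3
A(x, [x]) :- B(x).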
I would really prefer if you didn't drop the feature :)
Currently, I'm auto-generating rules, trying to avoid the problems I had reported in the past with subtyping and records. I'm simulating the constructors from LB, and the $ functor is really crucial in my current approach.
I see. Are you sure you cannot do it via records, in the sense that you could introduce a selector record aggregating various record types?
.type R1 = [x:number]
.type R2 = [x:number, y:number]
.type R3 = [x:number, y:number, z:number]
.type S = [r1:R1, r2:R2, r3:R3]
where S combines all three record types. To simulate polymorphism let's assume that only one of the fields in S is set; all others are set to nil.
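For illustration, a minimal sketch (relation name and fact values invented here) of how such a selector record might be populated and unpacked, assuming exactly one field is non-nil per tuple:
.decl Value ( s : S )
.decl FromR2 ( x : number, y : number )
.output FromR2
// only the r2 field is set; r1 and r3 are nil
Value([nil, [7, 8], nil]).
// unpack the record via equality and select the tuples whose r2 field is set
FromR2(x, y) :- Value(s), s = [r1, r2, r3], r1 = nil, r3 = nil, r2 = [x, y].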
Hmm, let me think about it and I will reply again. Thanks for the suggestion!
I thought about it (and I think I had in the past as well), and though in general it is a nice approach, I believe it will not work in my setting.
I want to support subtyping. And also relations that represent types.
So for two types Shape and Triangle with some constructors each, I need the following rule
Shape(x) :- Triangle(x).
If I follow your suggestion, Shape and Triangle might have totally different structures. And the rule above will not be easy to generate.
So for now, I think I will stick with using $. Do you have any strong reason why you would like to remove the feature? Or is it just for cleanliness?
To form sub-type relations you might wanna consider the following approach:
Whenever you have inheritance in C++ (or Java), you are extending a base struct containing a set of fields with a list of extra fields. Thus, you might have:
struct Shape { int color; };
struct Circle : public Shape { int diameter; int radius; };
struct Square : public Shape { int length; };
What the compiler does for you is actually creating three data types that have the structure
struct Shape { int color; };
struct Circle { int color; int diameter; int radius; };
struct Square { int color; int length; };
and remembering that some types can be (implicitly) cast into others.
You can do the same with Datalog relations.
.decl Shape ( color : number )
.decl Circle ( color : number , diameter : number , radius : number )
.decl Square ( color : number , length : number )
and by adding the rules
Shape(c) :- Circle(c,_,_).
Shape(c) :- Square(c,_).
you get that every circle and every square is a shape.
However, you might now ask how to distinguish those. The thing is, object-oriented languages give you one more thing -- object identities. In simple words, every object has an address, and you can use the address to distinguish objects. This you do not have in Datalog.
Here you have to think more like in the database world. The equivalent of an object id in a relational database is a primary key. You will have to define one here as well. Since color is not a suitable one, you might have to resort to an artificial primary key, e.g. some sort of ID:
.decl Shape ( id : number, color : number )
.decl Circle ( id : number, color : number , diameter : number , radius : number )
.decl Square ( id : number, color : number , length : number )
with
Shape(i,c) :- Circle(i,c,_,_).
Shape(i,c) :- Square(i,c,_).
This way you can figure out whether an 'object' stored in the shape relation is a circle or a square, by joining on the respective sub-type relation.
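For instance, a small illustrative rule (the IsCircle relation is just an example name, not from the thread) that performs this join:
.decl IsCircle ( id : number )
// a shape id denotes a circle if it also appears in the Circle relation
IsCircle(i) :- Shape(i,_), Circle(i,_,_,_).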
Of course, now you are left with the ID field, which needs to be filled. This is what you have been using the $ for. Ideally, in your use case you can identify another combination of attributes to form a primary key for your base class, to avoid this issue. But if this is not possible, you can use an artificial ID as shown here, and you should fill the ID field through your fact files. If you encode facts manually, just give them different numbers.
One crucial property of pure Datalog is that during the evaluation no new field values can be produced; only new combinations of already existing elements can be created. From an abstract point of view, a pure Datalog program cannot introduce new circles or squares during the evaluation. What is not in the input facts cannot be in the results.
The $ is an extension that would allow you to do so, but it has an impact on the performance and prevents certain optimizations. It should not be used if not absolutely necessary. To represent elements of a hierarchy of objects, this should not be necessary.
Thanks for your extended suggestion but I am afraid I confused you with the word "constructors". I wasn't talking in the context of C++ but in the context of LogicBlox.
LogicBlox is not pure Datalog in the sense that it allows for constructor predicates, that is, predicates that take a list of values, e.g. a pair (x, y), and produce a unique value from that pair. The same x, y values will always produce the same value, and different x, y values are guaranteed to produce different values.
This is extremely useful, for example, when you want to create a context value in a static analysis setup. In one of the Souffle slides, the $ functor is mentioned to have been introduced for exactly that reason.
Using it (in a strict way) makes things relatively clean and clear.
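Concretely, here is a toy sketch of that guarantee (my own example, expressed with a Souffle record purely for illustration): the same input pair always maps to the same constructed value, and different pairs map to different values.
.type Pair = [ x : number, y : number ]
.decl Input ( x : number, y : number )
.decl Constructed ( x : number, y : number, id : Pair )
.output Constructed
Input(1, 2).
Input(1, 2).
Input(3, 4).
// forming the record acts like a constructor: the duplicated (1, 2) inputs
// yield a single Pair value, while (3, 4) yields a different one
Constructed(x, y, [x, y]) :- Input(x, y).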
In my case now, I try to automate code generation that would produce valid Souffle code, and I have the following constraints: 1) types can have multiple constructors, and 2) types can have subtypes.
Without (2) I could try the solution that Bernard suggested --though I am not sure if records in Souffle work with self recursion (e.g. type AccessPath is either a string or a record of string x AccessPath).
But with subtyping, I fail to see how I could use the same method.
A type T1 might have 2 constructors so I would have
.type R1 = [x:symbol]
.type R2 = [x:number, y:symbol]
.type T1 = [x:R1, y:R2]
but then a type T2 that is a subtype of T1 might have a constructor as well.
.type R3 = [x:symbol, y:symbol]
.type T2 = [x:R3]
Subtyping means that if a predicate has an argument of type T1, I could feed it with entries from a predicate with arguments of type T2.
But T1 and T2 don't have the same structure.
The only(?) awkward solution that I see is to include in T1 all the subtypes as well.
.type T1 = [x:R1, y:R2, z:R3]
If I'm not mistaken, the theory behind constructors (in LB) is Skolem functions: http://mathworld.wolfram.com/SkolemFunction.html
So, the good news is that recursive records are supported by Souffle. Also, records are the constructs within Souffle which produce a unique ID value for arbitrary tuples (x,y,z,...), as you desire.
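As a minimal sketch of the first point (my own toy example, echoing the AccessPath idea mentioned above): the record type may refer to itself, with nil terminating the nesting.
.type AccessPath = [ field : symbol, rest : AccessPath ]
.decl Path ( p : AccessPath )
.decl Decompose ( f : symbol, rest : AccessPath )
.output Decompose
// the access path "a.b", written as nested records, terminated by nil
Path(["a", ["b", nil]]).
// unpack the head field and the remaining path via record equality
Decompose(f, rest) :- Path(p), p = [f, rest].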
The bad news is that the support for
Subtyping means that if a predicate has an argument of type T1, I could feed it with entries from a predicate with arguments of type T2.
is limited to non-record values only. This is a known limitation, also raised as an issue by yourself in #380. Sub-type relations among records are not supported. @b-scholz, has somebody ever worked on this?
The ugly truth: the main problem with this is that for any field that may contain a value of the base type, it is not clear which constructor should be used for matching when 'opening up' this value. It might be one of many alternatives. To make this safe, dynamic type information would have to be passed along and checked when attempting to read the value. Developing this would take quite some effort. Or we could just trust the user, for whom it would be next to impossible to debug such an issue when s/he gets it wrong. Also not a great approach.
Do we still need the counter or can we retire the feature in the presence of records?
If I understand correctly, the $ means: generate a random numerical value that is unique in the given predicate, right? I have the following code
My issue is that although two different values for x will generate different values for y, the same value for x (1) will generate different values for y (imagine a more complicated example, not just rule duplication). Is it possible to change the behavior and have the same (single) value for x? Since different values for x generate different values for y, there should be some kind of checking already.