souffle-lang / souffle

Soufflé is a variant of Datalog for tool designers crafting analyses in Horn clauses. Soufflé synthesizes a native parallel C++ program from a logic specification.
http://souffle-lang.github.io/
Universal Permissive License v1.0
910 stars 200 forks source link

ADT Representation #2055

Open b-scholz opened 3 years ago

b-scholz commented 3 years ago

A single ADT branch uses two records at the moment, i.e., we have following encoding.

.type B = b1 { fields1...} | b2 { fields2 ..} | ...

A branch is encoded as [<branch-id> [fields_i]] using two records. The logic behind this encoding is that no memory space is wasted. However, it comes not for free - we have a double-unpack for records (via their ordinal numbers) to access the field data.

An alternative way to represent ADT is to use a single record of the format: [<branch-id> e1 ... en] where <branch-id> is the identification of the branch followed by the fields. The number of elements is determined by the longest branch. If branches require less elements, the unused elements are just padded with zero values.

Using an alternative encoding may speed up some applications using ADTs.

quentin commented 3 years ago

In any case it would then be nice to have a simple way to construct values of an ADT using the Souffle API. Currently we have to do a little bit of reverse engineering of the ADT representation encoding into records and ram-values in order to create values:

if the ADT is an enum:
  return the branch index as a ram value
else if the branch is an empty record
  return the record [branch-index, 1]
else if the branch is a single element record
  return the record [branch-index, element-value]
else if the branch has is a multiple elements record
  return the record [branch-index, [multiple elements record]]
b-scholz commented 3 years ago

Indeed, I did not list the special cases for the zero and one arity case.

quentin commented 3 years ago

Having a flat representation should be less memory consuming in the worst case when all the produced values for an ADT have the longest branch(es) form, because it saves one record indirection.

I am personally in favour of simpler, flat ADT encoding. Saving a pack/unpack operation should have a far more positive impact compared to the little negative impact of using more memory in average.

b-scholz commented 3 years ago

The mapping to pack/unpack is encapsulated in the ast2ram translator. Thus, only a few lines in the code need to be changed. The question is whether we would like to keep both encodings or make everything flat.

Ideally, we do some experiments to decide which encoding is better in practice.