usethesource / vallang

Generic immutable recursive data representation API targeted at source code models and more.
Other
36 stars 12 forks source link
algebraic-data-types biginteger database functional graphs library maps rational-numbers relational-algebra serialization sets

Vallang

Build Status Maven Release

What is Vallang?

Vallang is a highly integrated and mostly-closed collection of mutually recursive fundamental data-types on the Java Virtual Machine:

Operations on these data-types are too many to list here. A selection is listed below, but you should expect the features to be pretty low level; i.e. directly accessing and manipulating the data rather than providing analysis algorithms. Algorithms in the library are added only if programming them below the abstraction layer of vallang provides a major efficiency benefit or it can factors out highly common client code into a reusable feature. More on this design decision later.

Vallang has a type system based on type systems in functional programming, but note that each value has a most specific dynamic type associated always at run-time. More on the design of the type system below, but here is a list of vallang types:

Sub-typing is co-variant for the container types list, map, set and tuple. Otherwise these rules define the entire type system:

There exists an extension mechanism for adding type kinds and their associated value kinds to the vallang system. Rascal, for example, uses this to represent functions and co-routines at run-time. The extension mechanism works by declaring a bi-directional transformation between the extensions and a symbolic representation of choice (chosen freely from the core representation mechanisms of vallang). This bidirectional mapping is mostly used when serializing and deserializing values (see below).

The types of vallang in Java are represented by a Composite design pattern with maximally shared instances of (the otherwise opaque) abstract class Type. These types expose fast implementations of sub-type and type equivalence for implementing fast pattern matching.

The values of vallang are all instances of IValue and sub-interfaces thereof. For every kind of value there is an interface, e.g. ISet, IList, ITuple and IConstructor but they are not type-parametrized because Java's type system can not represent the aforementioned co-variant sub-typing rules we require.

Why does Vallang exist?

vallang is a UseTheSource project recently renamed from rascal-values, which was known earlier as pdb.values.

The project started as a part of the IDE metatooling platform in 2007 as a generic library for representing symbolic facts about source code, for use in the construction of IDE services in Eclipse, then it continued to form the basis of the run-time system for Rascal starting 2009, and finally was renamed to vallang to serve a wider scope.

We designed of vallang based on experience with and studying the ATerm library and ASF+SDF, but also by learning from RSF (Rigi Standard Format), Rscript and GXL and S-expressions. Perhaps JSON and YAML have also had a minor influence.

The main purpose of vallang is to provide a flexible and fully typed collection of symbolic representations of data, specifically "ready" to represent facts about software systems but amenable to basically any form of symbolic data analysis purposes.

This purpose aligns with the mission of the Rascal metaprogramming language which is made to analyze and manipulate exactly such symbolic representations. Therefore vallang is the run-time environment for both interpreted and compiled Rascal programs.

Note that while vallang is a great fit for symbolic data analysis, it is currently not the best fit for numerical data analysis as it features only a uniform symbolic represetation of numbers of arbitrary precision and size (ints, reals, rationals). In other words, the numbers and collections of numbers in vallang are optimized for storage size, clarity and equational reasoning rather than optimal computational efficiency. This also means that indirect numerical encodings of data (i.e. using numerical vectors and matrices), which are often used in symbolic analyses to optimize computational efficiency are not the right strategy when using vallang: it's better to stick with a more direct symbolic representation and let vallang maintainers optimize them.

Next to the maintainers of Rascal, the main users of vallang are currently programmers who write data acquisition and (de)serialisation adapters for the Rascal ecosystem:

Nevertheless vallang is a generic and Rascal-independent library which may serve as the run-time system for other programming languages or analysis systems, such as term rewriting systems, relational calculus systems, constraint solvers, model checkers, model transformations, etc.

The latter perspective is the reason for the re-branding of rascal-values to vallang. You might consider vallang as a functional replacement for ECore, an alternative to the ATerm library on the JVM, or an alternative to JSON-based noSQL in-memory database systems, or a way of implementing graph databases.

Finally, vallang is a JVM library because that is where we needed it for Rascal and the Eclipse IDE Metatooling Platform. We hope other JVM programmers will also benefit from it and we have no plans of porting it at the moment to any other technological space.

What are the main design considerations of Vallang?

Vallang values are symbolic and immutable.

We think software analysis is complex enough to be confusing to even the most experienced programmers. Manipulating huge stores of hierarchical and relational data about software easily goes wrong; trivial bugs due to aliasing and sharing data between different stages of an analysis or transformation can take weeks to resolve, or worse: will never even be diagnosed.

Since our goal is to provide many more variants of all kind of software analyses, we wish to focus on the interesting algorithmic details rather than the trivial mistakes we make. Therefore, vallang values are immutable. Sharing of values or parts of values is allowed under-the-hood but is not observable. The library is implemented using persistent and/or maximally shared data structures for reasons of efficiency.

Users of vallang freely share references to their data to other parts of an analysis because they know the data can not change due to an unforeseen interaction. We also believe that the immutable values can be shared freely between threads on the JVM, but there are not enough tests yet to make such a bold claim with full confidence.

Vallang values are generated via the AbstractFactory design pattern and do not leak implementation representations

The reason is that client code must abstract from the implementation details to arrive at the mathematical precision of symbolic reasoning which vallang should provide.

This also serves a maintenance and evolution purpose for implementations of the library. We can plug in a new implementation of the library without affecting client code.

Note that for efficiency reasons values produced from different implementations of an abstract value factory (different implementations of IValueFactory) are not required to interact correctly.

Vallang values uniquely deserialize/serialize from/to a standard and simple expression language

The general rule is that for any two JVM object reference o and p to any vallang object the following rule holds: o.toString().equals(p.toString) <==> o.equals(p)

We currently random test this rule and it sometimes fails due to a deprecated feature called "annotations" which we are removing to make the above contract true.

The intended effects of the toString/equals contract of vallang are the following:

The latter point is one of the main reasons why vallang is called a language. The result of anyValue.toString() is a member of a precisely defined textual language. The full textual language is generated from the value type, and sub-languages are generated from the respective sub-types. void is the empty language. In this manner the types of vallang act like non-terminals of a precise context-free grammar. The vallang language as defined above is a strict sub-language of the Expression sub-language of Rascal.

The other reason why vallang is names as a language is because the implementations of the IValue interface and its sub-interfaces are seen as a closed combinator language for computations on the values, and their implementations are interpreters for this language.

Vallang values always know their most-precise concrete ad-hoc run-time type

Vallang values include both trees and relations

Even though both trees and relations are generic enough to represent any data, sometimes a graph or table is more natural than a tree and sometimes the other way around.

Rascal is a language which can be used to easily switch between different representations of the same information, using pattern matching, querying, comprehensions, etc. From vallang you should not expect any help in this regard: the choice of representation for any information is a key design decision for the user of vallang.

Vallang supports query and (persistent) updates to all symbolic values efficiently

Vallang equality checking is fast

Who contributed to Vallang?

and occasional contributions from others please see github's factual overview

What is in the near future for Vallang?

  1. Removal of the "annotations" feature, which is completely replaces by the "keyword fields" feature. The main differences between these features are:
    • While they both offer extensibility to the set of names and typed fields of nodes and constructors, annotations can never influence equals() while keyword fields always do.
    • Syntactically the notation for keyword fields is more compact: f()[@myAnno=1] versus f(myField=1)
  2. Further integration of the capabilities of Capsule for persistent and optimized immutable collections under the hood of IMap, ISet, IRelationAlgebra:
    • Reflexive relations with two indices (for both columns)
    • Heterogeneous collections of numbers (unboxing down to primitive types to safe space)
    • Smooth and incremental transitions from map to multimap representations
  3. IBag, the bag[&T] type