revbayes / revbayes

Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language
http://revbayes.com
GNU General Public License v3.0

[Feature Request] Ideas for improving containers #370

Open mlandis opened 1 year ago

mlandis commented 1 year ago

Here are a few ideas of container-based features, borrowed from R and Python, that we might be able to support in Rev.

Data tables

Real-world data table files (e.g. CSV files) may contain fields of different types (int, float, str, etc.).

RevBayes does not have a data table type, but it can read data table files as two-dimensional vectors. RevBayes behaves well when all values in a CSV file can be converted into the same type. For example:

$ cat example1.csv
col1,col2
0.1,0.3
0,0.2

is read as

> x1 = readDataDelimitedFile("example1.csv", delimiter=",", header=true)
> x1
   [ [ 0.1000, 0.3000 ] ,
     [ 0.0000, 0.2000 ] ]
> type(x1)
   MatrixRealPos

However, when a file contains multiple distinct types, the resulting type is a generic RevObject[][]. For example:

> x2 = readDataDelimitedFile("example2.csv", delimiter=",", header=true)
> x2

   RevObject[][] vector with 2 values
   ==================================

   [1]

   RevObject[] vector with 2 values
   ================================

   [1]
   0.1

   [2]
   cat

   [2]

   RevObject[] vector with 2 values
   ================================

   [1]
   0

   [2]
   NA

> type(x2)
   RevObject[][]

Part of the problem comes from relying on a two-dimensional vector to represent the table. Each row vector must have elements of a single type, so any row whose columns hold different types gets cast to the most generic type, RevObject.

A solution would be to add a DataTable object. This could then store vectors across columns (not rows) while also supporting more advanced ways of indexing (e.g. column names, slice-indexing, etc.).

Example of data table use:

x = readDataTable("my_file.txt", header=true, delimiter=",")
x[1:2, 3]
    height
    3.14
    3.21
x.col[:, ["height", "width"]]
    height  width
    3.14    7.13
    3.21    4.55
    3.77    4.74
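To make the column-oriented idea concrete, here is a minimal sketch in Python rather than Rev. Everything in it is hypothetical: the `DataTable` class, the `infer_column` helper, and the `label` column are made up for illustration, and only the `height` and `width` values come from the example above. The point is that each column keeps its own element type, so one string-valued column does not force the numeric columns to a generic type.

```python
import csv
import io


def infer_column(values):
    """Convert a column of strings to the narrowest type that fits every value."""
    for caster in (int, float):
        try:
            return [caster(v) for v in values]
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return list(values)  # fall back to plain strings


class DataTable:
    """Hypothetical column-oriented table: one typed list per column name."""

    def __init__(self, names, columns):
        self.names = list(names)
        self.columns = dict(zip(self.names, columns))

    @classmethod
    def from_csv(cls, text, delimiter=","):
        rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
        header, body = rows[0], rows[1:]
        # zip(*body) transposes the rows into columns.
        return cls(header, [infer_column(col) for col in zip(*body)])

    def select(self, names, rows=slice(None)):
        """Return the named columns, restricted to a slice of rows."""
        return {name: self.columns[name][rows] for name in names}


# The "label" column is invented to show a mixed-type file.
table = DataTable.from_csv(
    "height,width,label\n3.14,7.13,cat\n3.21,4.55,dog\n3.77,4.74,NA\n"
)
print(table.select(["height", "width"], slice(0, 2)))
```

In this sketch, `height` and `width` stay numeric while `label` is stored as strings. A real implementation would also need a policy for missing values such as `NA`, which this sketch simply treats as text.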

Slice-indexing

Currently, we can either access the entire vector or individual vector elements. It should be possible to add basic support for slice-indexing. Current behavior:

> y = [0, 1, 2, 3, 4]
> y[0:3]
   Error:   Argument or label mismatch for function call.
   Provided call:
   [] (Natural[]<constant> 'index' )

   Correct usage is:
   [] (Natural<any> index)

> y[ [0, 1, 2] ]
   Error:   Argument or label mismatch for function call.
   Provided call:
   [] (Natural[]<constant> 'index' )

   Correct usage is:
   [] (Natural<any> index)

Desired behavior:

> y = [0, 1, 2, 3, 4]
> y[0:3]
    [0, 1, 2]

> y[ [0, 1, 2] ]
    [0, 1, 2]
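The desired behavior amounts to letting the index operator accept a single index, a range, or a vector of indices. A rough Python sketch of that dispatch follows. The `Vector` class is hypothetical and not part of RevBayes, and it uses Python's 0-based, half-open slices, which happen to match the desired output above. Which convention Rev should adopt is a separate question.

```python
class Vector:
    """Hypothetical vector whose index operator accepts three kinds of index."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # y[0:3] -> a new vector holding the selected range
            return Vector(self.values[index])
        if isinstance(index, (list, tuple)):
            # y[[0, 1, 2]] -> a new vector holding the listed positions
            return Vector(self.values[i] for i in index)
        # y[3] -> a single element
        return self.values[index]

    def __eq__(self, other):
        return isinstance(other, Vector) and self.values == other.values

    def __repr__(self):
        return f"Vector({self.values})"


y = Vector([0, 1, 2, 3, 4])
print(y[0:3])        # Vector([0, 1, 2])
print(y[[0, 1, 2]])  # Vector([0, 1, 2])
print(y[3])          # 3
```

The errors shown under "Current behavior" suggest that the existing `[]` operator only has an overload for a single `Natural` index, so an overload taking `Natural[]` might cover both the range and the index-vector cases.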

Dictionaries/maps

It'd be nice to be able to use dictionaries or maps as unordered containers. Ideally, keys and values could be of any type. For example:

x = Dictionary()
x["my_tree"] = readTrees("my_tree.tre")[1]
x["my_data"] = readDiscreteCharacterData("my_data.nex")

Dictionaries of containers (vectors or other dictionaries) could be useful, too.

bredelings commented 1 year ago

Pairs and Tuples

Another thing that would be useful to have is tuples. Pairs are a special case: a 2-tuple.

Tuples are different from vectors because each element of a tuple can have a different type, whereas every element of a vector must have the same type.

In C++, we have the type std::pair<T1,T2> for pairs. It would be nice to have the same thing in RevBayes.

If we have a type like Vector<Pair<String,Int>>, then this is one way to implement a dictionary, although not the most efficient, because looking up a key means scanning the pairs one by one.

dict = [("alice",1), ("bob",2)]
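As a rough illustration of why a vector of pairs works as a dictionary but is not the most efficient choice, here is a Python sketch. The `lookup` helper is hypothetical. It scans the pairs in order, so a lookup takes time proportional to the number of entries, whereas a hash-based map finds a key in roughly constant time on average.

```python
from typing import List, Tuple

Pair = Tuple[str, int]


def lookup(pairs: List[Pair], key: str) -> int:
    """Linear scan over a vector of (key, value) pairs."""
    for k, v in pairs:
        if k == key:
            return v
    raise KeyError(key)


entries = [("alice", 1), ("bob", 2)]

# Vector-of-pairs lookup: checks entries one by one.
print(lookup(entries, "bob"))  # 2

# The same data in a hash-based map, for comparison.
index = dict(entries)
print(index["bob"])  # 2
```

One advantage of the vector-of-pairs form is that it keeps the insertion order explicit and needs no new container type beyond the pair itself.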

Implicitly but strongly typed

Sebastian noted that combining implicit and strong typing can be complicated. However, languages like Rust are strongly typed while inferring the types of most local variables, so there is a lot of prior art here.