kowainik / tomland

🏝 Bidirectional TOML serialization
https://kowainik.github.io/posts/2019-01-14-tomland
Mozilla Public License 2.0
122 stars 39 forks source link

Implement our own 'PrefixTree' #29

Closed chshersh closed 6 years ago

chshersh commented 6 years ago

After thinking a little bit more on how to store Key components I came to conclusion that we really want to store component on each level of Toml data type instead of storing whole key.

Currently we have the following data types:

newtype Key = Key { unKey :: NonEmpty Text }

data TOML = TOML
    { tomlPairs  :: HashMap Key AnyValue
    , tomlTables :: HashMap Key TOML
    }

So each level can contain whole key. This is not really convenient actually... Because if you want to insert some key into such map, you need to scan whole HashMap in order to split by greatest common prefix. This sad and not really convenient...

On the other hand, storing only single key component per level simplifies things a lot!

So we actually want to have something like this:

data TOML = TOML
    { tomlPairs  :: HashMap Key AnyValue
    , tomlTables :: HashMap Text TOML  -- here 'Text' is only single component of 'Key'
    }

But with such scheme it's harder to implement pretty-printing for tables like this:

[foo.bar]
baz = 3

So we should analyze nested Tomls to not print table in this way:

[foo]
  [bar]  # we don't want this
    baz = 3

Also, during parsing we need to split every table key by component in insert them in this HashMap in some smart way. This is also undesirable.

Currently I'm having hard problems with these issues:

So let's implement our own data structure! Basically what we want is to have something like monomorphic for Key HashMap but which joins nodes by common prefix.

I think that data structure can look like this:

data PrefixTree a
    = Leaf (NonEmpty Text) a
    | Branch (NonEmpty Text) (PrefixTree a) (PrefixTree a)

type PrefixMap a = Maybe (PrefixTree a)

And we need to have the following basic functions and instances:

single :: Key -> a -> PrefixTree a
insert :: Key -> a -> PrefixTree a -> PrefixTree a
lookup :: Key -> PrefixTree a -> Maybe a
delete :: Key -> PrefixTree a -> Maybe (PrefixTree a)

instance Semigroup (PrefixTree a)  -- or PartialSemigroup
chshersh commented 6 years ago

Okay, so this structure doesn't work:

data PrefixTree a
    = Leaf (NonEmpty Text) a
    | Branch (NonEmpty Text) (PrefixTree a) (PrefixTree a)

Problems

There're multiple reasons for this:

  1. It doesn't account multiple entry points. For example, you can't store foo.bar and x.y in this structure since their common prefix is empty string. This can be workarounded by introducing fake empty string component to the beginning of each Key. But this doesn't really solve problem...
  2. Branch constructor has only two children while it actually can have multiple children. Let's say you have the following keys: a.b.x = 1 and a.b.z = 3. This will be stored in data structure like this:
Branch ("a" :| ["b"])
       (Leaf ("x" :| []) 1)
       (Leaf ("z" :| []) 3)

Now you want to insert a.b.y = 2 and you have problems... So Branch should contain multiple PrefixTrees.

Solutions

2. Multiple children

2 is simpler to fix. We need to modify structure like this:

data PrefixTree a
    = Leaf (NonEmpty Text) a
    | Branch (NonEmpty Text) (HashMap Text (PrefixTree a))

So instead of storing only two children we store HashMap from first component of resulting key to the rest PrefixTree containing this component. I personally don't like this structure because it has some redundancy (first component is stored twice). I will think on this more... The following keys a.x.y = 1 and a.b.c = 2 and a.1.2 = 15 will look like this in new data strucutre:

Branch ("a" :| [])
       { "x" → Leaf ("x" :| ["y"]) 1
       , "b" → Leaf ("b" :| ["c"]) 2
       , "1" → Leaf ("1" :| ["2"]) 15
       }

{} braces mean HashMap in this syntax. This structure allows to check whether we need to insert new Leaf in this HashMap or should we insert into existing PrefixTree.

1. Empty root

Regarding 1... Well, instead of having

type PrefixMap a = Maybe (PrefixTree a)

we just will have same HashMap for PrefixTree. So the resulting structures will look like this:

newtype Piece = Piece { unPiece :: Text }
newtype Key = Key { unKey :: NonEmpty Piece }
type Prefix = Key  -- not sure about this though...

data PrefixTree a
    = Leaf Key a
    | Branch Prefix (PrefixMap a)

type PrefixMap a = HashMap Piece (PrefixTree a)
chshersh commented 6 years ago

Okay, we have a problem. With this structure:

data PrefixTree a
    = Leaf Key a
    | Branch Prefix (PrefixMap a)

type PrefixMap a = HashMap Piece (PrefixTree a)

it's not possible to store both "a" and "a.b" at the same time which is a problem :slightly_frowning_face:

Need to think how to patch structure to not complicate things...

chshersh commented 6 years ago

After discussion with @vrom911 we decided to use the following structure:

data PrefixTree a
    = Leaf Key a
    | Branch Prefix  -- greatest common prefix
             (Maybe a)  -- value by key = prefix
             (PrefixMap a)  -- suffixes of prefix

type PrefixMap a = HashMap Piece (PrefixTree a)

This structure should allow us to solve all problems.