patrickhuber / Pliant

MIT License
26 stars 4 forks source link

Update Lexemes and ParseRunner to use Span<T> #96

Closed patrickhuber closed 3 years ago

patrickhuber commented 5 years ago

Currently Lexemes use StringBuilders under the covers to capture input. In cases where multiple Lexemes apply to a segment of text, the text is duplicated for each Lexeme. This causes multiple copies of the memory to exist even though only one may be used in the parse.

span acts as a pointer to memory and stores the locations in the source string instead of storing a copy.

Update the lexemes to use Span and remove any references to string builder. These may need to be ref Span in order to allow reallocation of the base string for growth and to allow the Span to increase its length as characters are recognized.

patrickhuber commented 5 years ago

On second thought, Span may not work because it is unable to live on the heap which is a requirement of a lexeme.

Memory may be an option, but there are cases where the underlying structure will need to grow which would require reallocation of the base string and render the reference of Memory invalid.

Instead something like an accumulator can be used, where the accumulator can return a readonly snapshot. The readonly snapshot would retain a pointer to a StringBuilder which can dynamically grow and avoids the problem of orphaned references.

public class Accumulator
{
     private StringBuilder _stringBuilder;
     public Accumulator(StringBuilder stringBuilder)
     {
          _stringBuilder = stringBuilder;
     }

     public void Accumulate(char c)
     {
         _stringBuilder.Append(c);
     }

     public AccumulatorSpan Slice(int start, int length)
     {
          return new AccumulatorSpan(_stringBuilder, start, length);
     }
}

public class AccumulatorSpan
{
    private StringBuilder _stringBuilder;
    private int _start;
    private int _length;

    public int Start
    {
        get { return _start; }
        set
        {
             if (value < 0)              
                 throw new IndexOutOfBoundsException("value can not be less than 0");

             if (value >= _stringBuilder.Length)             
                 throw new IndexOutOfBoundsException("value can not be greater than or equal to length");

             _start = value;
        }
    }

   public int Length
   {
       get { return _length; }
       set 
       {
            if (value > _start + _stringBuilder.Length)
                 throw new IndexOutOfBoundsException("length can not be longer than offset + builder length");
           if (value < 0)
                throw new IndexOutOfBoundsException("length can not be less than zero");
           _length = value;
       }
   }
}

In the above example, the ParseRunner would keep track of the accumulator and each Lexeme would track an AccumulatorSpan. The ParseRunner would add every character to the Accumulator and Lexemes would increment their length if they end up accepting input.

The existing Lexeme.Free methods could still be used to manage a pool of AccumulatorSpan objects or the same reference could be used and offset and length could be adjusted instead.

patrickhuber commented 4 years ago

It may also be possible to write a parse runner that loads the entire input into memory and segments that memory using System.Memory. This would have a different interface than IParseRunner.

Maybe IStreamableParseRunner vs IMemoryParseRunner

patrickhuber commented 3 years ago

Ended up writing a custom solution that works off of char index based objects like System.Text.StringBuilder and System.String. This does away with shifting around string builders and reduces the memory needed to handle multiple lexemes in parallel. Now there is only one copy of the input stream managed by the ParseRunner. When characters are read, they are added to the underlying StringBuidler. Every lexeme gets a pointer to this StringBuilder through an ICapture. ICapture never exposes the underlying data structure so the underlying can be expanded if IsReadOnly = false.

This also replaced many of the '.Value' issues in lexemes where they needed to be snapshotted into strings during parts in the parse in order to support debugging. Now the debugger will get a snapshot on demand which will not impact the internal state of the object. Consumers of the ICapture can easily pass its reference around without converting to a string. Any place where a string was required in performance critical code has been replaced with a reference to ICapture.