schveiguy / fastaq

FastA/Q parser based on iopipe
1 stars 1 forks source link

Range implementation? #4

Open biocyberman opened 6 years ago

biocyberman commented 6 years ago

Steve @schveiguy

I am thinking about implementing filter function that allows filtering fasta entries by a regex/literal string. This prompts a need to subset the output of tokenParser function. Therefore thought about convert fasta.q to a InputRange or RandomAccessRange.

nextToken inside tokenParser looks very much like front(), and release is like popFront. Additionally, when pos == 0, range's empty is true. So it seems things are there already. But I have some doubts:

schveiguy commented 6 years ago

You can absolutely build a range out of the token parser. This is how iopipe's delimRange works: https://github.com/schveiguy/iopipe/blob/f44dcc8c55b89da2f78926c867310883cc5be547/source/iopipe/textpipe.d#L542

release appears at several level BufRef, FastaToken and Result. So what should be done with those release?

So a BufRef (and all the things that contain them) is a reference into a buffer without knowing the buffer type itself. It's simply the equivalent of a,b in the expression buffer[a .. b]. The reason it's used is because I don't want to store references to the actual data, that is not important until you want to use it, and it can easily become invalid.

The reason the release function exists is when you release data from the actual buffer, those numbers a,b must change to accommodate the new buffer data.

To be able to filter, the range should be more discrete.

I don't see why. A range only cares about the current element. A simple range on top of the nextToken function is trivially written:

struct FastaRange(Chain)
{
   private Chain chain; // the source parser
   private FastaToken tok; // the buffer reference
   auto front() { return tok.value(chain.window); }
   bool empty() { return tok.endPos == 0; }
   void popFront()
   {
      chain.release(tok.endPos);
      tok = chain.nextToken;
   }
}

Note that front is going to return a concrete token, and that allocates (the array of fields). Really, I need to develop an abstraction on an item that contains BufRef to make it so the allocation isn't happening every time you get the value.

How do I access previous items before nextToken?

This requires a different mechanism than a range, because a range is only interested in one element.

biocyberman commented 6 years ago

Hi Steve

Just to be clear, I was not criticizing the design of iopipe or fasta.d.

Regarding the release functions. My confusion stemmed from the difference between iopipe's release and fastaq's release:

while(lines.window.strip.empty)
{
// As in code to skip lines:
lines.release(lines.window.length); // This shifts forward the starting point of reference buffer of window.length
lines.extend(0); 
}

Whereas in BufRef it retracts the ending point of reference buffer:

struct BufRef
{
  size_t pos;
  size_t length;
  auto value(B)(B buf)
  { ...  }

  void release(size_t elements)
  {
    pos -= elements; //This retracts the end point of ref buf by length `elements`
  }
}

Back to the range: I tested the FastRange and it failed unittest: https://github.com/bioslaD/fastaq/tree/wiprange

Normally I would go back and read more about range, and do some debugging to find out myself. But I've spent more than one hour checking around. It probably takes you much less than that to get through this hurdle. So I will really appreciate if you could take a look. I am struck.

Further more, this usage of the range requires rather extra typing for its users (if I am doing correctly):

auto tokenizer2 = input.tokenParser;
alias ChainType = typeof(tokenizer2);
auto r = FastaRange!ChainType(tokenizer2, tokenizer2.nextToken);

Anyway we can get this instead? auto fastaRange = input.tokenParser. This would mean to implement range primitives inside the Result struct, but will greatly simplify its usage.

schveiguy commented 6 years ago

Just to be clear, I was not criticizing the design of iopipe or fasta.d.

I want to be clear here, I have no ego when it comes to my code that is broken or inferior. Yes I want it to succeed, but I'm OK if it's sub-optimal and there's something better to replace it! You don't have to worry about criticizing my code, ever :)

In particular I'm not in love with how the fasta parser works, I think there are better more efficient ways to do it. It was a quick-and-dirty version that I threw together essentially in one day. Yes it works, but we can do better.

My confusion stemmed from the difference between iopipe's release and fastaq's release

Yes, it may be better to rename the release function in BufRef. In previous implementation of BufRef (namely in the json parser), I did this manually, and it's much more pleasant to have a function that takes care of it, especially when we have a type with lots of them inside (FastaToken).

My end goal with this pattern is to formally implement it in iopipe for parsing as some sort of utility -- both the ability to use buffer references, and the ability to update them.

Back to the range: I tested the FastRange and it failed unittest:

My guess is this part of your unittest:

   auto r = FastaRange!ChainType(tokenizer2, tokenizer2.nextToken);

Here, you are constructing a FastaRange with a tokenizer that has NOT YET parsed any tokens, and then telling that tokenizer to parse a token. The problem is that your first parameter makes a copy of tokenizer2 and puts it into the resulting range. This one has an empty window!

Then, when you go to get the window of the empty iopipe inside the FastaRange (the one that was copied), it fails. See my proposed function below for the solution. It's important not to use a token on a different tokenizer when getting the concrete value.

Further more, this usage of the range requires rather extra typing for its users

Right, almost all ranges (and iopipes for that matter) are accompanied with a range-producing function. In D, there is only one way to get type inference for template parameters, and it's with a function. If you want to use UFCS, you need a global function.

So what you need is a function that produces one of these ranges:

auto fastaRange(Chain)(Chain chain)
{
   auto tokenizer = chain.tokenParser;
   auto result = FastaRange!(typeof(tokenizer))(tokenizer);
   result.popFront(); // prime the range, this properly stores the token and advances the iopipe
   return result;
}

//Usage:

auto r = input.fastaRange;
biocyberman commented 6 years ago

I want to be clear here, I have no ego when it comes to my code that is broken or inferior. Yes I want it to succeed, but I'm OK if it's sub-optimal and there's something better to replace it! You don't have to worry about criticizing my code, ever :)

I like that we set the ground firmly and clearly because things might get misinterpreted via written communication.

Yes, it may be better to rename the release function in BufRef. In previous implementation of BufRef (namely in the json parser), I did this manually, and it's much more pleasant to have a function that takes care of it, especially when we have a type with lots of them inside (FastaToken).

I will go back to this in the future.

Here, you are constructing a FastaRange with a tokenizer that has NOT YET parsed any tokens, and then telling that tokenizer to parse a token. The problem is that your first parameter makes a copy of tokenizer2 and puts it into the resulting range. This one has an empty window!

Spot on! And the fastaRange function works as intended. I got the new unittest passed.