ssadler / hawk

Awk for Hoodlums
BSD 3-Clause "New" or "Revised" License
35 stars 2 forks

Typeclass for input #5

Closed gelisam closed 11 years ago

gelisam commented 11 years ago

Thanks for giving me commit access!

Since I won't have to send you pull requests, I plan to describe the features I want to implement as issues, so we can discuss them beforehand if needed. If my suggestions deviate too much from your vision, feel free to object!

Anyway, this particular issue is about revamping the smart render system so that in addition to allowing different types for output, we can also accept different types of input. Examples:

ByteString -> r? Apply the function to each line.
[ByteString] -> r? Apply the function to the entire input.
[(String, Int, Int)] -> r? Parse the input as tab-separated fields.
Aeson.Object -> r? Parse the entire input as JSON, then pass it to the function.
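One way to sketch this dispatch is a typeclass keyed on the input type. Everything below is a hypothetical sketch, not hawk's actual API: the class name `Parsable`, the method `parse`, and the fixed pair arity are all made up for illustration.

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.ByteString.Lazy.Char8 as B

-- Hypothetical sketch: one class decides how the raw input stream
-- is presented to the user's function, based on its argument type.
class Parsable a where
  parse :: B.ByteString -> a

-- [ByteString]: the whole input, one element per line.
instance Parsable [B.ByteString] where
  parse = B.lines

-- A list of pairs: each line split on tabs (arity fixed at two
-- here just for illustration; missing fields are padded).
instance Parsable [(B.ByteString, B.ByteString)] where
  parse = map (toPair . B.split '\t') . B.lines
    where
      toPair (a : b : _) = (a, b)
      toPair [a]         = (a, B.empty)
      toPair []          = (B.empty, B.empty)
```

For example, `parse (B.pack "1\t2\n3\t4") :: [(B.ByteString, B.ByteString)]` yields one pair per line.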

gelisam commented 11 years ago

An alternative choice for ByteString -> r would be to accept the entire input, without splitting it into lines. But I think modifying each line is going to be more common, so I would prefer to optimize for that use case.

Whichever choice we make, users can easily add unlines or fmap to get the other behaviour if needed.

The same choice exists for JSON, but I think the opposite use case is going to be more common: a JSON input spanning multiple lines. If users want to interpret the input as one JSON per line, they can write a function of type [Aeson.Object] -> r.

ssadler commented 11 years ago

Yep! I think this is an amazing idea. I actually considered it for detecting functions of type ByteString -> r but it didn't occur to me to take it any further. My favourite part:

> [(String, Int, Int)] -> r? Parse the input as tab-separated fields.

This * 100000. I considered trying to wrap ByteString's readInt to be a bit more useful, but it seems clear now that a type-declarative solution is the way to go.

I'm actually not a fan of trying to be too clever with the input though; currently the function input is a list of one item per line, and I'd like to keep it that way. The reason is that I believe the most common use cases are always going to be [a] -> [b] and [a] -> b, i.e., map / filter / reduce. If I write a function that expects a single input, it's more likely that I forgot to use map than that I actually wanted the whole file unsplit.

I also think that the input interface should be consistent, and ideally, composable; commonly a stream of data from Hadoop would be: key\tjson, so it would be nice to be able to turn that into (key,a,b). Perhaps lenses will help out here?

We could have a function with a similar API to the json function, i.e. using a witness which splits on tabs and produces a series of tuples (ie, a -> [b] -> [a]). Given that, if there's some nice way of making it work by introspecting the input type signature, then that could be provided as an additional luxury on top. What do you think?

gelisam commented 11 years ago

I agree that map, filter and reduce are going to be the most common use cases. I'm not suggesting to unsplit the file and give the raw input stream to the function; on the contrary, I'm suggesting to silently add this map the user forgot.

The only situation in which I suggest to unsplit the input is when the input type is Aeson.Object, indicating that the entire file is a single JSON object. I'm actually quite surprised that there are formats which stuff their entire JSON on a single line; that must be so horribly unreadable! Anyway, the input (String, Aeson.Object) would still be parsed as one pair per line, so both of our use cases would be covered. I'm not familiar with lenses, and I don't think I need them to implement the behaviour I have described.

Speaking of things I am not familiar with: I don't understand your json trick well enough to generalize it, so I plan to stick with typeclass magic for now.

ssadler commented 11 years ago

> Speaking of things I am not familiar with: I don't understand your json trick well enough to generalize it, so I plan to stick with typeclass magic for now.

It's not actually my trick... Basically, the witness can have any type that corresponds to a FromJSON instance. Then, Aeson does all the work when you call fromJSON.
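A rough sketch of that trick (the name `jsonAs` is invented here; only Aeson's `FromJSON` and `decode` are real): the witness argument is never evaluated, it only pins down which `FromJSON` instance Aeson uses.

```haskell
import Data.Aeson (FromJSON, decode)
import qualified Data.ByteString.Lazy.Char8 as B

-- The witness (first argument) is ignored at runtime; it only tells
-- the compiler which FromJSON instance to use for parsing.
jsonAs :: FromJSON a => a -> B.ByteString -> Maybe a
jsonAs _ = decode
```

So `jsonAs (undefined :: [Int]) (B.pack "[1,2,3]")` parses the input as a list of Ints, with Aeson doing all the work.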

I would have made the whole thing totally implicit, however, the numeric functions (sum, mod, log etc) tend to support numeric typeclasses so you have to specify a concrete type anyway.

ssadler commented 11 years ago

> I agree that map, filter and reduce are going to be the most common use cases. I'm not suggesting to unsplit the file and give the raw input stream to the function; on the contrary, I'm suggesting to silently add this 'map' the user forgot.

Ah, +1 then!

gelisam commented 11 years ago

> I would have made the whole thing totally implicit, however, the numeric functions (sum, mod, log etc) tend to support numeric typeclasses so you have to specify a concrete type anyway.

Have you considered default instances? Not that I've ever used them.

ssadler commented 11 years ago

Interesting. So this may work.

Prelude> default (Float)
Prelude> let a = sum [1.1,2,2]
Prelude> :t a
a :: Float
Prelude> default (Double)
Prelude> let a = sum [1.1,2,2]
Prelude> :t a
a :: Double

gelisam commented 11 years ago

I have pushed the above commit to my own repo rather than yours, because it breaks almost all the examples in the README. Since the input and output types can now be anything, GHC has a really hard time inferring the type of the user's code. I don't think this can be avoided...

> printf '1\t2' | hsl 'map swap :: [(Int, Int)] -> [(Int, Int)]'
2   1

Unless you can save us using some type witness magic?

ssadler commented 11 years ago

The witnesses are really simple, they just concretise the types in the eyes of the compiler. They have no logic or purpose at runtime at all. An easy way to use them:

witness :: a -> a -> a
witness _ v = v

Then in your expression:

map swap . witness [(i,i)]

Then the compiler knows exactly what it's supposed to be dealing with. Could remove them from json as such:

witness [(i,i)] . json ""

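Put together, here is a minimal self-contained demonstration of the witness pattern outside of hawk, with a concrete list standing in for stdin:

```haskell
import Data.Tuple (swap)

-- witness ignores its first argument; it exists only to force the
-- second argument to have the same type.
witness :: a -> a -> a
witness _ v = v

main :: IO ()
main = do
  let i = undefined :: Int
  -- The witness pins the element type to (Int, Int), which would
  -- otherwise be left open by the polymorphic literals.
  print (map swap (witness [(i, i)] [(1, 2), (3, 4)]))
```

Since the witness is never evaluated, `undefined` is a perfectly good value for it.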
ssadler commented 11 years ago

I'm running this and it doesn't work even for simple examples:

echo '1 a' | hsl length

/var/folders/9h/w42hpwwj5j3bqx8cymz3jx7w0000gn/T/tmpcjBHCe/Main.hs:33:8:
    No instance for (ParsableFromLine a0) arising from a use of `run'
    The type variable `a0' is ambiguous

Could you explain a bit about the design goal of the new classes in Types.hs?

Also, I don't really want to use Read. Reason being, sadly, it's a major performance killer... I don't know why, but it is. ByteString's readInt is nice and fast, though.

Edit: I should mention that before changing any of the typeclasses in Types.hs, I'd like for this to be implemented as a simple function in a separate module.

gelisam commented 11 years ago

You can give ParsableFromItem Int a concrete implementation if you want, like I did for String and ByteString. I didn't know Read was so slow; I still can't reason properly about Haskell's time or space performance.
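For instance, an Int instance built on readInt could look like the sketch below (the class and method names here are illustrative, not necessarily the exact ones in Types.hs; only `Data.ByteString.Char8.readInt` is real):

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical ParsableFromItem-style class, with an Int instance
-- using ByteString's fast readInt instead of Read.
class ParsableFromItem a where
  parseItem :: B.ByteString -> a

instance ParsableFromItem Int where
  parseItem s = case B.readInt s of
    Just (n, _) -> n
    Nothing     -> error ("parseItem: not an Int: " ++ B.unpack s)
```

readInt returns the parsed Int together with the unconsumed rest of the string, which is also convenient when consuming tab-separated fields one at a time.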

gelisam commented 11 years ago

Yes, this is exactly the problem I was referring to: my change is no good, because even the simplest expression fails type inference now. The reason is that length has the polymorphic type [a] -> Int, but the compiler cannot infer what the type of a should be.

When you only had the Renderable constraint, the compiler was unifying [a] -> Int with the context Renderable r => [ByteString] -> r, for which GHC could easily conclude that a = ByteString and r = Int.

The new design goal is to allow the user's function to have type [(String, Int, Int)] -> Int, or [(Int, Int)] -> String, or whatever tuple type the user wants to manipulate. To do this, we need to add a second constraint, Parsable a, with instances instance Parsable (String, Int, Int), instance Parsable (Int, Int), and so on. But now, when the user writes a polymorphic function of type [a] -> Int, GHC needs to unify that type with (Parsable a, Renderable r) => [a] -> r, and GHC has no way to know whether a = (String, Int, Int) or a = (Int, Int). Of course, for the function length the result will be the same regardless of our choice, but Haskell still needs a concrete type in order to proceed.
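Here is a minimal reproduction of that ambiguity outside hawk (all names illustrative): with two Parsable instances in scope, `run length` alone fails to typecheck, and an annotation on the function is what resolves it.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Two Parsable instances, mirroring the situation described above.
class Parsable a where
  parseLine :: String -> a

instance Parsable (Int, Int) where
  parseLine s = case map read (words s) of
    [a, b] -> (a, b)
    _      -> error "expected two fields"

instance Parsable (Int, Int, Int) where
  parseLine s = case map read (words s) of
    [a, b, c] -> (a, b, c)
    _         -> error "expected three fields"

run :: Parsable a => ([a] -> r) -> [String] -> r
run f = f . map parseLine

main :: IO ()
main = do
  -- `run length ["1 2", "3 4"]` alone is rejected: `a` is ambiguous.
  -- Annotating the function picks the (Int, Int) instance:
  print (run (length :: [(Int, Int)] -> Int) ["1 2", "3 4"])
```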

To alleviate the problem, I overrode GHC's type inference (this is simpler than it sounds...) to assume ByteString whenever it can, and that helped a lot. I plan to improve the inference further, by forcing [a] for Functor a, forcing Int for Num a, and by removing -fdefer-type-errors and run during type-inference.

But not today! I have already spent way too much time on this project this weekend. It was fun, though, thanks for starting this interesting project!

gelisam commented 11 years ago

> Edit: I should mention that before changing any of the typeclasses in Types.hs, I'd like for this to be implemented as a simple function in a separate module.

I don't see how that would be possible. The whole point is for the behaviour to be different depending on the type, and for this, you need typeclasses!

ssadler commented 11 years ago

See my branch feature/tabular. You're right, it requires changes in the types, but no additions. Although it's still using the same API, i.e. explicit.

gelisam commented 11 years ago

Since type-inference magic will need to be reimplemented anyway, I think we should close this issue and replace it with a family of smaller tasks.