nathanmarz / cascalog

Data processing on Hadoop without the hassle.

Idea: replace records by hashmaps #265

Closed methylene closed 9 years ago

methylene commented 9 years ago

This is just an idea: how about we get rid of all the records in cascalog-core, like TailStruct, and replace them with plain maps, just like the ClojureScript compiler or clojure.tools.analyzer.jvm do?

Currently, defnode defines a record. Is this really necessary?

Records are not read-print-symmetric, which makes them harder to work with in the REPL.

I don't know what the implications for jcascalog would be.
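
To make the read-print point concrete, here is a minimal REPL sketch (generic Clojure, not cascalog-specific): plain maps of readable values round-trip through the printer, while anything that prints as #<...> does not.

;; A plain map of readable values survives a print/read round trip:
(def m {:optype :projection, :fields ["?x" "?y"]})
(= m (read-string (pr-str m)))
;; => true

;; An object with no readable print form, e.g. a function, does not;
;; on Clojure 1.6 it prints as #<...>, which the reader rejects:
(pr-str even?)
;; => "#<core$even_QMARK_ clojure.core$even_QMARK_@...>"
(read-string (pr-str even?))
;; => RuntimeException Unreadable form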

methylene commented 9 years ago

For example, in parse.clj there is the following snippet in a comment:

(v/with-logic-vars
  (parse-subquery [?x ?y ?z]
                  [[[[1 2 3]] ?x]
                   [* ?x ?x :> ?y]
                   [* ?x ?y :> ?z]]))

Now, as an outsider to this project, or as someone evaluating it, I want to inspect that. So I evaluate it in the REPL and get the following:

#cascalog.logic.parse.TailStruct{:identifier "847a003c-e586-4b84-910a-e10205c231eb", :node #cascalog.logic.parse.Projection{:identifier "af0c405d-75d5-4ef0-8852-d9db1eea6268", :source #cascalog.logic.parse.Application{:identifier "d10af19e-89e5-4d36-9127-3c15b461e9be", :source #cascalog.logic.parse.Application{:identifier "64b5da64-3ade-4411-961c-eab948dd260e", :source #cascalog.logic.predicate.Generator{:identifier "2f3675ff-abf2-4f67-a89a-63c432475bb3", :gen #cascalog.cascading.types.ClojureFlow{:source-map {"0c9a432b-eefb-4f7a-a7d0-438f3e369a70" #<MemorySourceTap MemorySourceTap["MemorySourceScheme[[UNKNOWN]->[ALL]]"]["/c8065b0c-3e52-4a55-9448-56ddf58fc49f"]>}, :sink-map nil, :trap-map {}, :tails nil, :pipe #<Each Each(74e2847f-c153-4d78-9b00-1d596225b555)[ClojureFilter[decl:ALL]]>, :name nil}, :fields ["?x"]}, :operation #cascalog.logic.predicate.Operation{:op #<clojure.lang.AFunction$1@35e20aca>, :input ("?x" "?x"), :output ["?y"]}}, :operation #cascalog.logic.predicate.Operation{:op #<clojure.lang.AFunction$1@b326f69>, :input ("?x" "?y"), :output ["?z"]}}, :fields ["?x" "?y" "?z"]}, :ground? true, :available-fields ["?x" "?y" "?z"], :operations ()}

This "parsed form" is apparently not very useful. I can not modify it and read it back in the repl, at least not with clojure 1.6.0. There are also two AFunctions which can certainly not be read back. Is it possible to modify parse-subquery, so that its output is more accessible, to make it easier to manipulate in the repl and verify if it does the right thing?

methylene commented 9 years ago

Also, how do I pretty-print the output of parse-subquery?

sritchie commented 9 years ago

I think a better idea would be to override the print methods - I THOUGHT I had done this for the record types, so that when you printed out the subquery everything looked great. Have you tried calling prn on the tail struct?
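
For reference, overriding print-method is ordinary Clojure multimethod machinery. A hedged sketch, using a made-up stand-in record rather than cascalog's actual types:

;; Stand-in for a cascalog node type; NOT the real cascalog.logic.parse record.
(defrecord Projection [identifier source fields])

;; Print only the interesting fields, hiding identifiers and raw Cascading objects:
(defmethod print-method Projection [node ^java.io.Writer w]
  (.write w (str "#Projection" (pr-str (select-keys node [:fields])))))

(prn (->Projection "some-uuid" nil ["?x" "?y"]))
;; => #Projection{:fields ["?x" "?y"]}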

sritchie commented 9 years ago

Also, I was probably less clear than I should have been - the parsed representation converts the query into the DAG that a platform would accept. So it's probably not going to be that useful to look at anyway. What are you expecting parse-subquery to return?

sritchie commented 9 years ago

The fact is that if you allow anonymous functions, you can't keep around the property that a query should have a valid string serialization. Functions (both anonymous and var bound) aren't read/print symmetric either, nor are locally bound dynamic variables.

Defining the nodes as records gives them a type that the user can use when defining new platforms. We could make this change to copy ClojureScript, but I'm not seeing any wins, unless this is coming from someone who wants to take over maintenance of the codebase and finds working with maps easier.

Sounds like the problems here could be fixed by defining proper print functions for the records, so that they print more nicely at the REPL, yeah?

methylene commented 9 years ago

@sritchie I would like parse-subquery to return a map instead of a record, because maps are "easier" and should be equivalent if we put the current record name in an :optype field and dispatch on that.

Maps don't suffer from problems like CLJ-1457. Records are probably faster, but that should not be an issue in the "compiler" part of cascalog.

I realize that records are used a lot throughout the code.
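
A hedged sketch of the kind of :optype dispatch being proposed (compile-node and the keyword values are made up for illustration; they are not cascalog names):

;; Instead of a record type per node plus protocol dispatch on that type,
;; each node is a plain map tagged with :optype, and a platform dispatches
;; with an ordinary multimethod:
(defmulti compile-node :optype)

(defmethod compile-node :projection [{:keys [source fields]}]
  {:op :project, :source (compile-node source), :fields fields})

(defmethod compile-node :generator [{:keys [gen fields]}]
  {:op :generate, :gen gen, :fields fields})

(compile-node {:optype :projection
               :fields ["?x"]
               :source {:optype :generator, :gen [[1] [2] [3]], :fields ["?x"]}})
;; => {:op :project, :source {:op :generate, ...}, :fields ["?x"]}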

sritchie commented 9 years ago

I bet we could pretty easily make a function that recursively transforms those records into maps, maybe with a :type field assoc'd on. Would this help?
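
A hedged sketch of such a helper (the name records->maps is made up; record? and Clojure 1.6's record-aware clojure.walk do the heavy lifting):

(require '[clojure.walk :as walk])

(defn records->maps
  "Recursively turn any record instances in x into plain maps,
   tagging each one with its record class under :type."
  [x]
  (walk/postwalk
   (fn [form]
     (if (record? form)
       (assoc (into {} form) :type (symbol (.getName (class form))))
       form))
   x))

Then (clojure.pprint/pprint (records->maps parsed)) would at least give a readable tree, modulo the embedded function objects and Cascading pipes, which still won't read back.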

methylene commented 9 years ago

@sritchie That sounds like a good idea.

Yes, anonymous functions would not be possible. But you can just factor them out into named functions, which is what I've probably always done anyway. In my experience, it's not advisable to write a lot of code directly in the predicates.

sritchie commented 9 years ago

When you're working at a REPL, you have to pass functions around anonymously, since they're not defined in any source file - also, if you write a function that returns another function (partial, etc.) you can't really bind it ahead of time if you're using command-line args. Anonymous functions are definitely important to support, though I agree that, as with any tool, usage can get out of hand :)

methylene commented 9 years ago

Can you give an example of an anonymous function in a predicate? I don't think I've seen that. For example, the following works

(??<- [?x]
      ([[1] [2] [3]] ?x)
      (even? ?x))
;; => ([2])

but the filter even? apparently cannot be expressed as an anonymous function:

(??<- [?x]
      ([[1] [2] [3]] ?x)
      ((fn [x] (even? x)) ?x))
;; => FlowException

sritchie commented 9 years ago

Yeah, you need to use a special serializable Fn. Use cascalog's "mapfn" in place of fn.

methylene commented 9 years ago

Yes this works. Amazing, thanks!

(??<- [?x]
      ([[1] [2] [3]] ?x)
      ((cascalog.api/filterfn [x] (even? x)) ?x))
;; => ([2])

On a different note: I've noticed that parse-subquery is not a macro, which is why every predicate is converted to a vector in the <- macro:

(parse-subquery ~outvars [~@(map vec predicates)])

That way, the symbols in the predicates get resolved and evaluated before they get passed to parse-subquery. Do you think it would also be possible to write parse-subquery as a macro, so we can do a portion of the work on the symbolic, unevaluated form of the predicates?

sritchie commented 9 years ago

I guess you could wrap it, like <- does; I'm not really sure what you mean by working with the predicates. The predicates are just vectors, of course, and you can think of all work before we submit to the cluster as "query optimization" time - so I'm not really sure what you gain by doing work at the actual Clojure compile time. 

methylene commented 9 years ago

Maybe if query parsing were a macro, it would make things more predictable for people who are not yet familiar with cascalog. Of course, the semantics would be subtly changed. A macro would / should probably not be able to tell the difference between a mapfn, a mapcatfn and a parallelagg without passing additional info, maybe via new keyword 'operators'. So the query language (what's inside the (<- ...)) would probably have to become more explicit and verbose. On the other hand, weird stuff like having to use filterfn instead of fn (see above) would probably not be necessary.

sritchie commented 9 years ago

That's not really true - if you can write a macro that somehow can tell how I mean to use an operation, by all means I'd love to see an example.

The reason that a macro won't help is that a macro only has access to the code forms that you pass to it. So if I write something like

(let [square (fn [x] (* x x))]
  (<- [?x ?y]
    (data ?x)
    (square ?x :> ?y)))

There's no way for the <- macro to tell what square is at macro-expansion time. Is it a var? Is it a symbol? What's it bound to? Is it an anonymous function?

Macros delay evaluation and give you the ability to dissect the passed-in code as a data structure.

What you COULD do is change <- to recognize the fn form within its boundaries and substitute the serializable version. I think that would make things MORE confusing for new users, not less, because we start to have more magic rules (inline anonymous function definitions just work, but now if you take them as arguments to a function that builds a subquery suddenly things fail, just because you refactored!)
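
To make the trade-off concrete, a hedged sketch of that substitution (the <-* name is made up, cascalog.api is assumed to be on the classpath, and it arbitrarily picks filterfn rather than mapfn, which is the kind of ambiguity discussed further below):

(require '[clojure.walk :as walk]
         '[cascalog.api :as api])

;; Walk the predicate forms at macro-expansion time and swap literal
;; (fn ...) forms for a serializable variant. Only textually inline fns are
;; caught; anything passed in via a local or a var slips through, which is
;; the "magic rule" problem described above.
(defmacro <-* [outvars & predicates]
  (let [swap (fn [form]
               (if (and (seq? form) (= 'fn (first form)))
                 (cons 'cascalog.api/filterfn (rest form))
                 form))]
    `(api/<- ~outvars ~@(walk/postwalk swap predicates))))

With this, the FlowException example above would accept a plain (fn ...) literal, but only when the fn appears textually inside <-*.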

Reading your post again, I'm seeing that we agree that a macro wouldn't be able to tell the difference between the operators. I don't understand what a keyword operator is - can you clarify?

I'd love to see an example of how we can somehow detect and wrap the default non-serializable function and turn it into a serializable function, and how you imagine this new syntax working.

methylene commented 9 years ago

I think we can all agree that magic rules are non goals.

Keyword operators are, for example, :> or :<<. Yes, the macro would only see symbols. The first element of a predicate vector (or predicate list, since it would then not be necessary to convert the predicates to vectors first) can be a symbol, a list (a function expression), or a literal like a keyword. If it's a symbol, assume that it will resolve to a function by the time we're done. That should be all we need to know. We don't know yet whether it's an aggregator, a filter, a map or a mapcat, but in general the distinction can usually be inferred by looking at its arguments and the selection vector. In the remaining cases, where there is ambiguity, we could resolve it by requiring an additional keyword operator, which should of course be avoided wherever it is not strictly necessary.

The beauty is that we would not even need to make a distinction between the symbol and list cases. We know these are both just ways to write a function. Leave it to a later phase of the compilation (build-rule) to evaluate these.
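
A hedged sketch of that classification on unevaluated forms (the names are made up; this is not cascalog code):

(defn classify-head
  "Look at the first element of an unevaluated predicate form."
  [pred]
  (let [head (first pred)]
    (cond
      (keyword? head) :keyword-operator   ;; e.g. :> or :<<
      (seq? head)     :fn-expression      ;; e.g. ((fn [x] ...) ?x)
      (symbol? head)  :op-symbol          ;; assume it resolves to a fn later
      :else           :literal)))         ;; e.g. a generator like [[1] [2] [3]]

(map classify-head '[(even? ?x)
                     ((fn [x] (even? x)) ?x)
                     ([[1] [2] [3]] ?x)])
;; => (:op-symbol :fn-expression :literal)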

methylene commented 9 years ago

Please note these are just ideas, and I may be overlooking something. I'm not clear about the serialization issue, for instance. I've only read parts of cascalog-core so far.

methylene commented 9 years ago

I think there may be a misunderstanding: I'm not talking about the <- macro but about parse-subquery, which is a fn. <- is already a macro, but it's not doing much; basically it just wraps the call in v/with-logic-vars.

sritchie commented 9 years ago

re: the last comment, yes, of course - I was talking about a stronger version of <-. At the bottom you need to get to functions, so turning parse-subquery into a macro is equivalent to having <- do more work.

sritchie commented 9 years ago

Sorry, I'm still not seeing what this gains us. For example, you can't EVER distinguish between map and mapcat, since we don't have a type system that can give us more information. You have to hint it. I'm not seeing what power you gain by cluttering the DSL.

Also, "resolve to a function" isn't good enough for serialization, since you actually need access to the function body to make fn serialization work. That's why we have to use the special fn macro that mapfn, filterfn etc wraps.

Maybe you could show me an example of this new DSL you're imagining, along with a case where this macro version of parse-subquery gives more power, or resolves some complexities in the current implementation? I'm having a hard time understanding what we might gain - also, I don't see how using a macro for this gives us any advantage over looking at a function.

If it's a function, there's no need to "assume" resolution - you can just inspect the arguments.

methylene commented 9 years ago

Patience, I'm working on it. The map / mapcat distinction is tricky indeed. As for cluttering the DSL: I'd rather make the (<- ...) part self-contained, even if it means I have to write a bit more "clutter", than have to look at a bunch of functions that are defined outside the (<- ...) and maybe not even in the same namespace.

methylene commented 9 years ago

*rather than having to look at them to see whether they are defined as a mapfn or a mapcatfn

methylene commented 9 years ago

As the implementor, you probably don't gain much from doing more work in a macro. After all, not being able to resolve / evaluate things means you have less information about the code. But limiting the scope of what you can do in this way may be a good thing after all: it may make it easier for the caller to reason about what it does.

methylene commented 9 years ago

Sorry. After playing with it for a while, I've realized that doing the parsing at macro-expansion time is probably not a very good idea anyway. Otherwise the following could not be analyzed:

(let [x '[?x ?y]]
  (<- [?y]
      ([[1 2]] :>> x)))

because if <- were the macro doing the parsing, we would not know what x resolves to.

sritchie commented 9 years ago

@methylene yeah, it's a tricky problem. As I mentioned above, I've always thought that MapReduce applications are special in that you get access to a "second compile time" before you hit the cluster, but in Clojure land that happens without having to jump to the macro level.

With that in mind, the goal was to stick with immutable data structures as long as possible, so that you can do all your work with straight-up functions and only compile down to MapReduce at the last minute.

Thanks for brainstorming this through with me! Always happy to think through other stuff too.