nathanmarz / cascalog

Data processing on Hadoop without the hassle.

??- returns only the last tuple of a sequence #294

Open ghost opened 8 years ago

ghost commented 8 years ago

The following input on cascalog.playground:

```clojure
(??-
   (<- [?p ?age]
       (age ?p ?age)))
```

returns

```
[["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36] ["luanne" 36]]
```

However, running

```clojure
(?- (stdout)
   (<- [?p ?age]
       (p/age ?p ?age)))
```

gives the correct result (10 unique names and ages).

sritchie commented 8 years ago

What cascading version and cascalog versions are you using? This reminds me of an iterator bug we fixed a while ago.

ghost commented 8 years ago

I'm using cascalog 2.1.1.

I haven't explicitly declared anything wrt cascading; I've just been following the project's readme to get started. Relevant portion of project.clj below:

```clojure
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [cascalog "2.1.1"]]
  :profiles { :dev {:dependencies [[org.apache.hadoop/hadoop-core "1.2.1"]]}}
  :jvm-opts ["-Xms768m" "-Xmx768m"])
```

sritchie commented 8 years ago

Yeah, this is fixed in 3.0.0-SNAPSHOT, which I think is the latest version off of master. Want to give that a shot? We're due for a new release for sure.

ghost commented 8 years ago

Same issue occurs with these dependencies:

```clojure
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [cascalog/cascalog-core "3.0.0-SNAPSHOT"]]
```

Is there a working project.clj I could take a look at? Once this gets resolved I'm guessing it will come down to a documentation issue, and I'm happy to submit a pull request for that. I also had some initial frustrations because the documentation doesn't mention needing to run (bootstrap-emacs) in cider, so that should probably be fixed as well.

sritchie commented 8 years ago

For some reason my internet connection's preventing me from launching a repl (by blocking dependency downloads in leiningen), but I THINK, based on a different bug, I have a guess about what's causing this. Can you give this branch a try?

https://github.com/nathanmarz/cascalog/pull/295

Check out the discussion here: https://github.com/nathanmarz/cascalog/issues/251

Along with this fix: https://github.com/nathanmarz/cascalog/pull/280

for some more background on the issue. Also, any updates on documentation you want to send over would be huge.

ghost commented 8 years ago

Trying that branch now, trying to build it and put in the local repo, but running into the issue that the sub-modules (cascalog-checkpoint, midje, etc) depend on cascalog-core, so I'm not able to compile them initially. I don't usually structure projects like this--how do you compile this structure?

sritchie commented 8 years ago

Ah, sorry- first, run "lein sub install" in the base directory. Thanks for trying this out!

ghost commented 8 years ago

OK, that works for compilation/local repo installation. Unfortunately the bug still persists. If it's helpful, the log output in the repl says that Cascading 2.5.3 is being used currently.

Thanks for the help so far! Have a project I'm transitioning over to hadoop as it's grown a lot, and I'd really like to go with cascalog on it, so hopefully can sort this out.

sritchie commented 8 years ago

This looks very related to #292. The folks over at that ticket figured out that this issue only shows up with Clojure 1.7.0.

ghost commented 8 years ago

OK, I'll try going back to 1.6, thanks!

ghost commented 8 years ago

That fixed it. I'm going to submit a pull request for docs that are a bit more current in a bit.
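
For anyone landing here with the same symptom, the workaround amounts to pinning Clojure below 1.7 in project.clj. The following is only a sketch based on the snippet posted earlier in this thread: the project name and version are hypothetical, and "1.6.0" is an assumed patch release, since the thread only says "1.6".

```clojure
;; Sketch of a project.clj with Clojure pinned to 1.6.
;; `cascalog-demo` and "0.1.0-SNAPSHOT" are placeholders, not from the thread.
(defproject cascalog-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]   ; assumed 1.6 patch version
                 [cascalog "2.1.1"]]
  :profiles {:dev {:dependencies [[org.apache.hadoop/hadoop-core "1.2.1"]]}}
  :jvm-opts ["-Xms768m" "-Xmx768m"])
```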

metasoarous commented 8 years ago

This just bit me as well. I can confirm that switching to 1.6 fixes the issue, but it would be nice to have a 1.7-compatible fix.

sritchie commented 8 years ago

@metasoarous totally hear you. I'm happy to review any pull requests from folks who want to take this on! I'm not using Cascalog for my work these days, so I don't have time to fix bugs like this myself, but I am available on a consulting basis to fix bugs or add features.

metasoarous commented 8 years ago

Hi @sritchie: I appreciate the offer. Right now, 1.7 isn't critical for us, but if it becomes necessary we'll keep that in mind. I mostly just wanted to add a second data point for posterity's sake :-)

jiyouyou125 commented 8 years ago

http://dev.clojure.org/jira/browse/CLJ-1738

The 1.7 compatibility notes on the iterator-seq change might help here.

Direction of this ticket changed at Rich's request.

Prior description captured here:

Clojure code that uses iterator-seq to wrap Java iterators that return the same mutable object on every call is broken by the chunked iterator-seq changes from CLJ-1669.

Some examples where this occurs:

- Hadoop ReduceContextImpl$ValueIterator
- Mahout DenseVector$AllIterator/NonDefaultIterator
- LensKit FastIterators

Cause: In 1.6, the iterator-seq wrapper could be used with these to consume a sequence over these iterators element-by-element. In 1.7 RC1, iterator-seq produces a chunked sequence. Because next() is called 32 times on the iterator before the first value can be retrieved from the seq, and the same mutable object is returned every time, code doing this now receives different (incorrect) results.

Approach: Switch iterator-seq back to non-chunked and change eduction to use the chunking iterator-seq strategy as that was the original target. Retain the use of the chunked iterator seq in sequence over the TransformerIterator.
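
The cause described above can be reproduced without Hadoop or Cascalog. The sketch below is an illustration only: `reusing-iterator`, `read-ahead`, and `one-at-a-time` are hypothetical helpers, the sample rows are made up, and `read-ahead` merely imitates what a chunked seq does rather than calling `iterator-seq` itself, so the result does not depend on the Clojure version.

```clojure
(import '(java.util ArrayList Iterator))

(defn reusing-iterator
  "Hypothetical stand-in for iterators like Hadoop's ValueIterator:
  every call to next returns the SAME ArrayList, refilled in place."
  [rows]
  (let [remaining (atom (seq rows))
        shared    (ArrayList.)]
    (reify Iterator
      (hasNext [_] (boolean (seq @remaining)))
      (next [_]
        (let [row (first @remaining)]
          (swap! remaining rest)
          (.clear shared)
          (.addAll shared row)
          shared)))))

(defn read-ahead
  "Pulls up to n elements off the iterator before looking at any of them,
  roughly what a chunked seq does, and only then copies them."
  [^Iterator it n]
  (loop [buf [] i 0]
    (if (and (< i n) (.hasNext it))
      (recur (conj buf (.next it)) (inc i))
      (mapv vec buf))))

(defn one-at-a-time
  "Copies each element before asking the iterator for the next one."
  [^Iterator it]
  (loop [out []]
    (if (.hasNext it)
      (recur (conj out (vec (.next it))))
      out)))

;; Illustrative sample rows, not the full playground dataset.
(def rows [["alice" 28] ["bob" 33] ["luanne" 36]])

(read-ahead (reusing-iterator rows) 32)
;; => [["luanne" 36] ["luanne" 36] ["luanne" 36]]

(one-at-a-time (reusing-iterator rows))
;; => [["alice" 28] ["bob" 33] ["luanne" 36]]
```

The read-ahead result has the same shape as the output reported at the top of this issue: every slot holds the last tuple, because all buffered references point at one object.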

jiyouyou125 commented 8 years ago

Only ??- and ??<- use iterator-seq.

sritchie commented 8 years ago

@nightlord this is really interesting, and probably the reason for the bug. Looks like a change like this may work:

```clojure
(defn iter-seq [iter f]
  (if (.hasNext iter)
    (lazy-seq
      (cons (f (.next iter))
            (iter-seq iter f)))))
```
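
A hedged sketch of how an unchunked wrapper like this behaves, assuming the underlying iterator reuses one mutable object. The `reusing-iterator` helper and the sample rows are hypothetical, and the `iter-seq` here is a slight variant (type-hinted, with `lazy-seq` on the outside), not necessarily what ended up in Cascalog. The point it illustrates is that `f` has to copy the element: the copy then happens before the iterator is advanced again.

```clojure
(import '(java.util ArrayList Iterator))

(defn iter-seq
  "Unchunked lazy seq over a Java iterator; f is applied to each element
  as soon as it is pulled, before the iterator advances."
  [^Iterator iter f]
  (lazy-seq
    (when (.hasNext iter)
      (cons (f (.next iter))
            (iter-seq iter f)))))

(defn reusing-iterator
  "Hypothetical iterator that hands back the same ArrayList on every call."
  [rows]
  (let [remaining (atom (seq rows))
        shared    (ArrayList.)]
    (reify Iterator
      (hasNext [_] (boolean (seq @remaining)))
      (next [_]
        (let [row (first @remaining)]
          (swap! remaining rest)
          (.clear shared)
          (.addAll shared row)
          shared)))))

;; Illustrative sample rows.
(def rows [["alice" 28] ["bob" 33] ["luanne" 36]])

;; f copies each tuple into an immutable vector: correct result.
(vec (iter-seq (reusing-iterator rows) vec))
;; => [["alice" 28] ["bob" 33] ["luanne" 36]]

;; f keeps the shared object: once fully realized, every element is the last tuple.
(mapv vec (doall (iter-seq (reusing-iterator rows) identity)))
;; => [["luanne" 36] ["luanne" 36] ["luanne" 36]]
```
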
jiyouyou125 commented 8 years ago

@sritchie that fixes ??-. It may not be the best solution, but this is definitely the problem. https://github.com/nathanmarz/cascalog/pull/296

jiyouyou125 commented 8 years ago

@sritchie I fixed ??- and the CI build problem, and added profiles for 1.6 and 1.7.

The build succeeds.