nathanmarz / cascalog

Data processing on Hadoop without the hassle.
Other
1.38k stars 178 forks source link

Lines ending with a delimiter confuse hfs-delimited #228

Open methylene opened 10 years ago

methylene commented 10 years ago

I've noticed I get an exception when I use something like

((hfs-delimited "hfds:///a.txt") ?x ?y) 

as a generator, and a.txt contains some lines where the second token is empty, so that the line ends in \t\n. In this case, I get a "operation added wrong number of fields" exception.

I'm not sure whether it's a cascading or cascalog problem.

The cascalog version is 2.0.0, the hadoop version is 1.1.2.

methylene commented 10 years ago

Acually the lines I used had more tokens than 2. I hope it can be reproduced anyway.

methylene commented 10 years ago

Also, is there a way to disable quoting entirely on an hfs-delimited source tap? At the moment I just use some unlikely quote character like :quote "\u0003" but still have to strip that character from the input, along with delim chars

sritchie commented 10 years ago

Not sure on this one - you might check out the underlying implementation in Cascading? This would be a nice feature to add.

Quantisan commented 10 years ago

@methylene you can pass in :strict? false option into hfs-delimited and your missing field at the end would just be null. Regarding quote, I don't think you can disable it.

I'm going to close this issue for now. Feel free to re-open or msg me if this doesn't work for you.

methylene commented 10 years ago

Hi Paul, thanks for the tip, that's definitly helpful. But I think it's only a workaround. Maybe there is a misunderstanding? In my input data there is no "missing" field, but an empty one at the end of some line, right before the line break. In particular, there is no line in my input that doesn't have the correct number of delimiters. Using :strict? false will resolve the error, but it also disables a reasonable sanity check on each line. So yes I think this should be reopened.

Quantisan commented 10 years ago

@methylene delimited expects an ISO legit delimited file. Your best bet is to just read it in with hfs-textline then parse it manually. Does that answer your question?

methylene commented 10 years ago

Please explain, how a csv line with the correct number of tokens, but the last token empty, violates an ISO standard?

methylene commented 10 years ago

You mean, adding quote chars would make the problem go away?

methylene commented 10 years ago

So yes, I have yet to try

"a"\"t"b\t"" instead of a\tb\t

as input line, but I'm afraid that's not what you get as an output when using hfs-delimited or hfs-textline as a sink

methylene commented 10 years ago

When using hfs-delimited or hfs-textline as a sink tap, empty strings are encoded just the same way as nulll values, that can also be a problem

Quantisan commented 10 years ago

Is this test case corresponding to your problem? https://github.com/Quantisan/cascalog/commit/cac0119730b14035162a6e624f7edd4ad73353dc

methylene commented 10 years ago

I think there's something wrong with that test case. Yes it looks as if it confirms the issue. But the following fails, too:

(deftest delimited-empty-notlast-test
  (fact
    (io/with-fs-tmp [_ tmp]
      (?- (hfs-textline tmp) ;; write line
          [["Proin\t\thendrerit"]])
      (fact "Test hfs-delimited where last element is empty"
        (<- [?a ?b !c]
            ((hfs-delimited tmp :delimiter "\t") ?a ?b !c)) =>
        (produces [["Proin" nil "hendrerit"]])))))
methylene commented 10 years ago

Sorry, it should have been !b, not ?b. The test describes the problem well. The following test succeeds, but fails without the (remove (comp empty? last)) line:

(deftest delimited-empty-token-test
  (let [lines (->> [["0" "1" "2"] [""  "1" "2"]
                    ["0" ""  "2"] [""  ""  "2"]
                    ["0" "1" "" ] [""  "1" "" ]
                    ["0" ""  "" ] [""  ""  "" ]]
                   (remove (comp empty? last))
                   (vec))]
    (io/with-fs-tmp [_ tmp]
      (?- (hfs-textline tmp) ;; write line
          lines) 
      (fact (str "Test with nil at various positions") 
            (<- [!a !b !c]
                ((hfs-delimited tmp :delimiter "\t") !a !b !c)) =>
                (produces (mapv (partial mapv not-empty) lines))))))
methylene commented 10 years ago

By the way, I haven't used midje before, but I wonder why all tests in cascalog.more-taps-test have two (fact) forms, the outer one having only one argument and no "=>" symbol?

pepijndevos commented 10 years ago

I have this problem as well. My file uses commas and quotes, so stuff like

"col","another",""

I don't get any error, just no result at all. It's also not trapped.

This is the code I use to parse it

(defn log-source [path fields]
  (let [cfields (vec (map (partial str "?") fields))
        collect (mapfn [& values]
                  [(zipmap fields values)])
        log (hfs-delimited path
                   :outfields cfields
                   :delimiter ","
                   :compress? true
                   :strict? false)]
    (prn cfields)
    (<- [?map]
        (:trap (stdout))
        (:distinct false)
        (log :>> cfields)
        (collect :<< cfields :> ?map)
      )))

So if I understand correctly, ther is a testcase, but no fix?

I'm not sure what's going on here. It seems to matter which of these two lines I pick.

    (log :> ?field ?another _) ; works
    (log :> ?field ?another ?broken)
    (log :> ?field ?another !works) ; AHA!
sritchie commented 10 years ago

The cascading guys run their tests against Cascalog, so you should be able to include it in your project.clj and override the cascalog version. If it works, I'd love a pull req to bump the default cascading version.

Thanks, Sam

Pepijn de Vos mailto:notifications@github.com March 25, 2014 4:21 AM

It seems this is fixed upstream? https://groups.google.com/d/msg/cascading-user/9vy5H0gY8sk/kVM-YRPdycgJ

Can I use that with Cascalog?

— Reply to this email directly or view it on GitHub https://github.com/nathanmarz/cascalog/issues/228#issuecomment-38548752.

Sam Ritchie (@sritchie) Paddleguru Co-Founder 703.863.8561 www.paddleguru.com http://www.paddleguru.com/ Twitter http://twitter.com/paddleguru// Facebook http://facebook.com/paddleguru

pepijndevos commented 10 years ago

I deleted my comment because I realized they where talking about cascading 2.1, while we're already using 2.5, so the thread is irrelevant.

I also fixed my problem by using a Nullable variable, so the empty last column doesn't cause it to be excluded. Not sure if @methylene still has problesm.

methylene commented 10 years ago

@sritchie I've tried putting the [cascading/cascading-hadoop "2.6.0-wip-2"] into the cascalog-more-taps/project.clj in the d76a6b12083c11915b525988fa1ca1ad30063b36 commit, appended the test above to cascalog-more-taps/test/cascalog/more-taps-test.clj, then ran the test using lein test :only cascalog.more-taps-test/delimited-empty-token-test. But the issue seems to be still there.

methylene commented 10 years ago

It's not working, but here's a commit to add the test and bump the cascading version: 71d22c1618428d8de18c7112d257f1555e7029ef