Open methylene opened 10 years ago
Acually the lines I used had more tokens than 2. I hope it can be reproduced anyway.
Also, is there a way to disable quoting entirely on an hfs-delimited source tap? At the moment I just use some unlikely quote character like :quote "\u0003" but still have to strip that character from the input, along with delim chars
Not sure on this one - you might check out the underlying implementation in Cascading? This would be a nice feature to add.
@methylene you can pass in :strict? false
option into hfs-delimited and your missing field at the end would just be null. Regarding quote, I don't think you can disable it.
I'm going to close this issue for now. Feel free to re-open or msg me if this doesn't work for you.
Hi Paul, thanks for the tip, that's definitly helpful. But I think it's only a workaround. Maybe there is a misunderstanding? In my input data there is no "missing" field, but an empty one at the end of some line, right before the line break. In particular, there is no line in my input that doesn't have the correct number of delimiters. Using :strict? false
will resolve the error, but it also disables a reasonable sanity check on each line. So yes I think this should be reopened.
@methylene delimited expects an ISO legit delimited file. Your best bet is to just read it in with hfs-textline then parse it manually. Does that answer your question?
Please explain, how a csv line with the correct number of tokens, but the last token empty, violates an ISO standard?
You mean, adding quote chars would make the problem go away?
So yes, I have yet to try
"a"\"t"b\t"" instead of a\tb\t
as input line, but I'm afraid that's not what you get as an output when using hfs-delimited or hfs-textline as a sink
When using hfs-delimited or hfs-textline as a sink tap, empty strings are encoded just the same way as nulll values, that can also be a problem
Is this test case corresponding to your problem? https://github.com/Quantisan/cascalog/commit/cac0119730b14035162a6e624f7edd4ad73353dc
I think there's something wrong with that test case. Yes it looks as if it confirms the issue. But the following fails, too:
(deftest delimited-empty-notlast-test
(fact
(io/with-fs-tmp [_ tmp]
(?- (hfs-textline tmp) ;; write line
[["Proin\t\thendrerit"]])
(fact "Test hfs-delimited where last element is empty"
(<- [?a ?b !c]
((hfs-delimited tmp :delimiter "\t") ?a ?b !c)) =>
(produces [["Proin" nil "hendrerit"]])))))
Sorry, it should have been !b, not ?b. The test describes the problem well. The following test succeeds, but fails without the (remove (comp empty? last))
line:
(deftest delimited-empty-token-test
(let [lines (->> [["0" "1" "2"] ["" "1" "2"]
["0" "" "2"] ["" "" "2"]
["0" "1" "" ] ["" "1" "" ]
["0" "" "" ] ["" "" "" ]]
(remove (comp empty? last))
(vec))]
(io/with-fs-tmp [_ tmp]
(?- (hfs-textline tmp) ;; write line
lines)
(fact (str "Test with nil at various positions")
(<- [!a !b !c]
((hfs-delimited tmp :delimiter "\t") !a !b !c)) =>
(produces (mapv (partial mapv not-empty) lines))))))
By the way, I haven't used midje before, but I wonder why all tests in cascalog.more-taps-test have two (fact) forms, the outer one having only one argument and no "=>" symbol?
I have this problem as well. My file uses commas and quotes, so stuff like
"col","another",""
I don't get any error, just no result at all. It's also not trapped.
This is the code I use to parse it
(defn log-source [path fields]
(let [cfields (vec (map (partial str "?") fields))
collect (mapfn [& values]
[(zipmap fields values)])
log (hfs-delimited path
:outfields cfields
:delimiter ","
:compress? true
:strict? false)]
(prn cfields)
(<- [?map]
(:trap (stdout))
(:distinct false)
(log :>> cfields)
(collect :<< cfields :> ?map)
)))
So if I understand correctly, ther is a testcase, but no fix?
I'm not sure what's going on here. It seems to matter which of these two lines I pick.
(log :> ?field ?another _) ; works
(log :> ?field ?another ?broken)
(log :> ?field ?another !works) ; AHA!
The cascading guys run their tests against Cascalog, so you should be able to include it in your project.clj and override the cascalog version. If it works, I'd love a pull req to bump the default cascading version.
Thanks, Sam
Pepijn de Vos mailto:notifications@github.com March 25, 2014 4:21 AM
It seems this is fixed upstream? https://groups.google.com/d/msg/cascading-user/9vy5H0gY8sk/kVM-YRPdycgJ
Can I use that with Cascalog?
— Reply to this email directly or view it on GitHub https://github.com/nathanmarz/cascalog/issues/228#issuecomment-38548752.
Sam Ritchie (@sritchie) Paddleguru Co-Founder 703.863.8561 www.paddleguru.com http://www.paddleguru.com/ Twitter http://twitter.com/paddleguru// Facebook http://facebook.com/paddleguru
I deleted my comment because I realized they where talking about cascading 2.1, while we're already using 2.5, so the thread is irrelevant.
I also fixed my problem by using a Nullable variable, so the empty last column doesn't cause it to be excluded. Not sure if @methylene still has problesm.
@sritchie I've tried putting the [cascading/cascading-hadoop "2.6.0-wip-2"]
into the cascalog-more-taps/project.clj in the d76a6b12083c11915b525988fa1ca1ad30063b36 commit, appended the test above to cascalog-more-taps/test/cascalog/more-taps-test.clj, then ran the test using lein test :only cascalog.more-taps-test/delimited-empty-token-test
. But the issue seems to be still there.
It's not working, but here's a commit to add the test and bump the cascading version: 71d22c1618428d8de18c7112d257f1555e7029ef
I've noticed I get an exception when I use something like
as a generator, and a.txt contains some lines where the second token is empty, so that the line ends in \t\n. In this case, I get a "operation added wrong number of fields" exception.
I'm not sure whether it's a cascading or cascalog problem.
The cascalog version is 2.0.0, the hadoop version is 1.1.2.