Open GoogleCodeExporter opened 9 years ago
Please give me some more reproduction details... there is a test case in the
system, and I've exercised it fairly extensively with local filesystem files
successfully.
It is known that reading CSV files with a header won't work when reading from
HDFS. Is that your use case?
Original comment by c...@lambda.nu
on 2 Apr 2015 at 12:34
I am using a single node cluster on my local machine.
Sample data:
row_id,sid,date,time,day,duration,mode,category,is_FB
1694754,49,4/16/14,19:13:36,1,6,url,Uncategorized,0
1694755,49,4/16/14,19:13:44,1,6,url,Online Service,0
1694756,49,4/16/14,19:13:50,1,2,url,Uncategorized,0
1694757,49,4/16/14,19:13:53,1,1,url,Academic,0
1694758,49,4/16/14,19:13:54,1,13,url,Uncategorized,0
1694759,49,4/16/14,19:14:08,1,5,url,Uncategorized,0
1694760,49,4/16/14,19:14:14,1,103,url,Uncategorized,0
Queries:
drop dataverse test if exists;
create dataverse test;
use dataverse test;
create type LogTypeRaw as open {
row_id: int64,
sid: int64,
date: string,
time: string,
day: int64?,
duration: int64?,
mode: string?,
category: string?,
is_FB: int64?
}
create dataset Log_raw (LogTypeRaw)
primary key row_id, sid, date, time;
load dataset Log_raw using localfs
(("path"="127.0.0.1:///path/to/test.csv"),
("format"="delimited-text"),
("header"="true"));
count( for $x in dataset Log_raw return $x);
The result should be a count of 7 rows, but the query returns only 4.
Original comment by ecarm...@ucr.edu
on 2 Apr 2015 at 4:28
I tried exactly this query and dataset here, and the result was 7 as expected.
This was built using the current tip of Hyracks and Asterix, specifically SHAs
38dea13 and 385bfd8 respectively.
What version of AsterixDB are you running? I assume you're building from
source, so what SHA are you synced to? The initial support for parsing headers
at all was introduced in fda0725 on Feb. 13, but there was indeed a fix for
line-counting put in at e1a2ff8 on Feb. 19. If you happen to be better those
revisions, that would explain what you see.
Original comment by c...@lambda.nu
on 3 Apr 2015 at 4:35
s/better/between/
Original comment by c...@lambda.nu
on 3 Apr 2015 at 5:34
My branch (ecarm002/intervals) is based on 385bfd8 master for AsterixDB. It
should be the latest master version as this code has been rebased and is about
to be merged.
Original comment by ecarm...@ucr.edu
on 3 Apr 2015 at 4:57
I got the actual input file from Preston and it turns out that it is using an
unusual line-ending scheme - it only has carriage-return \r characters between
lines. With that file, I can indeed reproduce the issue.
The bug I fixed in e1a2ff8 had to do with line-endings as well (in that case
supporting CRLF), so I'm hoping it will be a similar simple fix.
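For anyone who wants to check whether their own input file has the same problem, here is a minimal Python sketch that tallies the line-ending styles in a file's raw bytes. It is not part of AsterixDB or Hyracks; the function name count_line_endings is made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CR, and lone LF terminators in raw bytes.

    Hypothetical helper for diagnosing input files; not AsterixDB code.
    """
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the terminators that stand alone.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}
```

To use it, read the file in binary mode (for example open(path, "rb").read()) and pass the bytes in. A file like the one described above would show a non-zero "CR" count and zeros for the other two.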
Original comment by c...@lambda.nu
on 3 Apr 2015 at 6:06
I've implemented a change which fixes this case. The parsing code is actually
in Hyracks now so I will propose the change there. Additionally, I've added new
test cases in Asterix for parsing CSV with headers with CR, LF, and CRLF
endings, all of which now pass with the updated Hyracks (the CR test previously
failed).
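To illustrate the behavior the new tests check, here is a rough Python sketch of splitting delimited text into records while accepting CR, LF, or CRLF as the terminator and optionally skipping a header row. This is not the actual Hyracks change (that code is Java); split_records is a hypothetical name, and the sketch ignores quoted fields that contain embedded line breaks.

```python
def split_records(text: str, header: bool = False) -> list:
    """Split text into records on CR, LF, or CRLF (illustrative sketch only)."""
    records = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == "\r" or ch == "\n":
            records.append(text[start:i])
            # Treat a CR immediately followed by LF as a single terminator.
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1
            start = i + 1
        i += 1
    # Keep a final record that has no trailing terminator.
    if start < n:
        records.append(text[start:])
    # Drop the first record when the caller says it is a header row.
    return records[1:] if header else records
```

With this approach, a file with a header and 7 data rows yields 7 records whichever of the three terminators it uses.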
Original comment by c...@lambda.nu
on 4 Apr 2015 at 12:09
Hyracks fix: http://fulliautomatix.ics.uci.edu:8443/#/c/246/
New AsterixDB test cases: http://fulliautomatix.ics.uci.edu:8443/#/c/247/
Preston, can you try patching my Hyracks fix into your build and verifying that
it fixes the problem? If so, and assuming the test run for the asterix change
doesn't show any new failures, we can submit this.
Original comment by c...@lambda.nu
on 4 Apr 2015 at 6:19
The fix worked on my csv file.
Original comment by ecarm...@ucr.edu
on 7 Apr 2015 at 8:11
Original issue reported on code.google.com by
ecarm...@ucr.edu
on 2 Apr 2015 at 12:19