shamim8888 / asterixdb

Automatically exported from code.google.com/p/asterixdb

CSV reads every other line when header supplied #867

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Import a CSV file with a header.

What is the expected output? What do you see instead?
I want all the data, but only half is read into the database.

Please use labels and text to provide additional information.
It looks like the line counter is not getting updated for each new line. Thus 
the "header is ignored" code is run for each record read.

Original issue reported on code.google.com by ecarm...@ucr.edu on 2 Apr 2015 at 12:19

GoogleCodeExporter commented 8 years ago
Please give me some more reproduction details... there is a test case in the 
system, and I've exercised it fairly extensively with local filesystem files 
successfully.

It is known that reading CSV files with a header won't work when reading from 
HDFS, is that your use case?

Original comment by c...@lambda.nu on 2 Apr 2015 at 12:34

GoogleCodeExporter commented 8 years ago
I am using a single node cluster on my local machine.

Sample data: 
row_id,sid,date,time,day,duration,mode,category,is_FB
1694754,49,4/16/14,19:13:36,1,6,url,Uncategorized,0
1694755,49,4/16/14,19:13:44,1,6,url,Online Service,0
1694756,49,4/16/14,19:13:50,1,2,url,Uncategorized,0
1694757,49,4/16/14,19:13:53,1,1,url,Academic,0
1694758,49,4/16/14,19:13:54,1,13,url,Uncategorized,0
1694759,49,4/16/14,19:14:08,1,5,url,Uncategorized,0
1694760,49,4/16/14,19:14:14,1,103,url,Uncategorized,0

Queries:
drop dataverse test if exists;
create dataverse test;
use dataverse test;

create type LogTypeRaw as open {
    row_id: int64,
    sid: int64,
    date: string,
    time: string,
    day: int64?,
    duration: int64?,
    mode: string?,
    category: string?,
    is_FB: int64?
}

create dataset Log_raw (LogTypeRaw)
primary key row_id, sid, date, time;

load dataset Log_raw using localfs 
    (("path"="127.0.0.1:///path/to/test.csv"), 
    ("format"="delimited-text"), 
    ("header"="true"));

count( for $x in dataset Log_raw return $x);

The result should be a count of 7 rows, yet only 4 exist.

Original comment by ecarm...@ucr.edu on 2 Apr 2015 at 4:28

GoogleCodeExporter commented 8 years ago
I tried exactly this query and dataset here, and the result was 7 as expected. 
This was built using the current tip of Hyracks and Asterix, specifically SHAs 
38dea13 and 385bfd8 respectively.

What version of AsterixDB are you running? I assume you're building from 
source, so what SHA are you synced to? The initial support for parsing headers 
at all was introduced in fda0725 on Feb. 13, but there was indeed a fix for 
line-counting put in at e1a2ff8 on Feb. 19. If you happen to be better those 
revisions, that would explain what you see.

Original comment by c...@lambda.nu on 3 Apr 2015 at 4:35

GoogleCodeExporter commented 8 years ago
s/better/between/

Original comment by c...@lambda.nu on 3 Apr 2015 at 5:34

GoogleCodeExporter commented 8 years ago
My branch (ecarm002/intervals) is based on 385bfd8 master for AsterixDB. It 
should be the latest master version as this code has been rebased and is about 
to be merged.

Original comment by ecarm...@ucr.edu on 3 Apr 2015 at 4:57

GoogleCodeExporter commented 8 years ago
I got the actual input file from Preston and it turns out that it is using an 
unusual line-ending scheme - it only has carriage-return \r characters between 
lines. With that file, I can indeed reproduce the issue.
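
A quick way to check which line-ending scheme a file uses is to count the terminators in its raw bytes. The sketch below is only an illustration and is not part of AsterixDB or Hyracks; the function name count_line_endings is made up for this example.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CR and lone LF terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the terminators that stand alone.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# A CR-only sample like the file described above (contents shortened).
sample = b"row_id,sid\r1694754,49\r1694755,49\r"
endings = count_line_endings(sample)
print(endings)
```

For a real file, read it in binary mode (open(path, "rb").read()) so that Python does not translate the line endings before they are counted.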

The bug I fixed in e1a2ff8 had to do with line-endings as well (in that case 
supporting CRLF), so I'm hoping it will be a similar simple fix.
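
One way a CR-only file could produce exactly the reported symptom (4 of 7 rows) is sketched below. This is a toy model written for illustration and is not the Hyracks parser: it assumes, hypothetically, that the reader recognizes every line ending as a record boundary but advances its line counter only when it sees a "\n", and that it discards a line whenever the counter still says "line 1". The names read_records and count_cr are invented for this example.

```python
import re

EOL = re.compile(r"\r\n|\n|\r")  # CRLF is tried first so it counts as one ending


def read_records(text, header=True, count_cr=False):
    """Toy reader. With count_cr=False a lone CR ends a record but does
    not advance the line counter (the hypothetical bug)."""
    records = []
    line_no = 1
    pos = 0

    def next_line():
        nonlocal pos, line_no
        m = EOL.search(text, pos)
        if m is None:
            line, ending, new_pos = text[pos:], "", len(text)
        else:
            line, ending, new_pos = text[pos:m.start()], m.group(), m.end()
        pos = new_pos
        if "\n" in ending or (count_cr and ending == "\r"):
            line_no += 1
        return line

    while pos < len(text):
        if header and line_no == 1:
            next_line()  # discard what the reader believes is the header
            if pos >= len(text):
                break
        records.append(next_line())
    return records


rows = [str(i) for i in range(1, 8)]
cr_only = "\r".join(["header"] + rows) + "\r"
print(read_records(cr_only))                 # every other data row is lost
print(read_records(cr_only, count_cr=True))  # all seven rows survive
```

Under these assumptions the counter never leaves 1 on a CR-only file, so the header-skipping step runs before every record and swallows rows 2, 4 and 6, which matches the count of 4 reported earlier in this thread.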

Original comment by c...@lambda.nu on 3 Apr 2015 at 6:06

GoogleCodeExporter commented 8 years ago
I've implemented a change which fixes this case. The parsing code is actually 
in Hyracks now so I will propose the change there. Additionally, I've added new 
test cases in Asterix for parsing CSV with headers with CR, LF, and CRLF 
endings, all of which now pass with the updated Hyracks (the CR test previously 
failed).
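
The shape of such a test matrix can be sketched outside AsterixDB as follows. This is not the actual test code from the change; it only builds the same header plus rows with each of the three endings and checks that a reader sees the same number of records every time. Python's csv module stands in for the real parser, and the names make_csv and count_records are invented for this example.

```python
import csv

HEADER = "row_id,sid,date,time,day,duration,mode,category,is_FB"
ROWS = [
    "1694754,49,4/16/14,19:13:36,1,6,url,Uncategorized,0",
    "1694755,49,4/16/14,19:13:44,1,6,url,Online Service,0",
    "1694756,49,4/16/14,19:13:50,1,2,url,Uncategorized,0",
]
ENDINGS = {"CR": "\r", "LF": "\n", "CRLF": "\r\n"}


def make_csv(ending):
    """Join the header and rows, terminating every line with `ending`."""
    return "".join(line + ending for line in [HEADER] + ROWS)


def count_records(text):
    """Count data records, skipping the header line."""
    # str.splitlines() treats CR, LF and CRLF as line boundaries.
    reader = csv.reader(text.splitlines())
    next(reader)  # skip the header
    return sum(1 for _ in reader)


counts = {name: count_records(make_csv(e)) for name, e in ENDINGS.items()}
print(counts)
```

Writing make_csv(...) to disk in binary mode (for example with text.encode("ascii")) gives three input files that differ only in their line endings, which is the property the new test cases are meant to exercise.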

Original comment by c...@lambda.nu on 4 Apr 2015 at 12:09

GoogleCodeExporter commented 8 years ago
Hyracks fix: http://fulliautomatix.ics.uci.edu:8443/#/c/246/

New AsterixDB test cases: http://fulliautomatix.ics.uci.edu:8443/#/c/247/

Preston, can you try patching my Hyracks fix into your build and verifying that 
it fixes the problem? If so, and assuming the test run for the asterix change 
doesn't show any new failures, we can submit this.

Original comment by c...@lambda.nu on 4 Apr 2015 at 6:19

GoogleCodeExporter commented 8 years ago
The fix worked on my csv file.

Original comment by ecarm...@ucr.edu on 7 Apr 2015 at 8:11