ogrodnek / csv-serde

Hive SerDe for CSV
Apache License 2.0
141 stars 80 forks source link

Add a property to skip the first N rows #4

Open ccady opened 12 years ago

ccady commented 12 years ago

Many CSV files have a "title row" e.g. :

id,lastname,firstname
"12323","Washington","George"
"12343","Lincoln","Abraham"

It would be nice to be able to set a "linesToSkip" property to skip the first N rows:

row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "linesToSkip" = 1
)

(I am so not set up to do any Java coding -- If I were, I'd do it myself and submit a patch.)

luisfmrosa commented 10 years ago

This functionality cannot be implemented at SerDe level as it controls only the row format. There is no information of current position in file for deserialize() method to do that.

To implement a linesToSkip functionality, we should implement a custom input format (perhaps extending TextInputFormat: http://hadoop.apache.org/docs/r1.2.1/api/index.html).

RickardCardell commented 10 years ago

what about defining the csv-headers in the table's serdeproperties? The serde can then skip the line which matches the header-spec:

create table 
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "header" = "id,lastname,firstname"
)

As of now I need a step that removes the header for each csv-file which is quite cumbersome: One table and a serde is, without this feature, not enough to parse a csv-file.

EDIT: I forgot that a serde always need to output one row (unless it throws an exception but that is rather ugly).

luisfmrosa commented 10 years ago

The problem is that it will make a string comparison for every row in the file, so a performance killer.

The best would be to extend RecordReader and skip desired lines on initalize() method after calling parent's method. If split.getStart() == 0 then we read N dummy lines. Then we extend TextInputFormat to use this new class.

Then we could use it like this :

mapred.skipNLines=3;

CREATE TABLE mytablewithheader(a string, b string, c string)
STORED AS INPUTFORMAT 'com.mypackage.extdtextinputformat.ExtTextInputFormat'"+
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

A good starting point: http://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/ https://github.com/edwardcapriolo/DualInputFormat/blob/master/pom.xml

Not sure this serde package would be the best place to put this code.

sfr commented 8 years ago

This is solved since Hive 0.13.0

create external table testtable
( name    string
, message string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ( 'separatorChar' = ','
                     , 'quoteChar'     = '"'
                     , 'escapeChar'    = '\\' )
stored as textfile
location '/path/to/testtable'
tblproperties ( 'skip.header.line.count' = '1'
              , 'skip.footer.line.count' = '2' );