rbheemana / Cobol-to-Hive

Serde for Cobol Layout to Hive table
Apache License 2.0
24 stars 23 forks

performance for large ebcdic data #23

Closed ankravch closed 5 years ago

ankravch commented 7 years ago

Hi Ram,

I wonder if you have tested Cobol-to-Hive on large EBCDIC data (~1000 columns, ~100M rows, vb.length=32100)? Is this package designed to be used on large data in principle?

P.S. My workflow is to load the EBCDIC data into a Hive table and then save it to another format (for example Parquet) from Spark.

Thanks, Anton

datawarlock commented 7 years ago

Anton -

I have used the Serde on the type of data you have described. I am using Spark to do the same. The process has been working quite well.

Thanks!

ankravch commented 7 years ago

thanks for your comment @datawarlock

When I try to save only two columns (out of ~1000) to Parquet (in spark-shell) from the Hive EBCDIC table, I see an endless printout: "time stamp, Split=100%, Split No: ..., start Pos: 130089828, splitsize: 0 Records in split: 0" ......

with a growing "Split No" value (about 100K per minute), which doesn't look good.

P.S. The original EBCDIC binary file is only 124 MB (= 130,089,828 bytes), and I have hundreds of GB of such data...

@datawarlock Could you please share your approximate benchmarks?

rbheemana commented 7 years ago

@ankravch Do you see any performance issues? Or is it only an issue with the logging?

ankravch commented 7 years ago

@rbheemana I’ve developed an EBCDIC conversion workflow using Hadoop/Hive/Spark. First step, I convert ‘Variable Record Length EBCDIC’ to ‘Fixed Record Length EBCDIC’ (takes <0.5 h for 130 GB of ‘Variable Record Length EBCDIC’ data). Second step, I use the Hive ‘cobol2hive’ SerDe to convert from ‘Fixed Record Length EBCDIC’ to Parquet from Spark. Results were fully validated column by column against SAS.

| File Group | N rows | N cols | Variable Record Length EBCDIC | Fixed Record Length EBCDIC | Parquet | SAS | CSV | ebcdic2parquet | parquet2csv |
|---|---|---|---|---|---|---|---|---|---|
| PB | 18,559,770 | 1,628 | 90 GB | 418 GB | 8.5 GB | 41 GB | 63 GB | 3.2 h | 33 min |
| OP | 3,484,084 | 3,473 | 21 GB | 101 GB | 2.5 GB | 13 GB | 23 GB | 1.8 h | 18 min |
| DM | 1,457,277 | 1,336 | 4.6 GB | 26 GB | 393 MB | 2.5 GB | 4 GB | 0.2 h | 2.3 min |
| IP | 852,347 | 3,596 | 10 GB | 25 GB | 625 MB | 1.9 GB | 6 GB | 0.5 h | 4.5 min |
| HH | 488,901 | 3,358 | 3.1 GB | 14 GB | 320 MB | 1.7 GB | 3.2 GB | 0.2 h | 2.6 min |
| SN | 67,574 | 3,596 | 442 MB | 2 GB | 35 MB | 245 MB | 479 MB | 2 min | 38 s |
| HS | 57,844 | 3,456 | 668 MB | 1.7 GB | 76 MB | 200 MB | 407 MB | 2.5 min | 46 s |

The results shown in this table were produced on a single machine (10 physical CPUs, 48 GB RAM limit, 1 SSD) with local data.
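The first step above (variable to fixed record length) can be sketched roughly as follows, assuming the input file uses standard IBM RDWs (a 4-byte prefix per record: 2-byte big-endian length that includes the RDW itself, then 2 reserved bytes). This is a minimal illustration, not the tooling actually used; the function name and padding choice (EBCDIC space, 0x40) are my own:

```python
import struct

def rdw_to_fixed(src_path, dst_path, fixed_len, pad=b"\x40"):
    """Rewrite a variable-record-length EBCDIC file (IBM RDW-prefixed)
    as fixed-length records, padding short records with EBCDIC spaces."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            rdw = src.read(4)
            if len(rdw) < 4:
                break  # end of file
            # First 2 bytes: big-endian record length *including* the 4-byte RDW
            (rec_len,) = struct.unpack(">H", rdw[:2])
            data = src.read(rec_len - 4)
            if len(data) > fixed_len:
                raise ValueError("record longer than target fixed length")
            dst.write(data.ljust(fixed_len, pad))
```

Once every record is the same length, a fixed-length input format can split the file cleanly, which is what makes the second (SerDe-to-Parquet) step scale.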

rbheemana commented 5 years ago

Closing this issue as no changes to the code are required. @ankravch, let me know if you think otherwise.