Anton -
I have used the SerDe on the type of data you have described. I am using Spark to do the same, and the process has been working quite well.
Thanks!
Thanks for your comment, @datawarlock.
When I try to save only two columns (out of ~1,000) to Parquet (in spark-shell) from the Hive EBCDIC table, I see an endless printout: "time stamp, Split=100%, Split No: ..., start Pos: 130089828, splitsize: 0 Records in split: 0" ...
with the "Split No" value growing by about 100K per minute, which doesn't look good.
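Concretely, the spark-shell step is roughly the sketch below; the table, column names, and output path are placeholders rather than my actual schema:

```scala
// Rough shape of the spark-shell step (placeholder names, not the real schema):
// read the SerDe-backed Hive EBCDIC table and write two columns out as Parquet.
val df = spark.table("ebcdic_db.my_ebcdic_table")   // hypothetical database.table

df.select("col_a", "col_b")                         // keep only the two needed columns
  .write
  .mode("overwrite")
  .parquet("/data/out/two_cols_parquet")            // hypothetical output path
```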
P.S. The original EBCDIC binary file is only 124 MB (= 130089828 B), and I have hundreds of GB of such data...
@datawarlock Could you please share your approximate benchmarks?
@ankravch Do you see any performance issues? Or is it only an issue with the logging?
@rbheemana I've developed an EBCDIC conversion workflow using Hadoop/Hive/Spark. First, I convert 'Variable Record Length EBCDIC' to 'Fixed Record Length EBCDIC' (takes <0.5 h for 130 GB of 'Variable Record Length EBCDIC' data). Second, I use the Hive 'cobol2hive' SerDe to convert from 'Fixed Record Length EBCDIC' to Parquet from Spark. The results were fully validated column by column against SAS.
| File Group | N rows | N cols | Variable Record Length EBCDIC | Fixed Record Length EBCDIC | Parquet | SAS | CSV | ebcdic2parquet | parquet2csv |
|---|---|---|---|---|---|---|---|---|---|
| PB | 18,559,770 | 1,628 | 90 GB | 418 GB | 8.5 GB | 41 GB | 63 GB | 3.2 h | 33 min |
| OP | 3,484,084 | 3,473 | 21 GB | 101 GB | 2.5 GB | 13 GB | 23 GB | 1.8 h | 18 min |
| DM | 1,457,277 | 1,336 | 4.6 GB | 26 GB | 393 MB | 2.5 GB | 4 GB | 0.2 h | 2.3 min |
| IP | 852,347 | 3,596 | 10 GB | 25 GB | 625 MB | 1.9 GB | 6 GB | 0.5 h | 4.5 min |
| HH | 488,901 | 3,358 | 3.1 GB | 14 GB | 320 MB | 1.7 GB | 3.2 GB | 0.2 h | 2.6 min |
| SN | 67,574 | 3,596 | 442 MB | 2 GB | 35 MB | 245 MB | 479 MB | 2 min | 38 s |
| HS | 57,844 | 3,456 | 668 MB | 1.7 GB | 76 MB | 200 MB | 407 MB | 2.5 min | 46 s |
The results shown in this table were produced on a single machine (10 physical CPUs, 48 GB RAM limit, 1 SSD) with local data.
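For orientation, the ebcdic2parquet and parquet2csv steps timed above are, in spirit, just the following Spark calls; the database, table, and paths below are illustrative placeholders rather than the exact job code:

```scala
// ebcdic2parquet (illustrative): read the fixed-record-length EBCDIC table
// exposed through the cobol2hive SerDe and persist it as Parquet.
spark.table("ebcdic_db.pb_fixed")        // placeholder database.table
  .write
  .mode("overwrite")
  .parquet("/warehouse/pb_parquet")      // placeholder output path

// parquet2csv (illustrative): read the Parquet output back and export it as CSV.
spark.read.parquet("/warehouse/pb_parquet")
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("/warehouse/pb_csv")
```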
Closing this issue as no changes to the code are required. @ankravch, let me know if you think otherwise.
Hi Ram,
I wonder if you have tested Cobol-to-Hive on large EBCDIC data (~1,000 columns, ~100M rows, vb.length=32100)? Is this package designed to be used on large data in principle?
P.S. My workflow is to load the EBCDIC data into a Hive table and then save it to another format (for example Parquet) from Spark.
Thanks, Anton