rbheemana / Cobol-to-Hive

Serde for Cobol Layout to Hive table
Apache License 2.0

Cobol to Hive Serde blog #3

Open rbheemana opened 8 years ago

willddy commented 8 years ago

It seems the code is not complete; there is no part that decodes from EBCDIC to ASCII.

rbheemana commented 8 years ago

Hey, please try it; it works, and we are using it in our system. Here is the exact line in CobolDeserializer that does the decoding:

    byte[] temp = transcodeField(fieldBytes, Charset.forName("ebcdic-cp-us"), Charset.forName("ascii"), columnName);
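In essence, the helper decodes the raw field bytes in the source charset and re-encodes them in the target one. A minimal sketch of that idea (my illustration, not the project's exact source; the column name is assumed to be carried along only for error reporting):

    import java.nio.charset.Charset;

    public class TranscodeSketch {
        // Decode the raw field bytes as EBCDIC text, then re-encode as ASCII.
        static byte[] transcodeField(byte[] source, Charset from, Charset to, String columnName) {
            String decoded = new String(source, from); // interpret bytes in the source charset
            return decoded.getBytes(to);               // re-encode in the target charset
        }

        public static void main(String[] args) {
            byte[] ebcdic = {(byte) 0xC8, (byte) 0xC9}; // "HI" in EBCDIC (IBM037)
            byte[] ascii = transcodeField(ebcdic, Charset.forName("ebcdic-cp-us"),
                    Charset.forName("ascii"), "demo-col");
            System.out.println(new String(ascii)); // prints HI
        }
    }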

willddy commented 8 years ago

Thanks. I did not look into the details. I am able to create the Hive table from the copybook; however, select * from table returns empty. Do you mind adding a pom.xml to the project and removing/commenting out the failing test cases? Or sharing a workable jar, so I can test whether it is something wrong with the data or the code?

manichinnari555 commented 8 years ago

Hello Willddy,

I am testing it but I am getting the following error (FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.savy3.hadoop.hive.serde2.cobol.CobolSerde). Can you please tell me how you managed to use the serde? I have performed the following steps, but the serde issue still hasn't been resolved.

  1. The serde class name starts with com, not org.
  2. You need to specify the input format and output format.
  3. Try placing the copybook in an HDFS path (since it is big) and use cobol.layout.url.
  4. Specify the fixed-length property (if it is a fixed-length file).

It should look something like below:

    ADD JAR CobolSerde.jar;
    CREATE EXTERNAL TABLE default.mani
    ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde2.cobol.CobolSerde'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/user/manichinnari/bin_file.txt'
    TBLPROPERTIES ('cobol.layout.url'='/user/manichinnari/copybooks/CKNNEWTN.txt', 'fb.length'='450');
willddy commented 8 years ago

My email is willddy@gmail.com; feel free to contact me. I need to see your copybook. As far as I know, there are a couple of limitations on this right now.

willddy commented 8 years ago

rbheemana, does it support multiple schemas in the copybook? For example, when the data and trailer have different schemas but are in the same data file?

rbheemana commented 8 years ago

If there are multiple 01 levels, it does not. However, if you use a REDEFINES clause for the header and trailer within the same 01-level group, it works absolutely fine.
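For illustration, a hypothetical single-01 layout along those lines (field names and PICs made up):

    01 WS-REC.
       05 WS-DETAIL.
          10 WS-ACCT-NO   PIC X(10).
          10 WS-AMOUNT    PIC 9(7)V99.
       05 WS-TRAILER REDEFINES WS-DETAIL.
          10 WS-REC-COUNT PIC 9(9).
          10 FILLER       PIC X(10).

Here the detail and trailer records share the same 19-byte area, so the serde sees one record layout.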

sekharpgs commented 8 years ago

Hi rbheemana, we have tried this serde on a mainframe VB file, but it is throwing the below error.

    2016-05-18 15:03:40.797, Split = 32%, Split No: 3 start Pos: -49805 splitsize: 49865 Records in split: 1
    Failed with exception java.io.IOException:java.io.IOException: Partial record(length = -49805) found at the end of file hdfs://localhost:8020/user/hive/warehouse/ptab28/combdatdpd.txt

copybook:

    01 WS-VAR.
       05 WS-NAME PIC X(12).
       05 WS-MARKS-LENGTH PIC 9(2).
       05 WS-MARKS OCCURS 0 TO 6 TIMES DEPENDING ON WS-MARKS-LENGTH.
          10 WS-MARK PIC 999.
       05 WS-NICKNAME PIC X(6).

inputdata(EBCDIC binary): ÂÉÓÓ@âÔÉã@@@@öø÷ùø÷ùø÷ùø÷ùø÷ùø÷ùÂÉÓÓ

Corresponding ASCII text: BILL SMIT 6879879879879879879BILL

commands used:

    CREATE TABLE ptab28
    ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
    STORED AS INPUTFORMAT 'com.savy3.mapred.MainframeVBInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    TBLPROPERTIES ('cobol.layout.url'='/user/input/vbcopy.txt');

    LOAD DATA LOCAL INPATH 'vbtext.txt' INTO TABLE ptab38;

Could you please help as quickly as possible? Please reach me at sekhar.pgs@gmail.com.

rbheemana commented 8 years ago

Hi, while transferring a VB file from the mainframe, please make sure you enable the RDW (Record Descriptor Word), the 4-byte length prefix carried on each variable-length record. Based on the error, I believe the file is missing the RDW bytes. Thanks, Ram
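For example, an illustrative FTP pull that preserves the RDW (the dataset name is hypothetical; exact commands depend on your FTP client and site):

    ftp> binary
    ftp> quote site rdw
    ftp> get 'PROD.VB.DATASET' vbfile.bin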

njagadam commented 8 years ago

Hi Ram,

I am pulling a VB file from the mainframe to Linux, and I have enabled the RDW byte using the command "quote site rdw", but when trying to run the query I get the following error:

    Hive Runtime Error while processing writable 83 3f 00 00 00 00 00 0f 00 00 00 42 71 6c 40 40 40 00 0c 04 27 16 11 8c 00 1f 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:505)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
        ... 8 more
    Caused by: java.lang.IllegalArgumentException: 81 > 64
        at java.util.Arrays.copyOfRange(Arrays.java:2549)
        at com.savy3.hadoop.hive.serde3.cobol.CobolField.getBytes(CobolField.java:101)
        at com.savy3.hadoop.hive.serde3.cobol.CobolStringField.deserialize(CobolStringField.java:57)
        at com.savy3.hadoop.hive.serde3.cobol.CobolGroupField.deserialize(CobolGroupField.java:236)
        at com.savy3.hadoop.hive.serde3.cobol.CobolGroupField.deserialize(CobolGroupField.java:241)
        at com.savy3.hadoop.hive.serde3.cobol.CobolGroupField.deserialize(CobolGroupField.java:241)
        at com.savy3.hadoop.hive.serde3.cobol.CobolGroupField.deserialize(CobolGroupField.java:241)
        at com.savy3.hadoop.hive.serde3.cobol.CobolToHive.deserialize(CobolToHive.java:28)
        at com.savy3.hadoop.hive.serde3.cobol.CobolSerDe.deserialize(CobolSerDe.java:62)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:136)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:100)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:496)

Thanks

rbheemana commented 8 years ago

Please verify the cobol.hive.mapping property immediately after creating the table. It lists the COBOL fields with their offsets and lengths, which should resolve copybook layout problems. Once that is sorted out, please let me know if you are still facing the issue. Thanks, Ram

njagadam commented 8 years ago

Hi Ram,

Thanks for your response. I am trying to find the log for cobol.hive.mapping and couldn't find it. Can you please suggest where I can find cobol.hive.mapping?

Thanks!

rbheemana commented 8 years ago

It is a configuration variable that is set while creating the table. So, after creating the table, run 'set cobol.hive.mapping;' at the Hive prompt to list its value. Thanks, Ram
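For example, in the Hive CLI (the value printed is whatever field/offset/length mapping your copybook produced):

    hive> CREATE EXTERNAL TABLE ... ;   -- your serde DDL
    hive> set cobol.hive.mapping;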

njagadam commented 8 years ago

Hi Ram,

Thanks! I did the steps and still getting this error..

Failed with exception java.io.IOException:java.io.IOException: Partial record(length = -14835) found at the end of file hdfs://quickstart.cloudera:8020/user/cloudera/source

Thanks

manichinnari555 commented 8 years ago

Hello,

May I know how you are uploading/transferring the binary file to HDFS?

Regards, Venkat


rafaelpiresm commented 8 years ago

Hi, manichinnari555. Did you solve your last issue (FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.savy3.hadoop.hive.serde2.cobol.CobolSerde)? If so, can you share your approach, please? I've been having the same problem! PS: I'm not using the Cloudera stack, but Hortonworks (HDP 2.4). Cheers, Rafael

njagadam commented 8 years ago

Hi Venkat,

I tried multiple options:

1) ftp user pwd; binary; quote site rdw

2) ftp user pwd; binary; type e; quote site rdw

3) ftp user pwd; quote mode b; type e; quote site rdw

etc.

Thanks

manichinnari555 commented 8 years ago

Hello rafaelpiresm,

This is related to an access issue; you would need to log in/work with an application ID rather than your personal ID. If you're working in a sandbox, it shouldn't be a problem. Thanks.

manichinnari555 commented 8 years ago

Hello njagadam ,

Then I think you would need to check the copybook layout against the file length; please make sure the two match.

Regards, Venkat

njagadam commented 8 years ago

Hi Ram,

When I create a new table and run describe on it, it shows the layout of the previous table. Any idea why? Also, when I do set cobol.hive.mapping, it shows the old layout (created using the previous copybook). Can you please point out why this happens and what we should do to stop it?

Thanks

rbheemana commented 8 years ago

That is weird; I will try to replicate the scenario at my end. To get you going: before creating the new table, try the command set cobol.hive.mapping=""; and then create the table. Let me know if the issue persists even after doing that.

njagadam commented 8 years ago

I tried the command set cobol.hive.mapping="" before creating the table, but it still shows the previous table's DDL. The strange thing is that it shows the previous table's DDL even after I dropped the previous table.

I dropped all the tables, did set cobol.hive.mapping, and it still shows the layout from the first table creation.

Is this saved in a buffer that it keeps reusing without overwriting, or am I missing something?

Thanks

rbheemana commented 8 years ago

I will try to replicate this issue this evening or tomorrow. I can't tell anything without seeing the issue.

manichinnari555 commented 8 years ago

Hi,

May I know how you are creating the table: is it with Hue or Beeline? I can't say what the issue is exactly; even I faced it. But try creating the table in the Beeline CLI or Hive CLI; that should solve the issue.

Regards, Venkat

njagadam commented 8 years ago

Hi Venkat,

I created the table using the Hive CLI.

Thanks

manichinnari555 commented 8 years ago

Hello,

Yes, I am also getting a similar issue, and sometimes it works; I am not sure what the issue is.

Regards, Venkat

manichinnari555 commented 8 years ago

Yes,

It looks like it is storing the DDL in a buffer and reusing it. If you exit the CLI and log back in, it seems to work. Please let me know if that works. Thanks.

rafaelpiresm commented 8 years ago

Hi, guys! Thank you, manichinnari555. The jar that I had downloaded was corrupted. Now it works, but I'm getting an error message when I try to create a table that looks like the sample provided in the documentation:

"Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table (state=08S01,code=1)"

My command is:

    CREATE TABLE Cobol2Hive
    ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
    STORED AS INPUTFORMAT 'com.savy3.mapred.MainframeVBInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION 'hdfs:///path/to/file.bin'
    TBLPROPERTIES ('cobol.layout.url'='hdfs:///path/to/file.cbl');

I'm using Beeline and HDP 2.3.

Cheers, Rafael

manichinnari555 commented 8 years ago

Hello,

I got the same error while executing the script in the Beeline CLI, and I gathered that it seems to occur if Beeline has been open for a long time. I would suggest logging out, closing the window, and reopening; most probably that will resolve the issue. Let me know how it goes, thanks.

But how did you resolve the earlier DDL issue?

njagadam commented 8 years ago

I logged out of Hive and logged back in. Now I am getting the error below when I try select * from table;

Failed with exception java.io.IOException:java.lang.IllegalArgumentException: 56 > 54

The file is variable-blocked and has COMP-3 values in it.

Thanks!

manichinnari555 commented 8 years ago

Hello Ram,

Did you get a chance to look into this issue?

njagadam commented 8 years ago

Hi,

When I drop and recreate the table in the Hive CLI, it shows the new DDL, but Hue still shows the old DDL for the table. Any suggestions on how to resolve this issue? Thanks!

MuraliKandimalla commented 8 years ago

select * from table is fetching zero records from the Hive table. Does anyone know how to fix this, or what the possible issue could be?

josephlim75 commented 8 years ago

Hi rbheemana,

I have tried your tool and everything works: creating the table and reading data is no problem. Unfortunately, I have run into a very strange issue. When I create external table A pointing to copybook A, that works fine. But when I then create another external table B pointing to another copybook B in a different location, the original table A structure gets replaced with table B's; now tables A and B have the same structure. Is this a problem with Hive or with CobolSerde.jar itself?

I have seen the same problem faced by njagadam above. manichinnari555 recommended exiting the CLI and logging back in, or setting cobol.hive.mapping=""; neither approach seems to work. Any clue? Your help is very much appreciated. You can email me at josephlim.intl@yahoo.com if you happen to know the problem.

willddy commented 8 years ago

Please post your Hive SQL so we can reproduce it.

josephlim75 commented 8 years ago

Below is what I used to create the table tatrxn. When I then create another table, ctptrxn, with a totally different structure, it just overwrites tatrxn.

    CREATE EXTERNAL TABLE tatrxn
    ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/raw/tatrxn/data'
    TBLPROPERTIES ('cobol.layout.url'='/raw/tatrxn/copybook/tatrxn.cbl', 'fb.length'='87');

I have also tried this, with no luck:

    CREATE EXTERNAL TABLE tatrxn
    ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/raw/tatrxn/data'
    TBLPROPERTIES ('cobol.layout.url'='/raw/tatrxn/copybook/tatrxn.cbl', 'cobol.hive.mapping'='', 'fb.length'='87');

The ctptrxn DDL is as follows

    CREATE EXTERNAL TABLE ctptrxn
    ROW FORMAT SERDE 'com.savy3.hadoop.hive.serde3.cobol.CobolSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/raw/ctptrxn/data'
    TBLPROPERTIES ('cobol.layout.url'='/raw/ctptrxn/copybook/ctptrxn.cbl', 'fb.length'='180');

I have tried issuing set cobol.hive.mapping="" before calling CREATE EXTERNAL TABLE, also with no luck.

Sample output from my PuTTY session:

    hive> desc ctptrxn;
    OK
    firstname   varchar(19)   from deserializer
    lastname    varchar(19)   from deserializer
    age         varchar(3)    from deserializer
    cardnumber  varchar(25)   from deserializer
    trxntype    varchar(3)    from deserializer
    trxnamount  varchar(10)   from deserializer
    trxndate    varchar(8)    from deserializer
    Time taken: 0.522 seconds, Fetched: 7 row(s)
    hive> desc tatrxn;
    OK
    firstname   varchar(19)   from deserializer
    lastname    varchar(19)   from deserializer
    age         varchar(3)    from deserializer
    cardnumber  varchar(25)   from deserializer
    trxntype    varchar(3)    from deserializer
    trxnamount  varchar(10)   from deserializer
    trxndate    varchar(8)    from deserializer
    Time taken: 0.382 seconds, Fetched: 7 row(s)

I am running the HDP 2.4 sandbox, which comes with Hive 1.2 if I am not mistaken.

In case you need my copybook, here is tatrxn.cbl

    01 TRXN-RECORD.
       05 FIRSTNAME  PIC X(19).
       05 LASTNAME   PIC X(19).
       05 AGE        PIC X(3).
       05 CARDNUMBER PIC X(25).
       05 TRXNTYPE   PIC X(3).
       05 TRXNAMOUNT PIC X(10).
       05 TRXNDATE   PIC X(8).

You can use any other copybook to simulate this and see whether it overwrites the TATRXN structure.
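For instance, any second copybook with a different layout should reproduce it; a hypothetical ctptrxn.cbl (fields made up for illustration):

    01 CTP-RECORD.
       05 STOREID    PIC X(10).
       05 STORENAME  PIC X(30).
       05 CTPAMOUNT  PIC X(12).
       05 CTPDATE    PIC X(8).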

manichinnari555 commented 8 years ago

Hello,

Try exiting and re-logging in and see.

josephlim75 commented 8 years ago

Hi Mani,

I have already tried that; I even went to the level of restarting my Hive service before creating the 2nd table. The strange thing is that when you issue the 2nd table's DDL and describe the table, it still uses the 1st table's DDL. When you restart the Hive service, both the 1st and 2nd tables then use the latest DDL. I know it looks kind of strange, but this is what I am experiencing. Does anyone else have this problem at all?

manichinnari555 commented 8 years ago

Yes, I had a similar issue.

willddy commented 8 years ago

Are you guys using the metadata store on Derby or on a MySQL-like database?

manichinnari555 commented 8 years ago

Hello,

Are you using Hue or the CLI to create the tables?

josephlim75 commented 8 years ago

@Will: I am not sure, but I don't see a MySQL service running on the HDP VM, so I assume it is Derby.

@Mani: As I responded earlier with my PuTTY output, it is the Hive CLI, not Hue or Beeline. Since you stated you have the same problem, were you able to resolve it by using Hue for the table creation?

will commented 8 years ago

I've worked on Postgres hosting for the last 6 years though, so maybe that's the issue.

josephlim75 commented 8 years ago

@Will: Just to confirm again: you are using Postgres as your Hive metastore and no longer have this problem? And are you using the Hive CLI or Hue to create the tables? Before I go ahead and configure the metastore on Postgres, I just want to check whether Hue makes a difference.

willddy commented 8 years ago

PostgreSQL should be good

Thanks, Dayong

gvickie commented 8 years ago

I am getting the below error when I try to run this code:

    Logging initialized using configuration in jar:file:/C:/hadoop/hive-0.12.0.2.0.8.0-1639/lib/hive-common-0.12.0.2.0.8.0-1639.jar!/hive-log4j.properties
    Added C:\Users\HDIinyvikg\AppData\Local\Temp\6\a6d1b94f-5402-4804-8784-f85c453a48dd_resources\CobolSerde.jar to class path
    Added resource: C:\Users\HDIinyvikg\AppData\Local\Temp\6\a6d1b94f-5402-4804-8784-f85c453a48dd_resources\CobolSerde.jar
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Class not found: org.apache.hadoop.mapred.FixedLengthInputFormat

Can you please help us resolve this issue? I am reachable at g_vickie84@yahoo.co.in.

josephlim75 commented 8 years ago

Hi Guys,

I have tested with Postgres; no luck, it seems. I am not sure what I am missing. I connected to the Hive metastore and looked at the COLUMNS_V2 table to check whether the columns for the table were generated as duplicates, and they are indeed duplicated.

By the way, my previous metastore was MySQL, not Derby; that is the default in the Hortonworks Sandbox. I really do not know whether this is something to do with my environment or with the JAR itself.

@Will: Do you have a sample of a working CREATE TABLE DDL that you can share? You can send it to me at josephlim.intl@yahoo.com.

josephlim75 commented 8 years ago

Hi All,

@Mani: It works the way you suggested: exit the CLI and enter the Hive CLI again. I think the CLI somehow reinitializes the Hive session each time it starts, but that reinitialization only happens in the Hive CLI world.

The problem is more with using Hue, Zeppelin, Ambari Hive Views, and that sort of tool. When you first start using any of these tools and, say, query table A, it somehow always gets cached. So when I describe table B, it still gives me table A's structure. When I restart the Hive service and describe table B for the first time, it shows table B's structure; but the moment I describe table A, it shows table B's structure.

Is there a way, from the tools or via some configuration, to reinitialize or reset the cache? Is there a way I could clear the cache before issuing a query? For example:

set clear; desc TableA;

or

desc TableB set clear; select * from TableA;

willddy commented 8 years ago

I have tested with the two tables below. They both work fine. Note that an ADD JAR statement only applies to the current Hive session. To use the jar globally, you need to either add it manually in each session where you query the table or register it globally; see https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cm_mc_hive_udf.html.

    ADD JAR /home/vagrant/mainframe-serde-0.0.1-SNAPSHOT.jar;
    CREATE EXTERNAL TABLE test1
    ROW FORMAT SERDE 'your serde class'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/data/test1'
    TBLPROPERTIES ('cobol.layout.url'='/cpbk/test1/CTRL_DATA.COPYBOOK.TXT', 'fb.length'='68');

    ADD JAR /home/vagrant/mainframe-serde-0.0.1-SNAPSHOT.jar;
    CREATE EXTERNAL TABLE test2
    ROW FORMAT SERDE 'your serde class'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.FixedLengthInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/data/test2'
    TBLPROPERTIES ('cobol.layout.url'='/cpbk/test2/CTRL_DATA.COPYBOOK.TXT', 'fb.length'='68');

rbheemana commented 8 years ago

Hi All, first things first: sorry for the delay in responding; I got busy with bread-winning work :-) I read all the issues and comments and replicated the problem myself. I have summarized the issue under https://github.com/rbheemana/Cobol-to-Hive/issues/12.

Please have a look, and feel free to suggest details I should add to the description.

After a sleepless night, I finally figured out the issue and updated the code. I have also updated the jar; please try it.

I have also updated the column comments with the corresponding copybook field, its offset, and its length.

Special thanks to @willddy, @josephlim75, @manichinnari555, and @njagadam for identifying the problem and trying out alternate approaches to resolve it. It really helped the analysis.

Thank you all for the patience. Cheers!!