jghoman / haivvreo

Hive + Avro. Serde for working with Avro in Hive
Apache License 2.0

Non-Avro files (ie, compressed ones) being given .avro extension #9

Open koertkuipers opened 12 years ago

koertkuipers commented 12 years ago

my data comes out of external tables backed by avro nicely when i just do "select * from table limit 10;"

however as soon as i do anything that involves map-reduce the data becomes garbled.

for example:

hive> select * from edges limit 5;
OK
347eb58d-9d73-4087-980d-cdd4f9bcffc7    9b7a42e1-8986-48c8-a933-09fecf459289
0d7f380e-0a5c-4e36-af96-5aec6872b67a    3bccb2e1-389a-4443-a1e4-72deb8ed9da3
5ac84201-4adf-460c-91d8-d4f26c73cbe3    7804d00d-5bf0-4d51-82c1-0c95faec0156
fdb87eda-472c-40ec-8807-90e930c3fd09    1e46aa69-63dc-4ee7-b25c-09645c402589
5ba3ae16-c037-486b-b599-710c77976667    224cfcb1-7be6-4c5b-83d3-918bb3bdee27
hive> create table edges_1 as select * from edges;
Total MapReduce jobs = 2
...
hive> select * from edges_1 limit 10;
OK
��w!p73a2b0ae-bfa5-42fe-bd8a-607c5bbf5623 4c504f74-b518-4381-aa00-b0f4ad1ecd0e 0f3c4d08-934f-4ba2-94e7-53e4910f1095 d01b4177-9c9c-4417-P /e-93a004a03a4e 12331383-21fb-4b75-ba3a-dece3256ea90 1ea29a89-4f5eD 6a2-824e-af6e01808a9d 3b7ffcd8-1bff-4035-8650-234bea646c48 7ed7c3f1-905f-H4e-9522-4b5b5a0befa0 6c7db398-7c75-457b-af20-32f7eefe9285 941f1ac5-afc0d a-a30f-864d0d956249 cf7def1f-a925-46d L -8cd74xef0 �
72-4d768385940a NULL $ 6255463d-941c-4ab4-8ee6-26c0b428b72c �

not sure what is going on here... the only haivvreo-related logging i could see was:

2011-12-07 22:08:52,397 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Unable to match filesplit /data/edges/part-00002.avro:0+30819496 with a partition.

jghoman commented 12 years ago

Is there a schema set for the new table (describe extended and check for schema.literal or schema.url)?

koertkuipers commented 12 years ago

the new table is created in hive's default format.

hive> describe extended edges_1;
OK
id1 string
id2 string

Detailed Table Information Table(tableName:edges_1, dbName:default, owner:koert, createTime:1323313762, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id1, type:string, comment:null), FieldSchema(name:id2, type:string, comment:null)], location:hdfs://node01/user/hive/warehouse/edges_1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{transient_lastDdlTime=1323313762}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)

koertkuipers commented 12 years ago

this is the avro backed table:

hive> describe extended edges;
OK
id1 string from deserializer
id2 string from deserializer

Detailed Table Information Table(tableName:edges, dbName:default, owner:koert, createTime:1323311884, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://node01/data/edges, inputFormat:com.linkedin.haivvreo.AvroContainerInputFormat, outputFormat:com.linkedin.haivvreo.AvroContainerOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.linkedin.haivvreo.AvroSerDe, parameters:{schema.url=hdfs://node01:8020/data/schemas/edges.json, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1323311884}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) Time taken: 0.095 seconds

koertkuipers commented 12 years ago

and this is how i created it:

CREATE EXTERNAL TABLE edges
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs://node01:8020/data/schemas/edges.json')
STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/edges/';

jghoman commented 12 years ago

Tables created with CREATE TABLE AS SELECT don't bring along the source table's serde properties. The new table (edges_1) doesn't have a schema.url telling it where to find a reader schema. Use alter table set serdeproperties (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTableProperties) to set schema.url to the same value as on edges. Although, if this is the problem, we definitely need to handle the exception better.
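For example, the statement could look something like the one embedded below. It is shown through Hive's old (HiveServer1-era) JDBC driver only so the exact DDL is concrete; the connection URL, host, and port are placeholders, and the same ALTER TABLE statement can simply be typed at the hive> prompt instead.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SetAvroSchemaUrl {
  public static void main(String[] args) throws Exception {
    // Hive 0.7-era JDBC driver; adjust the driver class and URL to your Hive setup.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // Point edges_1 at the same reader schema the source table (edges) uses.
    stmt.execute("ALTER TABLE edges_1 SET SERDEPROPERTIES "
        + "('schema.url'='hdfs://node01:8020/data/schemas/edges.json')");

    stmt.close();
    conn.close();
  }
}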

koertkuipers commented 12 years ago

but isn't edges_1 a simple hive table backed by textformat?

i get the same issue with simple queries on edges (the avro-backed table), even if i don't save the results in a new table. any query that involves mapreduce will return garbled results, even on the hive command line. for example:

hive> select * from edges limit 5;
OK
347eb58d-9d73-4087-980d-cdd4f9bcffc7    9b7a42e1-8986-48c8-a933-09fecf459289
0d7f380e-0a5c-4e36-af96-5aec6872b67a    3bccb2e1-389a-4443-a1e4-72deb8ed9da3
5ac84201-4adf-460c-91d8-d4f26c73cbe3    7804d00d-5bf0-4d51-82c1-0c95faec0156
fdb87eda-472c-40ec-8807-90e930c3fd09    1e46aa69-63dc-4ee7-b25c-09645c402589
5ba3ae16-c037-486b-b599-710c77976667    224cfcb1-7be6-4c5b-83d3-918bb3bdee27
Time taken: 3.249 seconds
hive> select * from edges where id1='fdb87eda-472c-40ec-8807-90e930c3fd09';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201111011704_1307, Tracking URL = http://node01:50030/jobdetails.jsp?jobid=job_201111011704_1307
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=node01:8021 -kill job_201111011704_1307
2011-12-08 00:05:21,154 Stage-1 map = 0%, reduce = 0%
2011-12-08 00:05:32,219 Stage-1 map = 9%, reduce = 0%
2011-12-08 00:05:35,238 Stage-1 map = 21%, reduce = 0%
2011-12-08 00:05:38,260 Stage-1 map = 58%, reduce = 0%
2011-12-08 00:05:39,269 Stage-1 map = 82%, reduce = 0%
2011-12-08 00:05:42,289 Stage-1 map = 91%, reduce = 0%
2011-12-08 00:05:45,310 Stage-1 map = 100%, reduce = 0%
2011-12-08 00:05:47,321 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201111011704_1307
OK
NULL JN[fdb87eda-472c-40ec-8807-90e930c3fd09 1e46aa69-63dc-4ee7-b25c-09645c402589  NULL NULL NULL NULL NULL NULL
Time taken: 29.391 seconds
hive>

jghoman commented 12 years ago

Ah, sorry, it was late last night. It's interesting because along with some gibberish, one of the strings comes out OK. Are the data compressed? Would it be possible to get a sample of the data file so I can see if it works for me?

koertkuipers commented 12 years ago

morning. the last query should return only one result. is it possible that it did, but the result was messed up in a way that caused it to render to the display across multiple lines, with nulls and such?

for example, see what happens if i modify the query like this:

hive> select count(*) as cnt from edges where id1='fdb87eda-472c-40ec-8807-90e930c3fd09';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201111011704_1310, Tracking URL = http://node01:50030/jobdetails.jsp?jobid=job_201111011704_1310
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=node01:8021 -kill job_201111011704_1310
2011-12-08 12:25:10,291 Stage-1 map = 0%, reduce = 0%
2011-12-08 12:25:19,343 Stage-1 map = 4%, reduce = 0%
2011-12-08 12:25:22,361 Stage-1 map = 17%, reduce = 0%
2011-12-08 12:25:23,368 Stage-1 map = 21%, reduce = 0%
2011-12-08 12:25:25,381 Stage-1 map = 40%, reduce = 0%
2011-12-08 12:25:26,389 Stage-1 map = 78%, reduce = 0%
2011-12-08 12:25:27,397 Stage-1 map = 82%, reduce = 0%
2011-12-08 12:25:30,418 Stage-1 map = 91%, reduce = 0%
2011-12-08 12:25:31,427 Stage-1 map = 100%, reduce = 0%
2011-12-08 12:25:33,441 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201111011704_1310
OK
1
Time taken: 27.045 seconds

koertkuipers commented 12 years ago

yes i can provide you a sample. let me see if i can create a small sample that shows the same behavior.

i am using CDH3U2 with the haivvreo that was included with it. data is compressed with snappy.

jghoman commented 12 years ago

Great. Also try one that's not compressed. The compression feature was added recently and we're not using it, so I've not seen any bugs from it yet.

koertkuipers commented 12 years ago

can you provide me an email address to send the sample to?


jghoman commented 12 years ago

jghoman at gmail. I'm booked for the next few hours, but will try to process it with my local haivvreo and see what happens.

koertkuipers commented 12 years ago

i am getting the same issues with uncompressed data. i guess next i am going to check if my datafiles are somehow generated in a corrupt way.

koertkuipers commented 12 years ago

i read my datafiles (compressed and uncompressed) into python and inspected them. they look fine.

jghoman commented 12 years ago

I'm afraid I can't reproduce what you're seeing with my setup. Counts come out to 10k records in the file you sent; is this correct?

Regular statements work fine: https://gist.github.com/1449702 as do more complex, multistage queries: https://gist.github.com/1449703

This may indicate there's something significantly different in our setups. I'm running a single-node Apache Hadoop 0.20.2 cluster with Hive 0.7.1 and the latest trunk of Haivvreo. The Haivvreo code should be the same in CDH (assuming they didn't make any changes they didn't contribute back, which I don't believe they did). I created an external table pointing at all the files you sent and specified the schema as a file on HDFS containing the schema of the first part file... The schema itself looks simple enough; I doubt any schema evolution has happened that might be messing up resolution.

Is this the first time you're using haivvreo, or do you have other tables that are working? Any change between them?

Maybe @tomwhite has seen something like this in CDH-land?

jghoman commented 12 years ago

Is the actual file backing edges_1 in your case gibberish as well? Maybe it's just the CLI that's having trouble?

koertkuipers commented 12 years ago

files backing edges_1 are gibberish indeed


koertkuipers commented 12 years ago

i created a text-based table (edges_txt) with the same data in it, and an empty avro-backed table (edges_from_txt). then i did a simple:

insert overwrite table edges_from_txt select * from edges_txt;
create table edges_to_txt as select * from edges_from_txt;

i inspected the avro data backing edges_from_txt and it is good. then i inspected the text data backing edges_to_txt and it is garbled. so it seems to happen on reading from avro in hadoop, and not on writing to it.


jghoman commented 12 years ago

One of the queries you mentioned above should only return one result, but was returning 8. What do you get with the data you sent me on this query?

hive> select * from edges where id1='fffb5e17-58be-47e4-a0e1-0b3431300c04';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201112081707_0016, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201112081707_0016
Kill Command = /Users/jhoman/hadoop_installs/20d2/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201112081707_0016
2011-12-08 18:25:00,498 Stage-1 map = 0%, reduce = 0%
2011-12-08 18:25:03,556 Stage-1 map = 4%, reduce = 0%
2011-12-08 18:25:06,615 Stage-1 map = 13%, reduce = 0%
2011-12-08 18:25:12,668 Stage-1 map = 21%, reduce = 0%
2011-12-08 18:25:15,694 Stage-1 map = 29%, reduce = 0%
2011-12-08 18:25:18,733 Stage-1 map = 37%, reduce = 0%
2011-12-08 18:25:21,774 Stage-1 map = 46%, reduce = 0%
2011-12-08 18:25:24,801 Stage-1 map = 54%, reduce = 0%
2011-12-08 18:25:27,826 Stage-1 map = 58%, reduce = 0%
2011-12-08 18:25:30,839 Stage-1 map = 67%, reduce = 0%
2011-12-08 18:25:33,864 Stage-1 map = 75%, reduce = 0%
2011-12-08 18:25:39,917 Stage-1 map = 83%, reduce = 0%
2011-12-08 18:25:42,940 Stage-1 map = 92%, reduce = 0%
2011-12-08 18:25:45,990 Stage-1 map = 100%, reduce = 0%
2011-12-08 18:25:49,007 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112081707_0016
OK
fffb5e17-58be-47e4-a0e1-0b3431300c04    6e896e4a-df8d-415d-9054-e80f571d177a

If there's something wrong with the record reader (returning too many results), it might look like this.

koertkuipers commented 12 years ago

10k records is correct. this is the first time i am using haivvreo, and i do not have any other tables that are working; all the tables i have created so far have had the same problems.


koertkuipers commented 12 years ago

ok, solved it. see your personal email.

tomwhite commented 12 years ago

No, haven't seen this problem before.

The Haivvreo should be the same with CDH (assuming they didn't make any changes they didn't contribute back, which I don't believe they did).

Yes, it is the same code.

Tom

koertkuipers commented 12 years ago

Tom, this is what i think is going on (copying from a private conversation with Jakob):

ok, in the same hive terminal i had been using, i created a new table from my (sound) text-backed table edges_txt, something like: create table edges_txt_2 as select * from edges_txt; then when i did a select from edges_txt_2 it was also gibberish, and there was no avro involved here. so i looked at the files behind it on disk and they had extension .avro! so somehow my hive session started defaulting to giving all resulting files the extension .avro.

the same exercise in a fresh hive terminal did not have this problem.

fixing the extensions of the files behind edges_txt_2 from .avro to .lzodeflate made the gibberish go away.

so i think there is some code in haivvreo that mistakenly changes the extensions of output files. and this extension-changing behavior seems to happen for the rest of the hive session, even when i am not working with avro-backed tables (once i have issued a query on at least one table that is avro-backed).

this also explains why it only shows up with mapreduce. mapreduce always creates new files, even when you feed the results to the terminal.
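For context, Hadoop's text read path picks a decompression codec purely from the file-name suffix, which is why a compressed file with the wrong extension comes back as gibberish. Below is a minimal standalone sketch of that lookup; it assumes the LZO codec is registered via io.compression.codecs (as it would be on a CDH cluster with LZO installed), and the class name and paths are made-up examples rather than haivvreo code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ExtensionLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // CompressionCodecFactory resolves a codec from the file-name suffix alone.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // A correctly named output file resolves to a codec and gets decompressed on read.
    CompressionCodec ok =
        factory.getCodec(new Path("/user/hive/warehouse/edges_txt_2/000000_0.lzo_deflate"));

    // The same bytes renamed with an .avro suffix match no codec, so a text reader
    // hands the raw compressed bytes straight through -- the gibberish rows above.
    CompressionCodec bad =
        factory.getCodec(new Path("/user/hive/warehouse/edges_txt_2/000000_0.avro"));

    System.out.println("codec for .lzo_deflate: " + (ok == null ? "none" : ok.getClass().getName()));
    System.out.println("codec for .avro:        " + (bad == null ? "none" : bad.getClass().getName()));
  }
}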


jghoman commented 12 years ago

@tomwhite Does the change in 899a2c44cd25aca6c469d6c2fdb8419e66dda380 reset after each job? It sounds like the new file extension applies to all files created afterwards, even if they're not avro files. If so, we need to fix that.
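If the commit works by setting Hive's output extension property on the shared session-level configuration rather than on a per-job copy, the leak would look roughly like the sketch below. This is an assumed mechanism for illustration only: the property name hive.output.file.extension is a standard Hive setting, but whether the commit touches it, and the helper shown here, are guesses rather than the actual haivvreo code.

import org.apache.hadoop.hive.conf.HiveConf;

public class ExtensionLeakSketch {
  // Hypothetical stand-in for whatever runs when an Avro-backed table is queried.
  static void prepareAvroOutput(HiveConf sessionConf) {
    // Setting this on the session-level HiveConf (instead of a per-job copy) would make
    // every later file sink in the same CLI session name its output files *.avro,
    // regardless of the format the files are actually written in.
    sessionConf.set("hive.output.file.extension", ".avro");
  }

  public static void main(String[] args) {
    HiveConf sessionConf = new HiveConf();
    prepareAvroOutput(sessionConf);  // query 1 touches an Avro table

    // query 2 is a plain CTAS into a text table, but the stale setting is still there:
    System.out.println(sessionConf.get("hive.output.file.extension", ""));  // ".avro"
  }
}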

koertkuipers commented 12 years ago

any query on a table that is backed by avro files results in the new output files getting a .avro extension, even though they are not avro files, which produces the gibberish (the files are misinterpreted on read).

koertkuipers commented 12 years ago

@tomwhite i removed commit 899a2c4, rebuilt the jar, and deployed it. doing a "create table edges1 as select * from edges", where edges is an avro-backed hive table, still causes the files for edges1 to have the .avro extension (despite the fact that they are lzo-compressed text files). i am puzzled. where else could this extension be set?

koertkuipers commented 12 years ago

after removing commit 899a2c4, redistributing the jar to all nodes on the cluster, and restarting the tasktrackers, everything seems to work now.

it didn't work before because i only replaced the jar in my personal hive lib folder, but i forgot that the jars are also on the classpath on the cluster, and apparently those got used.

jghoman commented 12 years ago

Good to know we've tracked it down. I'll take a look this week at a fix. We need to make sure we don't replace file extensions willy-nilly. Thanks for keeping at this, Koert.
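One possible shape for such a fix is to scope the extension override to the single job that actually writes Avro and restore the previous value afterwards, instead of mutating the shared configuration for the rest of the session. The sketch below only illustrates that idea under the assumption that hive.output.file.extension is the property involved; the class and its usage are hypothetical, not the eventual patch.

import org.apache.hadoop.hive.conf.HiveConf;

// Apply the .avro extension only while the Avro-writing job is being set up and run,
// then put back whatever value was there before.
public class ScopedOutputExtension implements AutoCloseable {
  private static final String KEY = "hive.output.file.extension";  // assumed property
  private final HiveConf conf;
  private final String previous;

  public ScopedOutputExtension(HiveConf conf, String extension) {
    this.conf = conf;
    this.previous = conf.get(KEY, "");  // remember the old value ("" = Hive's default behavior)
    conf.set(KEY, extension);
  }

  @Override
  public void close() {
    conf.set(KEY, previous);  // restore, so later non-Avro jobs in the session are unaffected
  }

  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    try (ScopedOutputExtension scope = new ScopedOutputExtension(conf, ".avro")) {
      // ... configure and launch the Avro-writing job here ...
    }
    System.out.println(conf.get(KEY, ""));  // prints the pre-existing value, not ".avro"
  }
}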