koertkuipers opened this issue 12 years ago
Is there a schema set for the new table (describe extended and check for schema.literal or schema.url)?
the new table is created in hive's default format.
hive> describe extended edges_1;
OK
id1 string
id2 string
Detailed Table Information Table(tableName:edges_1, dbName:default, owner:koert, createTime:1323313762, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id1, type:string, comment:null), FieldSchema(name:id2, type:string, comment:null)], location:hdfs://node01/user/hive/warehouse/edges_1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{transient_lastDdlTime=1323313762}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
this is the avro backed table:
hive> describe extended edges;
OK
id1 string from deserializer
id2 string from deserializer
Detailed Table Information Table(tableName:edges, dbName:default, owner:koert, createTime:1323311884, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://node01/data/edges, inputFormat:com.linkedin.haivvreo.AvroContainerInputFormat, outputFormat:com.linkedin.haivvreo.AvroContainerOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.linkedin.haivvreo.AvroSerDe, parameters:{schema.url=hdfs://node01:8020/data/schemas/edges.json, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1323311884}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE) Time taken: 0.095 seconds
and this is how i created it:
CREATE EXTERNAL TABLE edges
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs://node01:8020/data/schemas/edges.json')
STORED AS INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/edges/';
Tables created this way don't bring along the source table's serde properties. The new table (edges_1) doesn't have a schema.url telling it where to find a reader schema. Use ALTER TABLE ... SET SERDEPROPERTIES (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTableProperties) to set schema.url to the same value as in edges. Although, if this is the problem, we definitely need to handle the exception better.
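For what it's worth, here is a sketch (not from the thread) of issuing that DDL from Python via the hive CLI's -e flag; the run_hive helper is hypothetical, and the schema.url value is the one used in this issue:

```python
import subprocess

# DDL that copies the reader-schema location used by the original
# table (schema.url value taken from this thread).
DDL = (
    "ALTER TABLE edges_1 SET SERDEPROPERTIES "
    "('schema.url'='hdfs://node01:8020/data/schemas/edges.json')"
)

def run_hive(statement):
    """Run a single HiveQL statement with `hive -e`; raise on failure."""
    subprocess.run(["hive", "-e", statement], check=True)

# run_hive(DDL)  # requires a working hive installation on the PATH
```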
but isn't edges_1 a simple hive table backed by text format?
i get the same issue with simple queries on edges (the avro backed table), even if i don't save the results in a new table. any query that involves mapreduce will return garbled results, even on the hive command line. for example:
hive> select * from edges limit 5;
OK
347eb58d-9d73-4087-980d-cdd4f9bcffc7 9b7a42e1-8986-48c8-a933-09fecf459289
0d7f380e-0a5c-4e36-af96-5aec6872b67a 3bccb2e1-389a-4443-a1e4-72deb8ed9da3
5ac84201-4adf-460c-91d8-d4f26c73cbe3 7804d00d-5bf0-4d51-82c1-0c95faec0156
fdb87eda-472c-40ec-8807-90e930c3fd09 1e46aa69-63dc-4ee7-b25c-09645c402589
5ba3ae16-c037-486b-b599-710c77976667 224cfcb1-7be6-4c5b-83d3-918bb3bdee27
Time taken: 3.249 seconds
hive> select * from edges where id1='fdb87eda-472c-40ec-8807-90e930c3fd09';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201111011704_1307, Tracking URL = http://node01:50030/jobdetails.jsp?jobid=job_201111011704_1307
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=node01:8021 -kill job_201111011704_1307
2011-12-08 00:05:21,154 Stage-1 map = 0%, reduce = 0%
2011-12-08 00:05:32,219 Stage-1 map = 9%, reduce = 0%
2011-12-08 00:05:35,238 Stage-1 map = 21%, reduce = 0%
2011-12-08 00:05:38,260 Stage-1 map = 58%, reduce = 0%
2011-12-08 00:05:39,269 Stage-1 map = 82%, reduce = 0%
2011-12-08 00:05:42,289 Stage-1 map = 91%, reduce = 0%
2011-12-08 00:05:45,310 Stage-1 map = 100%, reduce = 0%
2011-12-08 00:05:47,321 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201111011704_1307
OK
NULL
JN[fdb87eda-472c-40ec-8807-90e930c3fd09 1e46aa69-63dc-4ee7-b25c-09645c402589
NULL
NULL
NULL
NULL
NULL
NULL
Time taken: 29.391 seconds
hive>
Ah, sorry, it was late last night. It's interesting because along with some gibberish, one of the strings comes out OK. Are the data compressed? Would it be possible to get a sample of the data file so I can see if it works for me?
morning. the last query should return only one result. is it possible that it did, but the result was messed up in a way that caused it to render to the display with multiple lines, with nulls and such?
for example see what happens if i modify the query like this:
hive> select count(*) as cnt from edges where id1='fdb87eda-472c-40ec-8807-90e930c3fd09';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
yes i can provide you a sample. let me see if i can create a small sample that shows the same behavior.
i am using CDH3U2 with the haivvreo that was included with it. data is compressed with snappy.
Great. Also try one that's not compressed. The compression feature was added recently and we're not using it, so I've not seen any bugs from it yet.
can you provide me an email address to send the sample to?
jghoman at gmail. I'm booked for the next few hours, but will try to process it with my local haivvreo and see what happens.
i am getting the same issues with uncompressed data. i guess next i am going to check if my datafiles are somehow generated in a corrupt way.
i read in my datafiles (compressed and uncompressed) into python and inspected them. they look fine
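As a sanity check independent of Hive (a sketch, not something from the thread): Avro object container files start with the four magic bytes Obj\x01 per the Avro spec, so it's easy to tell whether a file named .avro is really an Avro container without deserializing it:

```python
# Avro object container files begin with the magic bytes b"Obj\x01"
# (see the Avro spec); anything else is not an Avro container.
AVRO_MAGIC = b"Obj\x01"

def looks_like_avro(path):
    """Return True if the file at `path` starts with the Avro magic."""
    with open(path, "rb") as f:
        return f.read(4) == AVRO_MAGIC

def looks_like_avro_bytes(header):
    """Same check against an in-memory header."""
    return header[:4] == AVRO_MAGIC
```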
I'm afraid I can't reproduce with my setup what you're seeing. Counts come out to 10k records in the file you sent; is this correct?
Regular statements work fine: https://gist.github.com/1449702 as do more complex, multistage queries: https://gist.github.com/1449703
This may indicate there's something significantly different in our setups. I'm running a single-node Apache Hadoop 0.20.2 cluster with Hive 0.7.1 and the latest trunk of Haivvreo. The Haivvreo should be the same as CDH's (assuming they didn't make any changes they didn't contribute back, which I don't believe they did). I created an external table pointing at all the files you sent and specified the schema as a file on HDFS with the schema of the first part file... The schema itself looks simple enough; I doubt there's any evolution that has happened and might be messing up resolution.
Is this the first time you're using haivvreo, or do you have other tables that are working? Any change between them?
Maybe @tomwhite has seen something like this in CDH-land?
Is the actual file backing edges_1 in your case gibberish as well? Maybe it's just the CLI that's having trouble?
files backing edges_1 are gibberish indeed
i created a text based table (edges_txt) with the same data in it, and an empty avro backed table (edges_from_txt). then i did a simple:
insert overwrite table edges_from_txt select * from edges_txt;
create table edges_to_txt as select * from edges_from_txt;
i inspected the avro data backing edges_from_txt and it is good. then i inspected the text data backing edges_to_txt and it is garbled. so the problem seems to happen on reading from avro in mapreduce, not on writing to it.
One of the queries above you mentioned should only return one result, but was returning 8. What do you get with the data you sent me on this query?
hive> select * from edges where id1='fffb5e17-58be-47e4-a0e1-0b3431300c04';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201112081707_0016, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201112081707_0016
Kill Command = /Users/jhoman/hadoop_installs/20d2/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201112081707_0016
2011-12-08 18:25:00,498 Stage-1 map = 0%, reduce = 0%
2011-12-08 18:25:03,556 Stage-1 map = 4%, reduce = 0%
2011-12-08 18:25:06,615 Stage-1 map = 13%, reduce = 0%
2011-12-08 18:25:12,668 Stage-1 map = 21%, reduce = 0%
2011-12-08 18:25:15,694 Stage-1 map = 29%, reduce = 0%
2011-12-08 18:25:18,733 Stage-1 map = 37%, reduce = 0%
2011-12-08 18:25:21,774 Stage-1 map = 46%, reduce = 0%
2011-12-08 18:25:24,801 Stage-1 map = 54%, reduce = 0%
2011-12-08 18:25:27,826 Stage-1 map = 58%, reduce = 0%
2011-12-08 18:25:30,839 Stage-1 map = 67%, reduce = 0%
2011-12-08 18:25:33,864 Stage-1 map = 75%, reduce = 0%
2011-12-08 18:25:39,917 Stage-1 map = 83%, reduce = 0%
2011-12-08 18:25:42,940 Stage-1 map = 92%, reduce = 0%
2011-12-08 18:25:45,990 Stage-1 map = 100%, reduce = 0%
2011-12-08 18:25:49,007 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112081707_0016
OK
fffb5e17-58be-47e4-a0e1-0b3431300c04 6e896e4a-df8d-415d-9054-e80f571d177a
If there's something wrong with the record reader (returning too many results), it might look like this.
10k records is correct. this is the first time i am using haivvreo. i do not have any other tables that are working. all other tables i have created so far had the same problems.
ok, solved it. see your personal email
No, haven't seen this problem before.
The Haivvreo should be the same with CDH (assuming they didn't make any changes they didn't contribute back, which I don't believe they did).
Yes, it is the same code.
Tom
Tom, this is what i think is going on (copying from a private conversation with Jakob):
ok, in the same hive terminal i had been using, i created a new table from my (sound) text-backed edges table (edges_txt). something like: create table edges_txt_2 as select * from edges_txt; then when i did a select from edges_txt_2 it was also gibberish, and there was no avro involved here. so i looked at the files behind it on disk and they had extension .avro! so somehow my hive session started defaulting to giving all resulting files the extension .avro
the same exercise in a fresh hive terminal did not have this problem.
fixing all the extension of the files behind edges_txt_2 from .avro to .lzodeflate made the gibberish go away.
so i think there is some code in haivvreo that mistakenly changes the extension of output files. and this extension-changing behavior seems to persist in the hive session even when i am not working with avro backed tables (after i have issued a query on at least one avro backed table).
this also explains why it only shows up with mapreduce. mapreduce always creates new files, even when you feed the results to the terminal.
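The leaky-session-state diagnosis above can be sketched as a toy Python model (all names are hypothetical; this is not the actual Hive/haivvreo code, just the shape of the suspected bug):

```python
# Toy model of the suspected bug: a session-level default output
# extension that, once an Avro-backed table is queried, is never
# reset, so every later job's output files get ".avro" even when
# they are plain (here: lzo-compressed) text.
class BuggySession:
    def __init__(self):
        self.output_extension = ".lzodeflate"  # normal default in this thread

    def query_avro_table(self):
        # the serde sets the extension for its own output files...
        self.output_extension = ".avro"

    def output_filename(self, base):
        # ...but the setting leaks into every subsequent job
        return base + self.output_extension

session = BuggySession()
before = session.output_filename("part-00000")  # correct extension
session.query_avro_table()
after = session.output_filename("part-00000")   # wrong: text file named .avro
```

A per-job reset of output_extension (or scoping it to the serde's own writes) is what the fix needs to guarantee.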
@tomwhite Does the change in 899a2c44cd25aca6c469d6c2fdb8419e66dda380 reset after each job? It sounds like the new file extension applies to all files created afterwards, even if they're not avro files. If so, we need to fix that.
any query on a table that is backed by avro files will result in the new output files getting the .avro extension, even if they are not avro files, and the gibberish is the result (the files are then misinterpreted on read).
@tomwhite i removed commit 899a2c4, rebuilt the jar and deployed it. doing a "create table edges1 as select * from edges", where edges is an avro backed hive table, still causes the files for edges1 to get the .avro extension (despite the fact that they are lzo compressed text files). i am puzzled. where else could this extension be set?
after removing commit 899a2c4, redistributing the jar to all nodes on the cluster, and restarting the tasktrackers, everything seems to work now.
it didn't work before because i only replaced the jar in my personal hive lib folder, but forgot that the jars are also on the classpath on the cluster nodes, and apparently those got used.
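That kind of stale-jar problem can be caught by comparing a checksum of the deployed jar across nodes. A sketch (not from the thread; how you collect the digests from each node is left open):

```python
import hashlib

def jar_digest(path):
    """MD5 hex digest of a jar file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def digests_match(digests):
    """True if every node reports the same jar digest."""
    return len(set(digests)) == 1
```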
Good to know we've tracked it down. I'll take a look this week at a fix. We need to make sure we don't replace file extensions willy-nilly. Thanks for keeping at this, Koert.
my data comes out of external tables backed by avro nicely when i just do "select * from table limit 10;"
however as soon as i do anything that involves map-reduce the data becomes garbled.
for example:
hive> select * from edges limit 5;
OK
347eb58d-9d73-4087-980d-cdd4f9bcffc7 9b7a42e1-8986-48c8-a933-09fecf459289
0d7f380e-0a5c-4e36-af96-5aec6872b67a 3bccb2e1-389a-4443-a1e4-72deb8ed9da3
5ac84201-4adf-460c-91d8-d4f26c73cbe3 7804d00d-5bf0-4d51-82c1-0c95faec0156
fdb87eda-472c-40ec-8807-90e930c3fd09 1e46aa69-63dc-4ee7-b25c-09645c402589
5ba3ae16-c037-486b-b599-710c77976667 224cfcb1-7be6-4c5b-83d3-918bb3bdee27
hive> create table edges_1 as select * from edges;
Total MapReduce jobs = 2
...
hive> select * from edges_1 limit 10;
OK
��w!p73a2b0ae-bfa5-42fe-bd8a-607c5bbf5623 4c504f74-b518-4381-aa00-b0f4ad1ecd0e 0f3c4d08-934f-4ba2-94e7-53e4910f1095 d01b4177-9c9c-4417-P /e-93a004a03a4e 12331383-21fb-4b75-ba3a-dece3256ea90 1ea29a89-4f5eD 6a2-824e-af6e01808a9d 3b7ffcd8-1bff-4035-8650-234bea646c48 7ed7c3f1-905f-H4e-9522-4b5b5a0befa0 6c7db398-7c75-457b-af20-32f7eefe9285 941f1ac5-afc0d a-a30f-864d0d956249 cf7def1f-a925-46d L -8cd74xef0 �
72-4d768385940a NULL $ 6255463d-941c-4ab4-8ee6-26c0b428b72c �
not sure what is going on here... the only haivvreo related logging i could see was:
2011-12-07 22:08:52,397 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Unable to match filesplit /data/edges/part-00002.avro:0+30819496 with a partition.