greenlion / swanhart-tools

Swanhart Toolkit - MySQL tools for MPP query, Materialized Views, Async queries, and JSON UDF
Other
553 stars 141 forks source link

Data only from the first shard and ICE UDF install problem !!! #133

Closed DreamKetchup closed 8 years ago

DreamKetchup commented 9 years ago

Hello, I am new to shard-query and have some problems with it. I have a simple star schema with 7 shards(for each day of the week). On diffrent queries i receive data only from my first shard. Ive created simple test schema with 3 shards and tried same queries and everything worked fine. So now i'm really confused and don't know what can be the source of the problem.

Second problem with Infobright CE UDF installation.

mysql> CREATE FUNCTION gman_do RETURNS STRING -> SONAME "libgearman_mysql_udf.so";
ERROR 1126 (HY000): Can't open shared library 'libgearman_mysql_udf.so' (errno: 0 feature disabled) Using Ubuntu 14.04 and Infobright CE v. 4.0.7

Please advise me how to resolve these problems.

greenlion commented 9 years ago

Hi,

Use the --explain option of run_query to see what Shard-Query is doing with the seven shards. It sounds like not all of the shards have the shard key in them. Shard-Query will pick one of the shards as the "coordinator" shard, and the metadata for tables (such as what columns are in the table) is retrieved from that shard. If a table doesn't have the shard key, that means the table should be duplicated on each node. Thus a query from it should be sent to only one in that case, because to send it to more than one would result in duplicate rows.

You can't install the UDF with ICE. You will need to set up a regular MySQL database, create a database with the same name as your shard schema (it doesn't need to contain any tables), install the UDF, then use either the proxy or sq_helper.

I hope this helps,

--Justin

On Thu, Oct 15, 2015 at 6:39 AM, DreamKetchup notifications@github.com wrote:

Hello, I am new to shard-query and have some problems with it. I have a simple star schema with 7 shards(for each day of the week). On diffrent queries i receive data only from my first shard. Ive created simple test schema with 3 shards and tried same queries and everything worked fine. So now i'm really confused and don't know what can be the source of the problem.

Second problem with Infobright CE UDF installation.

mysql> CREATE FUNCTION gman_do RETURNS STRING -> SONAME "libgearman_mysql_udf.so";

ERROR 1126 (HY000): Can't open shared library 'libgearman_mysql_udf.so' (errno: 0 feature disabled) Using Ubuntu 14.04 and Infobright CE v. 4.0.7

Please advise me how to resolve these problems.

— Reply to this email directly or view it on GitHub https://github.com/greenlion/swanhart-tools/issues/133.

DreamKetchup commented 9 years ago

Hello Justin, Ok now i've noticed some strange behaviour. After installing new schemata. All shards are marked as coord_shard in shard-query GUI. But in query I'm choosing as coord shard only1st shard. Result from queries can be diffrent every time, sometimes from 1st shard,2nd so on. It shows only result from one shard. If i unmark all except 1st shard as coord_shard, then result will be only from 1st shard. I think i messed up with something

I'll try to describe the whole picture in more detail. This is a little piece of my fact table that i'm sharding. day_id is my shard column. All other dimensions are duplicated in each shard. In 1st shard also have some temporary tables, they are not used in reporting, so i did not duplicate them.

fact

So here is my shards

shard_set

I hope I have configured them correctly.

This is the query:

select object_dim.host_name as c0, day_dim.dayOfweekDesc as c2, sum(state_fact.value) as m0 from object_dim as object_dim, state_fact as state_fact, day_dim as day_dim where state_fact.object_id = object_dim.object_id and object_dim.host_name = 'DECT_GebF_3

    '

and state_fact.day_id = day_dim.day_id group by object_dim.host_name, day_dim.dayOfweekDesc

_runquery --verbose showed me this:

Shard-Query optimizer messages:

SQL TO SEND TO SHARDS: Array ( [0] => SELECT object_dim.host_name AS expr$0,day_dim.dayOfweekDesc AS expr$1,SUM(state_fact.value) AS expr_2617061748 FROM object_dim AS object_dim CROSS JOIN state_fact AS state_fact CROSS JOIN day_dim AS day_dim WHERE 1=1 AND state_fact.object_id = object_dim.object_id and object_dim.host_name = 'DECT_GebF_3

' and `state_fact`.`day_id` = `day_dim`.`day_id` GROUP BY expr$0,expr$1

) SQL TO SEND TO COORDINATOR NODE: SELECT expr$0 AS c0,expr$1 AS c2,SUM(expr_2617061748) AS m0 FROM aggregation_tmp_69434312 GROUP BY expr$0,expr$1

Array ( [c0] => DECT_GebF_3 [c2] => Sunday [m0] => 128160 ) 1 rows returned Exec time: 106.28450512886 INSTRUMENATION: Array ( [mysql_query_count] => 0 [mysql_prepare_count] => 0 [mysql_prepare_time] => 0 [mysql_connection_count] => 0 [mysql_connect_time] => 0 [mysql_query_exec_time] => 0 [mysql_deadlock_count] => 0 [php_service_time] => 1444997607.8773 [memory_usage] => 1532496 [cpu_user] => 20000 [cpu_system] => 4000 )

Only one result from the 1st shard. I don't see here that shard-query send query to other shards.

I hope you can explain me what i'm doing wrong.

Thanks in advance

greenlion commented 9 years ago

Hi,

Do any of the tables in the query have the shard key in them? If not, the query goes to only one node.

Let me see your .ini file and a show create table for each table involved.

On Fri, Oct 16, 2015 at 6:00 AM, Aleksandr notifications@github.com wrote:

Hello Justin, Ok now i've noticed some strange behaviour. After installing new schemata. All shards are marked as coord_shard in shard-query GUI. But in query I'm choosing as coord shard only1st shard. Result from queries can be diffrent every time, sometimes from 1st shard,2nd so on. It shows only result from one shard. If i unmark all except 1st shard as coord_shard, then result will be only from 1st shard. I think i messed up with something

I'll try to describe the whole picture in more detail. This is a little piece of my fact table that i'm sharding. day_id is my shard column. All other dimensions are duplicated in each shard. In 1st shard also have some temporary tables, they are not used in reporting, so i did not duplicate them.

[image: Встроенное изображение 1]

So here is my shards

[image: Встроенное изображение 2]

I hope I have configured them correctly.

This is the query:

select object_dim.host_name as c0, day_dim.dayOfweekDesc as c2, sum(state_fact.value) as m0 from object_dim as object_dim, state_fact as state_fact, day_dim as day_dim where state_fact.object_id = object_dim.object_id and object_dim.host_name = 'DECT_GebF_3

' and state_fact.day_id = day_dim.day_id group by object_dim.host_name, day_dim.dayOfweekDesc

_runquery --verbose showed me this:

Shard-Query optimizer messages:

  • The following projections may be selected for a UNIQUE CHECK on the storage node operation: expr$0,expr$1
  • storage node result set merge optimization enabled: ON DUPLICATE KEY UPDATE expr_2617061748=expr_2617061748 + VALUES(expr_2617061748)

SQL TO SEND TO SHARDS: Array ( [0] => SELECT object_dim.host_name AS expr$0,day_dim.dayOfweekDesc AS expr$1,SUM(state_fact.value) AS expr_2617061748 FROM object_dim AS object_dim CROSS JOIN state_fact AS state_fact CROSS JOIN day_dim AS day_dim WHERE 1=1 AND state_fact.object_id = object_dim.object_id and object_dim.host_name = 'DECT_GebF_3

' and state_fact.day_id = day_dim.day_id GROUP BY expr$0,expr$1 ) SQL TO SEND TO COORDINATOR NODE: SELECT expr$0 AS c0,expr$1 AS c2,SUM(expr_2617061748) AS m0 FROM aggregation_tmp_69434312 GROUP BY expr$0,expr$1

Array ( [c0] => DECT_GebF_3 [c2] => Sunday [m0] => 128160 ) 1 rows returned Exec time: 106.28450512886 INSTRUMENATION: Array ( [mysql_query_count] => 0 [mysql_prepare_count] => 0 [mysql_prepare_time] => 0 [mysql_connection_count] => 0 [mysql_connect_time] => 0 [mysql_query_exec_time] => 0 [mysql_deadlock_count] => 0 [php_service_time] => 1444997607.8773 [memory_usage] => 1532496 [cpu_user] => 20000 [cpu_system] => 4000 )

Only one result from the 1st shard. I don't see here that shard-query send query to other shards.

I hope you can explain me what i'm doing wrong.

Thanks in advance

2015-10-16 3:46 GMT+02:00 Justin Swanhart notifications@github.com:

Hi,

Use the --explain option of run_query to see what Shard-Query is doing with the seven shards. It sounds like not all of the shards have the shard key in them. Shard-Query will pick one of the shards as the "coordinator" shard, and the metadata for tables (such as what columns are in the table) is retrieved from that shard. If a table doesn't have the shard key, that means the table should be duplicated on each node. Thus a query from it should be sent to only one in that case, because to send it to more than one would result in duplicate rows.

You can't install the UDF with ICE. You will need to set up a regular MySQL database, create a database with the same name as your shard schema (it doesn't need to contain any tables), install the UDF, then use either the proxy or sq_helper.

I hope this helps,

--Justin

On Thu, Oct 15, 2015 at 6:39 AM, DreamKetchup notifications@github.com wrote:

Hello, I am new to shard-query and have some problems with it. I have a simple star schema with 7 shards(for each day of the week). On diffrent queries i receive data only from my first shard. Ive created simple test schema with 3 shards and tried same queries and everything worked fine. So now i'm really confused and don't know what can be the source of the problem.

Second problem with Infobright CE UDF installation.

mysql> CREATE FUNCTION gman_do RETURNS STRING -> SONAME "libgearman_mysql_udf.so";

ERROR 1126 (HY000): Can't open shared library 'libgearman_mysql_udf.so' (errno: 0 feature disabled) Using Ubuntu 14.04 and Infobright CE v. 4.0.7

Please advise me how to resolve these problems.

— Reply to this email directly or view it on GitHub https://github.com/greenlion/swanhart-tools/issues/133.

— Reply to this email directly or view it on GitHub < https://github.com/greenlion/swanhart-tools/issues/133#issuecomment-148572376

.

— Reply to this email directly or view it on GitHub https://github.com/greenlion/swanhart-tools/issues/133#issuecomment-148710662 .

DreamKetchup commented 9 years ago

Hi, Do i need to have shard_key in all dimensions or only in fact_table??? Before i had only in fact_table and all other dimensions were only duplicated in each shard.

Anyway i tried and added shard_key to each table that i'm using in query. So my tables look like this.

CREATE TABLE IF NOT EXISTS state_fact ( object_id int(11) NOT NULL, time_id int(11) , date_id int(11) NOT NULL, value_id int(11), state_type_id int(11), value float(11) DEFAULT NULL, day_id tinyint(2) DEFAULT NULL, interval_id int(11), shard_key int(3) DEFAULT 1 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS object_dim ( object_id int(11) NOT NULL, host_name varchar(128), service_name varchar(200), uuid_host varchar(128), uuid_service varchar(128), is_active smallint(6), shard_key int(3) DEFAULT 1 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS day_dim;

CREATE TABLE IF NOT EXISTS day_dim ( day_id int(11) NOT NULL, dayOfweekDesc varchar(10) NOT NULL, dayOfweek tinyint(2) NOT NULL, shard_key int(3) DEFAULT 1 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Here i showed only 1st shard other differ only by shard_key value.

So here is my bootsrap.ini [default] user=root password=init

[config]

db=shard_query host=127.0.0.1

schema_name=1_monat

mapper=directory

column=shard_key column_datatype=integer

is_default_schema=1

inlist=false between=false

inlist_merge_threshold=128

inlist_merge_size=64

coord_engine=MYISAM

gearmand_path=/usr/sbin

[shard1] db=shard_1 host=127.0.0.1

[shard2] db=shard_2 host=127.0.0.1

[shard3] db=shard_3 host=127.0.0.1

[shard4] db=shard_4 host=127.0.0.1

[shard5] db=shard_5 host=127.0.0.1

[shard6] db=shard_6 host=127.0.0.1

[shard7] db=shard_7 host=127.0.0.1

I still receive data only from 1st shard.

greenlion commented 9 years ago

You only put the shard key in the fact table. if you do select count(_) from facttable it will scan all shards. If a table does not have the shard key in it, is must be duplicated in all shards. a select count() of the table will scan only one shard.

greenlion commented 9 years ago

Keep in mind that if it only show one SQL in the explain that does NOT mean only one shard is being scanned. It just means that the same SQL is being sent to all necessary shards, and that might be only one shard.

DreamKetchup commented 9 years ago

Hi, I found the problem. select object_dim.host_name as c0, day_dim.dayOfweekDesc as c2, sum(state_fact.value) as m0 from object_dim as object_dim, state_fact as state_fact, day_dim as day_dim where state_fact.object_id = object_dim.object_id and object_dim.host_name = \'DECT_GebF_3 \' and state_fact.day_id = day_dim.day_id group by object_dim.host_name, day_dim.dayOfweekDesc

With this query i receive data only from 1st shard. But if i delete all then everything works fine. Can you check this issue?

greenlion commented 9 years ago

Hi, please send CREATE TABLE for each involved table.

Sent from my iPhone

On Nov 6, 2015, at 3:43 AM, Aleksandr notifications@github.com wrote:

Hi, I found the problem. select object_dim.host_name as c0, day_dim.dayOfweekDesc as c2, sum(state_fact.value) as m0 from object_dim as object_dim, state_fact as state_fact, day_dim as day_dim where state_fact.object_id = object_dim.object_id and object_dim.host_name = 'DECT_GebF_3 ' and state_fact.day_id = day_dim.day_id group by object_dim.host_name, day_dim.dayOfweekDesc

With this query i receive data only from 1st shard. But if i delete all then everything works fine. Can you check this issue?

— Reply to this email directly or view it on GitHub.

greenlion commented 9 years ago

Sorry, I see them in earlier email. I will look into it this weekend.

Sent from my iPhone

On Nov 6, 2015, at 3:43 AM, Aleksandr notifications@github.com wrote:

Hi, I found the problem. select object_dim.host_name as c0, day_dim.dayOfweekDesc as c2, sum(state_fact.value) as m0 from object_dim as object_dim, state_fact as state_fact, day_dim as day_dim where state_fact.object_id = object_dim.object_id and object_dim.host_name = 'DECT_GebF_3 ' and state_fact.day_id = day_dim.day_id group by object_dim.host_name, day_dim.dayOfweekDesc

With this query i receive data only from 1st shard. But if i delete all then everything works fine. Can you check this issue?

— Reply to this email directly or view it on GitHub.

DreamKetchup commented 8 years ago

Hello again, I have some problems with ICE. I don't know how to run a query. I created an empty schema with the name that i have in sq. Then installed sq_helper but don't know how to run a query. Can i run it from webGUI ??? Can you exlain this in details.

greenlion commented 8 years ago

sq_helper won't work with ICE. you need to use the proxy. The web gui has not been updated in awhile. You might want to ask @hurdad for info about it.