manticoresoftware / columnar

Manticore Columnar Library
Apache License 2.0

Why is the CREATE TABLE SQL different from the SHOW CREATE TABLE SQL? #12

Closed sangensong closed 2 years ago

sangensong commented 2 years ago
mysql> create table idx(title text, type int, price float engine='columnar');
Query OK, 0 rows affected (0.01 sec)

mysql> show create table idx;
+-------+------------------------------------------------------------------------------+
| Table | Create Table                                                                 |
+-------+------------------------------------------------------------------------------+
| idx   | CREATE TABLE idx (
type integer,
price float engine='columnar',
title text
) |
+-------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I am very puzzled as to why the two statements above differ. Does Manticore make some adjustment to the column order?

sanikolaev commented 2 years ago

Hi. I don't think it has anything to do with the columnar library, since it behaves the same way without it:

mysql> create table idx(title text, type int, price float);
Query OK, 0 rows affected (0.00 sec)

mysql> show create table idx;
+-------+------------------------------------------------------------+
| Table | Create Table                                               |
+-------+------------------------------------------------------------+
| idx   | CREATE TABLE idx (
type integer,
price float,
title text
) |
+-------+------------------------------------------------------------+
1 row in set (0.00 sec)

If you believe it's a problem, please create an issue at https://github.com/manticoresoftware/manticoresearch/issues and explain what's wrong with this behaviour.

sangensong commented 2 years ago

Thank you for your reply. My usage scenario is as follows: I have a lot of CSV files that have already been generated, and new ones are still being produced. At present I insert these CSV files into ClickHouse. Now I want to run full-text search over my data, and given the huge volume, ClickHouse doesn't feel suitable for full-text search. That's when I found your project, and I'd like to use it for the full-text search part. Since the field layout of my CSV files is already fixed, and I see that the order of the fields seems to be adjusted when I create the table, I'm afraid these CSV files cannot be inserted into Manticore. Also, I'd like to ask: does Manticore have a command like this ClickHouse one?

cat a.csv | clickhouse-client --database=db --query="insert into table test format CSV"

My CREATE TABLE SQL looks like this:

create table t_log_http (
    log_time bigint, 
    uuid bigint, 
    src_mac text stored, 
    dst_mac text stored, 
    src_location text stored, 
    dst_location text stored, 
    app_protocol text stored, 
    app_name text stored, 
    is_ipv6 bit(8), 
    src_ip text stored, 
    dst_ip text stored, 
    ip_protocol text stored, 
    src_port bit(16), 
    dst_port bit(16), 
    method text stored, 
    uri text stored, 
    host text, 
    url_category text stored, 
    referer text stored, 
    user_agent text stored, 
    xff text stored, 
    subject text stored, 
    rsp_code text stored) engine='columnar';

I currently don't see the approach I want in the official documentation, so my idea is to insert the data with something like the following, but its execution efficiency is too low:

cat t_log_http.csv | awk -F "," '{printf("insert into t_log_http values(%s,%s,'\''%s'\'',%s,%s,%s,%s,%s,%s,%s,%s,%s,'\''%s'\'',%s);\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14)}' > t_log_http.sql
githubmanticore commented 2 years ago

➤ Sergey Nikolaev commented:

the order of the fields seems to be adjusted when I create the table, and I am afraid that these csv files cannot be inserted into the manticore

When you do INSERT INTO tbl VALUES(...), in my opinion it's always good to specify the field list explicitly, e.g. insert into t_log_http(log_time, uuid, ...) values(...).
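To illustrate the point about an explicit field list, here is a minimal Python sketch that turns a CSV row into such an INSERT statement. The column subset and the CSV row are illustrative, not from the real t_log_http data, and the quoting logic is a deliberately simplified assumption (numbers pass through unquoted, everything else is single-quoted with quotes escaped):

```python
import csv
import io

def build_insert(table, columns, row):
    """Build one INSERT with an explicit column list, so the statement
    does not depend on the column order the server stores internally."""
    def quote(v):
        # Simplified: values that parse as numbers go unquoted,
        # everything else is quoted with single quotes escaped.
        try:
            float(v)
            return v
        except ValueError:
            return "'" + v.replace("'", "\\'") + "'"

    cols = ", ".join(columns)
    vals = ", ".join(quote(v) for v in row)
    return f"INSERT INTO {table}({cols}) VALUES({vals});"

# Hypothetical 3-column subset of the t_log_http schema.
reader = csv.reader(io.StringIO("1650000000,42,GET\n"))
row = next(reader)
sql = build_insert("t_log_http", ["log_time", "uuid", "method"], row)
```

Because the column names are spelled out, the same statement works regardless of how SHOW CREATE TABLE happens to order attributes and fields.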

There's no tool like clickhouse-client which can accept data on STDIN and convert it into INSERT INTO. Plain awk may be suboptimal since it inserts data line by line, which may not be very fast, but it's the right direction. I recommend two ways:

  1. use plain index and source of type csvpipe. Interactive course: https://play.manticoresearch.com/csv/. Docs: https://manual.manticoresearch.com/Adding_data_from_external_storages/Fetching_from_CSV,TSV
  2. make a more complex script which can do batched and parallel inserts; you can take this as an example: https://gist.githubusercontent.com/sanikolaev/6a48de957b41481512ff8d94ed4af351/raw/e49d9212bcd964ad08c7199c08f401f6470b4501/load_sql.php
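The batching idea behind option 2 can be sketched in a few lines of Python (the linked PHP gist additionally does parallel inserts). This is a hedged illustration: the table and column names are made up, and the values are assumed numeric, so no quoting is done; real text fields would need escaping as well:

```python
def batch_inserts(table, columns, rows, batch_size=1000):
    """Yield multi-row INSERT statements, batch_size rows per statement.
    Packing many rows into one INSERT cuts per-statement overhead
    compared to inserting line by line."""
    cols = ", ".join(columns)
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        values = ", ".join(
            "(" + ", ".join(str(v) for v in row) + ")" for row in chunk
        )
        yield f"INSERT INTO {table}({cols}) VALUES {values};"

# Illustrative usage: three rows, two per batch -> two statements.
rows = [(1, 10), (2, 20), (3, 30)]
stmts = list(batch_inserts("t", ["a", "b"], rows, batch_size=2))
```

Each yielded statement can then be sent to the server over the MySQL protocol; running several such streams concurrently gives the parallelism the gist demonstrates.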
sangensong commented 2 years ago

The reason I don't use plain mode is that I see this in the docs:

What you cannot do with a plain index:

In my usage scenario, I will continuously generate csv files and need to continuously insert data. Is the plain index mode still appropriate in this case?

tomatolog commented 2 years ago

Full-text fields and attributes are different types, and statements like DESC and SHOW CREATE TABLE iterate over attributes first, then fields. That is why the original order is not preserved: attributes and fields are not mixed.

sanikolaev commented 2 years ago

In my usage scenario, I will continuously generate csv files and need to continuously insert data. Is the plain index mode still appropriate in this case?

Continuous data processing is also possible with plain indexes, since you can:

etc.

But it's easier to just use an RT index. I gave you a script example; if you know PHP, it shouldn't be a big deal to make it do what you want. We eventually plan to implement some manticore-client (similar to clickhouse-client), but unfortunately it's only a plan for now. Pull requests are very welcome!