heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0
2.96k stars 448 forks source link

golang python #815

Open caijielong opened 1 year ago

caijielong commented 1 year ago

How can I use programming languages ​​such as golang and python to connect to heavydb

cdessanti commented 1 year ago

Hello @caijielong,

With Python, you have the option to utilize the heavyai package for connecting, running queries, and more on HeavyDB. This package adheres to the Python DB API 2.0 standards.

To get started, you can follow the instructions for creating an environment and installing the packages available at this link: https://pypi.org/project/heavyai/

For a quick start and a comprehensive introduction to the API, including Release Notes and API Documentation, please refer to: https://heavyai.readthedocs.io/en/latest/

For additional examples, you can explore our documentation site, specifically in sections related to Python and Data Science: https://docs.heavy.ai/data-science/omnisci-data-science-foundation

Currently, we do not have a Golang-specific API, but you can explore using GDBC with our JDBC driver as an alternative option.

caijielong commented 1 year ago

@cdessanti Thank you for your careful and detailed answer

caijielong commented 1 year ago

@cdessanti Hello, I have a question. I now have 50 billion data that I want to import into heavyai to test its performance. I have tried csv, StreamImporter, KafkaImporter and insert into. The results are not satisfactory and the import speed is a bit slow. Is there any way to speed up my import speed?

cdessanti commented 1 year ago

@caijielong,

The performance of data loading depends on the data's location and format. Generally, using the COPY command with local Parquet files as the source is the fastest option. Loading data using Arrow is also quite efficient, and it's an excellent choice when you need to query data during the load process.

Data doesn't necessarily have to be in Parquet format, but if your ETL process involves Pandas, you can create Parquet files on-the-fly. Here's an example:

def load_table(self, table_name, client):
    table_arrow = pa.Table.from_pandas(self.df_to load, preserve_index=False)
    if self.mode == 'copy':
        pq.write_table(table_arrow, '/tmp/temp_nmb.pqt', compression='lz4', data_page_size=10000000)
        sql_command = f"copy {table_name} from '/tmp/temp_nmb.pqt' with(source_type='parquet_file')"
        client.con.con.execute(sql_command)
        os.remove('/tmp/temp_nmb.pqt')
    elif self mode == 'arrow':
        client.con.con.load_table_arrow(table_name, table_arrow)

In the first case, so usign the copy mode, we convert a Pandas DataFrame into an Arrow Table, create a Parquet file from Arrow on a local filesystem, and then use the server's COPY command to load it into a table. In the second mode, we simply load the Arrow table using the load_table_arrow function.

If your data is already in Parquet or CSV format, you can, load them from your S3 STORAGE location, move the files to the server and load them with the COPY command using Immerse, heavysql, another client, or programmatically as shown in the example.

The loading performance depends on various factors, including the number, data types, and contents of the columns, as well as the machine you're using for data ingestion. Typically, using local compressed CSV files, to load a regular table with 30 columns containing numerics, timestamps, and low cardinality strings, you can achieve a throughput of around 4 million records per second using 12 cores. This throughput doesn't change significantly if you're using Parquet.

I recommend starting by loading a small but representative subset of the data, optimizing the data types for the tables either manually or using the \o [table_name] command in heavysql, and then proceeding to load the remaining data for testing.

Our documentation on data loading is accessible through this link:

https://docs.heavy.ai/loading-and-exporting-data/command-line/load-data

If your data is stored in a database, you can explore the option of using the SQLImporter Java utility. However, please note that the throughput in this case may not be exceptionally high because the utility isn't primarily designed for speed.

If you'd like more assistance, please share the Data Definition Language (DDL) statements for your tables and provide additional detailed information about their formats.

By the way, processing such a high number of records will require a very powerful hardware setup with ample system and GPU memory to accommodate the data.

Regards, Candido

caijielong commented 1 year ago

@cdessanti Hello, the current import solution is flink -> kafka -> heavydb. When I import 1 billion data and execute select * from table_name limit 10;, an error will be reported! Error message: std::bad_alloc. What do I need to do?

HeavyDB Server CPU Memory Summary: MAX USE ALLOCATED FREE 20871.07 MB 0.00 MB 4096.00 MB 4096.00 MB

HeavyDB Server GPU Memory Summary: [GPU] MAX USE ALLOCATED FREE [0] 32768.00 MB 0.00 MB 4096.00 MB 4096.00 MB

cdessanti commented 1 year ago

Hi @caijielong,

The error message is quite generic, and I'm having trouble pinpointing the issue, especially with such a straightforward query and no additional data.

Firstly, could you specify where you encounter this generic error? Is it causing a database crash, or is it returned by the heavysql command? If possible, please provide a screenshot of the error message.

To assist you better, it would be beneficial if you could share the DDL of your table or provide essential information such as the number of columns, the data types used, and the fragment size (if you've deviated from the default). You can easily obtain this information by running the SHOW CREATE TABLE [table_name] command in heavysql or a similar SQL tool.

It's also crucial to know which version of the software you're using. You can find this by running the \version command in heavysql and sharing the results.

Additionally, sharing the database logs when the error occurs could provide insights. You can locate the logs in the directory where you store the data (typically /var/lib/heavyai/) under the storage/logs directory. The log files are named like this: heavydb.INFO.20231026-152844.log.

I attempted to reproduce the error using OS 7.0, and even when I started the system with very limited cache memory, I received the expected database exception:

heavysql> select * from flights_p limit 10;
ERR_OUT_OF_CPU_MEM: Not enough host memory to execute the query
heavysql> \memory_summary
HeavyDB Server CPU Memory Summary:
            MAX            USE      ALLOCATED           FREE
     2861.02 MB     2838.13 MB     2861.02 MB       22.89 MB

HeavyDB Server GPU Memory Summary:
[GPU]            MAX            USE      ALLOCATED           FREE
  [0]    10627.50 MB        0.00 MB        0.00 MB        0.00 MB

In my case, the system utilized as much memory as was available before throwing the error, suggesting that the SELECT operation aborted before execution in your case.

I'm eager to assist further once I have the requested infos, so we can work together to resolve this issue.

Regards, Candido

caijielong commented 1 year ago

Hi,@cdessanti Sorry, I will provide detailed information now. I re-imported the data. Please help me check which configuration is stuck.

1: vi /var/lib/heavyai/heavy.conf image

2: sudo systemctl restart heavydb

3: image

image

image

cdessanti commented 1 year ago

Hi,

Many thanks for the detailed information you provided.

I tried to reproduce your issue using the same database version, the very same table populated with hundreds of millions of generated data, but everything is working as expected.

While apparently, the error is in the heavysql tool; it would be better to investigate if something is going on in the server, so you should add the parameter verbose=true to your heavy.conf file and restart the server. Run in this order the following queries

select * from flights_2008_tm limit 10;
select * from tweets limit 10;

After you get the error, get the heavydb.INFO file located in cat /var/lib/heavyai/storage/log. You can get the contents of the file with the cat /var/lib/heavyai/storage/log/heavydb.INFO posting here. I've attached an example of the logfile at the same verbose level heavydb.info.txt

Also, to be sure that the heavysql tool is working as expected, please try to run the same query from another client; the SQL editor of Immerse is fine for this test, located at this URL in your machine http://localhost:6273/heavyai/sql-editor (I'm assuming that the software is installed on your workstation).

I hope to hear something from you soon.

Regards, Candido.

caijielong commented 1 year ago

Hi, @cdessanti I added configuration verbose=true. Here are the steps I took. I found that when querying multiple fields, an error will be reported. Please help me take a look.

1: sudo systemctl restart heavydb heavydb.info.txt

2: Image Image

3: I thought the amount of data was too large, so I deleted some data and then checked again. After deleting the data twice, there were more than 500 million left, but the error still occurred: std: bad_alloc Image

4: When I deleted only more than 300 million data, I changed to select the specified field. Another error was reported: ERR_OUT_OF_CPU_MEM: Not enough host memory to execute the query Image image

5: Finally, I have a question, does 100 million data require so much memory? When I delete 60 million data, execute select tweet_id,tweet_time,lat,lon,sender_id,sender_name,location,source,reply_to_user_id,reply_to_tweet_id,lang,followers,followees from tweets limit 1;

The memory has bottomed out, which is a bit incredible. image

image

cdessanti commented 1 year ago

Hi @caijielong,

It seems that you've attached the example heavy.INFO file rather than the one from your machine. To resolve the issue, I recommend restarting the server and running the SELECT * FROM tweets query. This will pinpoint the bad-alloc, enabling us to identify the problematic column.

Allow me to provide some insights into what happens internally within the server when executing a query like this. This understanding will help you grasp the memory requirements for such operations.

In heavydb, tables are locally divided into SLABs, each with a default size of 32 million rows. With your query, only the 1st SLAB is loaded into memory, representing roughly 1/32 of your table data. Approximately 3.7GB of memory is needed for data, in addition to server memory for the dictionaries required to decode the text-encoded values. Considering your average text length of around 115 bytes and multiplying it by 32 million, this accounts for the memory requirement. The tweet_text column likely demands a significant memory chunk in the form of a large dictionary, potentially causing the bad_alloc. (This assumption is based on the usage of long, distinctly different strings. For instance, if 1 billion unique strings average around 100 characters, it would require approximately 100GB of RAM to be loaded into memory.)

Addressing point 5: In your case, 100 million records require a maximum of 11.5GB of memory (115 100000000). However, running SELECT FROM table LIMIT 1 won't load 100 million records into memory but only 32 million, along with additional memory for dictionaries in TEXT-encoded fields.

For troubleshooting, verbose logs in the /var/lib/heavyai/storage/log/ directory, specifically the file named heavydb.INFO-[DATE]-[TIME].log (e.g., heavydb.INFO.20230828-072844.log), provide valuable information, like memory used by various pool including the dictionaries. Kindly run your queries and share these logs here, as they contain pertinent details necessary to address your problem.

Candido