kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Bug: Copying data from Postgres via Pandas crashing the entire PC #4350

Closed · Gilnore closed this 16 hours ago

Gilnore commented 2 days ago

Kùzu version

V0.6.0

What operating system are you using?

Windows 11

What happened?

I was trying a COPY operation on a pandas data frame.

First my monitor size changed, enlarging everything, then the screen went black.

The data frame was very large, about 21 GB in size, filling up 96% of my RAM.

The issue doesn't go away even if I chunk down the data frame; that just delays it.

The PC automatically restarts. The screen aspect ratio is permanently changed.

Edit: further investigation shows a DWM crash followed by the PC no longer detecting the GPU.

Are there known steps to reproduce?

No response

ray6080 commented 2 days ago

Hi @Gilnore, sorry about the crash of your machine. This is weird to me. We haven't encountered any cases where running (or crashing) Kuzu leads to changed monitor settings, restarts, or GPU crashes. We can investigate this a bit more to see what went wrong. Could you kindly provide the schema of the data frame? (If it's possible to dump the data frame to a file and ship it to us, that would be helpful.) We can try to populate a data frame of that size to reproduce the issue. Also, it would be great if you could provide the hardware configuration of your machine.

@benjaminwinger can help look into this issue when he gets time.

Gilnore commented 2 days ago

Thanks for the reply. I loaded an entire SQL table with pandas.read_sql_table. The table was in Postgres. I can provide the SQLAlchemy code that generated the table. (I'll need to quickly switch devices first; got to reboot the PC. I'll attach the code in a separate comment.)

Gilnore commented 2 days ago

Here's the code that generated the SQL table:

import sqlalchemy as sa
from sqlalchemy.orm import Mapped, composite, mapped_column
import architect

@architect.install('partition',
                   type='range',
                   subtype='date',
                   constraint='100000',
                   column='date_time',
                   db=postgres_url)
class history(postgres_base):
    __tablename__ = 'history'
    identity: Mapped[stock_id] = composite(mapped_column('date_time',
                                                         sa.TIMESTAMP,
                                                         index=True),
                                           mapped_column('symbol',
                                                         sa.String,
                                                         index=True))
    Open: Mapped[float] = mapped_column('Open')
    Close: Mapped[float] = mapped_column('Close')
    High: Mapped[float] = mapped_column('High')
    Low: Mapped[float] = mapped_column('Low')
    Volume: Mapped[float | None] = mapped_column('Volume', nullable=True)
    Dividends: Mapped[float | None] = mapped_column('Dividends', nullable=True)
    Splits: Mapped[float | None] = mapped_column('Splits', nullable=True)
    __table_args__ = (sa.PrimaryKeyConstraint('date_time', 'symbol', name='identity'),
                      # {'postgresql_partition_by': 'RANGE (date_time)'}
                      )

Here's the code that generates the graph:

    def create_node_tables(self, table, kuzu_conn):
        # Make nodes of various types by putting together some tables from SQL.
        columns = []
        table_label = table.__tablename__
        # We need to add primary key statements.
        # We can try to decipher the identity arg of the table.
        primaries = [key.name for key in table.__mapper__.primary_key]
        typed_primaries = []
        # Loop through columns in the table.
        for col in table.__table__.columns:
            col_name = col.name
            col_type = col.type
            # Map SQLAlchemy types to Kuzu types (expand as needed).
            kuzu_type = self.map_sa_type(col_type)
            if col_name in primaries:
                typed_primaries.append(f"{col_name} {kuzu_type}")
            columns.append(f"{col_name} {kuzu_type}")
        # If the primary key is a composite key (i.e. len(primaries) > 1),
        # the key columns would need to be packed into a struct.
        if len(typed_primaries) > 1:
            # Join columns into the CREATE NODE statement.
            columns_str = ", \n".join(columns)
            # Unfortunately a Kuzu primary key must be string or numerical,
            # not a struct, so we're using a surrogate key instead.
            create_node_statement = f"""CREATE NODE TABLE IF NOT EXISTS
            {table_label}
            (id SERIAL,
            {columns_str},
            PRIMARY KEY (id)
            );"""
        else:
            # Join columns into the CREATE NODE statement.
            columns_str = ", \n".join(columns)
            create_node_statement = f"""CREATE NODE TABLE IF NOT EXISTS
            {table_label}
            ({columns_str},
            PRIMARY KEY ({primaries[0]})
            );"""
        kuzu_conn.execute(create_node_statement)

    def create_edge_tables(self, source, target, kuzu_conn):
        # Source and target need to be specified;
        # the only table on the edge is DOT.
        a = source.__tablename__
        b = target.__tablename__
        create_edge_statement = f"""
        CREATE REL TABLE IF NOT EXISTS {a}_effect_on_{b} (FROM {a}
                                       TO {b},
                                       source_time TIMESTAMP,
                                       target_time TIMESTAMP,
                                       correlation_index DOUBLE,
                                       exchange_index DOUBLE);"""
        kuzu_conn.execute(create_edge_statement)

    def read_to_node(self, table, kuzu_conn, chunksize=0, sample=False):
        # Pass in a SQL table object and a Kuzu connection.
        # (Assumes `import pandas as pd` at module level.)
        table_name = table.__tablename__
        if table_name == 'history':
            # Problem: the table is not detected this way.
            con = self.history_session.connection()
        else:
            con = self.session.connection()
        if chunksize > 0:
            df = pd.read_sql_table(table_name, con, chunksize=chunksize)
            # df is now an iterator yielding `chunksize` rows at a time.
            columns = kuzu_conn._get_node_property_names(table_name)
            columns = list(columns.keys())
            if 'id' in columns:
                columns.remove('id')
            for data in df:
                # Reorder the dataframe columns to match the node table.
                data = data[columns]
                try:
                    # Kuzu resolves `data` as the in-scope Pandas dataframe.
                    kuzu_conn.execute(f"COPY {table_name} FROM data")
                except Exception as e:
                    print(f'\n\n\n error when dealing with {table_name}')
                    raise e
                if sample:
                    break
        else:
            df = pd.read_sql_table(table_name, con)
            columns = kuzu_conn._get_node_property_names(table_name)
            columns = list(columns.keys())
            if 'id' in columns:
                columns.remove('id')
            df = df[columns]
            try:
                kuzu_conn.execute(f"COPY {table_name} FROM df")
            except Exception as e:
                print(f'\n\n\n error when dealing with {table_name}')
                raise e

        print(f'{table_name} nodes allocated.')
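
For reference, for the history table above (composite (date_time, symbol) key, so the surrogate-key branch fires), the generated statement should come out roughly like this, assuming map_sa_type (not shown) maps TIMESTAMP to TIMESTAMP, String to STRING, and float to DOUBLE:

CREATE NODE TABLE IF NOT EXISTS
history
(id SERIAL,
date_time TIMESTAMP,
symbol STRING,
Open DOUBLE,
Close DOUBLE,
High DOUBLE,
Low DOUBLE,
Volume DOUBLE,
Dividends DOUBLE,
Splits DOUBLE,
PRIMARY KEY (id)
);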

Also, it seems kuzu.Database('econ_graph') encounters an unknown error and crashes even if I exclude the problematic table, but everything works perfectly fine if I reduce the number of nodes to 100 using the sampling option of the read_to_node method.

Gilnore commented 2 days ago

Most of this code is modified from ChatGPT output, so there might be some bad notation and habits, like using f-strings and underscores; I don't know if that has anything to do with it.

Gilnore commented 2 days ago

Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz with integrated UHD Graphics 630; 32.0 GB RAM (31.8 GB usable); Nvidia GTX 1660 Ti GPU.

My PC is in Chinese; I don't know if the OS language has anything to do with it, but I can't change the language settings on Windows 11 Home.

Gilnore commented 2 days ago

I should also mention that when using a chunk size of 50,000, the PC crashes after running for 48 hours. The full COPY operation would probably take longer. It is also somehow eating space on my C drive, even though Kuzu is installed entirely in a Miniconda virtual environment on the E drive. The PostgreSQL database is on the D drive. Docker is on C for running kuzu-explorer, but it wasn't running.

Gilnore commented 2 days ago

I went and ran a SELECT COUNT(*) on the problematic table; there are 132,123,860 rows.

Gilnore commented 2 days ago

I tried again after clearing out some space (40 GB) on the C drive, in case there's some kind of memory leak or something that causes all the space to fill up.

The problem persists: first Task Manager crashes, then 10 GB of C drive space is instantly occupied, then a mosaic-coloured square appears in the screen centre, then the computer locks up and cannot be unlocked.

I restarted the computer at that point, trying to prevent another disconnection of the GPU.

That 10 GB remains occupied after the restart, but only 5 GB of temporary files was detected. I had deleted temporary files prior to the test. I'm going to guess this is just some safety-measure thing from Windows for now, since sniffing through the files again didn't show any changes.

Edit: I just realized that I gave the Kuzu connection my full CPU count; it might be that. I have since reduced it by half. Testing again.
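
For reference, here's how I'm capping the thread count now, if I'm reading the Python API right (a minimal sketch; the path and count are just my values):

import kuzu

db = kuzu.Database('econ_graph')
# Limit query execution to half of my 12 logical cores.
conn = kuzu.Connection(db, num_threads=6)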

Gilnore commented 2 days ago

The problem persists.

[IMG_8679, IMG_8678: screenshots of the crash messages]

That's what I managed to get as it crashed. The screen did go black, but the GPU didn't disconnect this time.

The translation is "Unknown software crash."

There's also one with an explorer.exe crash.

The DWM crash came after everything automatically closed down.

No apparent change to disk space this time.

Gilnore commented 2 days ago

Further reduction of the CPU count to 1/4 results only in a crash of the IDE and the program. I'll now try no multithreading.

Gilnore commented 2 days ago

Never mind, it seems reducing cores didn't help. What did was not opening Task Manager. It seems that as long as Task Manager is open while the program runs, the crash is always bad enough to cause a black screen.

Edit: trying again with just Spyder (my IDE) open and without multithreading, it still crashed hard. It's safe to say the crash isn't related to something like that.

The following is the only thing not black on screen: [IMG_8681]

Mouse controls are frozen; Ctrl+Alt+Delete didn't do anything.

Edit: this showed up, still no mouse control: [IMG_8682]

Then this: [IMG_8683] It translates to "If nothing happens after Ctrl+Alt+Delete, press Esc to restart."

The DWM crash message appeared after the restart.

prrao87 commented 1 day ago

Hi @Gilnore, as an alternative to using Pandas read_sql, which loads everything into memory, have you considered using Kùzu's Postgres extension to directly scan the contents of a Postgres table and copy it into Kùzu? https://docs.kuzudb.com/extensions/attach/rdbms/

You can copy subsets of your Postgres table into Kùzu by using a LOAD FROM subquery, as shown in the docs:

COPY Person FROM (LOAD FROM pg_table.person RETURN name);
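
For completeness, the full sequence around that statement would look something like this (a minimal sketch; the connection string values are placeholders for your own setup):

INSTALL postgres;
LOAD EXTENSION postgres;
ATTACH 'dbname=mydb user=postgres host=localhost port=5432' AS pg_db (dbtype postgres);
COPY Person FROM (LOAD FROM pg_db.person RETURN name);
DETACH pg_db;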

Looking at your code above, all the operations to move data between your Postgres source and Kùzu are done from within Python, which explains the high memory usage that leads to the crash.

We haven't done extensive benchmarking on scanning performance from Postgres tables, but it should be better than relying on Pandas for that large a dataset, not to mention that it should require far less glue code and Python loops. Curious to know if this approach solves your crash and memory issues.

Gilnore commented 1 day ago

Oh, that's cool, I didn't know. Let me try that real quick. (ChatGPT really let me down XD)

Gilnore commented 1 day ago

I keep running into "postgres_scanner.duckdb_extension" not found. Extension "postgres" is an existing extension. when running the attach statement for Postgres, and a Unicode error when running the install statement for SQLite.

The same error occurs for Postgres if I don't run the install statement for it, but with an extra line, like the following:

"postgres_scanner.duckdb_extension" not found. Extension "postgres" is an existing extension.

Install it first using "INSTALL postgres".

Here's the function:

    def read_to_node(self, table, kuzu_conn,
                     skip_unsupported_table=False,
                     install_ext=False,
                     load_ext=False):
        # Pass in a SQL table object and a Kuzu connection.
        # (Assumes `import os` at module level.)
        table_name = table.__tablename__
        install_stmt = None
        if table_name == 'history':
            # We'll load the plugin for PostgreSQL.
            install_stmt = """
            INSTALL postgres;
            """
            load_stmt = """
            LOAD EXTENSION postgres;
            """
            path = os.getenv('postgres_con')
            dbtp = 'postgres'
        else:
            # We'll load the plugin for SQLite.
            install_stmt = """
            INSTALL sqlite;
            """
            load_stmt = """
            LOAD EXTENSION sqlite;
            """
            path = os.getenv('sqlite_path')
            dbtp = 'sqlite'
        # Lowercase the Python bool so it reads as a Cypher boolean literal.
        att_stmt = f"""
        ATTACH '{path}' AS Original
        (dbtype {dbtp}, skip_unsupported_table = {str(skip_unsupported_table).lower()});
        """
        copy_stmt = f"COPY {table_name} FROM Original.{table_name};"

        if install_ext:
            kuzu_conn.execute(install_stmt)
        if load_ext:
            kuzu_conn.execute(load_stmt)
        kuzu_conn.execute(att_stmt)
        kuzu_conn.execute(copy_stmt)
        kuzu_conn.execute('CALL clear_attached_db_cache() RETURN *')
        kuzu_conn.execute('DETACH Original')

        print(f'{table_name} nodes allocated.')

acquamarin commented 20 hours ago

Hi @Gilnore, extensions are currently not working for v0.6.0. To use the postgres extension in Kuzu, you can install our latest dev build:

pip install kuzu==0.6.1.dev58

semihsalihoglu-uw commented 20 hours ago

I think this will be impossible for us to replicate on our side. Even if this were an issue on Kuzu's side, I cannot think of how we would be crashing other processes (and I doubt we are; the problem is probably somewhere else).

The only thing I can think of right now is to not use Kuzu, or any other DBMS, in a system where a large chunk of RAM is already occupied by other processes. If 21GB is already used by pandas, Kuzu won't be able to get enough memory to operate (Kuzu's buffer manager by default tries to grab 80% of your RAM), so that setting is not ideal. You can try to set the buffer manager to a smaller size, but not to just a few GB if your data is tens of GB.
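
For example, here is a sketch of what I mean (the 8 GB value is illustrative; buffer_pool_size is specified in bytes in the Python API):

import kuzu

# Cap the buffer manager at 8 GB instead of the default ~80% of system RAM,
# leaving headroom for other memory-heavy processes such as a large
# Pandas dataframe.
db = kuzu.Database('econ_graph', buffer_pool_size=8 * 1024**3)
conn = kuzu.Connection(db)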

Gilnore commented 16 hours ago

I see, thanks for everything. I'll continue with the extensions.