kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Reading parquet in Python has some rough edges, making it hard to diagnose problems #2139

Open prrao87 opened 9 months ago

prrao87 commented 9 months ago

I've upgraded to the latest version 0.0.9 and there are a few rough edges with reading parquet from Python.

Problem scenario

Say I have a parquet file with two columns, id and name, of type INT64 and STRING respectively.

During development, sometimes we may specify the wrong data type during table creation, for example in this case:

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id STRING,
                name STRING
            )
        """
    )

In this case, the data type of id should be INT64 but was specified as STRING. Rather than reporting the mismatch, Kùzu segfaults, making it very hard for the Python developer to know what they did wrong (it's not a Kùzu bug per se, and the mistake could easily have been fixed if the user had a better error message).

[1]    5208 segmentation fault  python test.py

In the reverse situation, the error is even more confusing. Say I have a parquet file with the same two columns, id and name, but this time the id column is also of type STRING, like the name column.

def create_node_table(conn: Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Data(
                id INT64,
                name STRING
            )
        """
    )

This time, if the user mistakenly specified the id column as INT64 in the node table creation (when they should have specified STRING), it doesn't segfault but instead gives this cryptic error.

  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 88, in execute
    self._connection.execute(
RuntimeError: ColumnChunk::templateCopyStringArrowArray

What should be a two-second fix to the data type now becomes a tedious exercise: the Python developer has to go back, inspect the data, and work out (after some deliberation) that the data type specified in Python was wrong.
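Until the error messages improve, one workaround is a pre-flight check on the Python side: compare the column types you intend to declare in the DDL against the types recorded in the parquet schema (obtainable via pyarrow.parquet.read_schema) before running the COPY. The sketch below is hypothetical and not part of the Kùzu API; the type mapping covers only a handful of Arrow types.

```python
# Hypothetical pre-flight schema check (not part of the Kùzu API).
# parquet_types would come from pyarrow.parquet.read_schema(path);
# ddl_types is what you plan to put in CREATE NODE TABLE.

# Partial, illustrative mapping from Arrow type names to Kùzu type names.
ARROW_TO_KUZU = {
    "int64": "INT64",
    "string": "STRING",
    "large_string": "STRING",
    "double": "DOUBLE",
    "bool": "BOOL",
}

def check_schema(parquet_types: dict, ddl_types: dict) -> None:
    """Raise a readable error instead of letting a mismatch hit the C++ layer."""
    for col, arrow_type in parquet_types.items():
        expected = ARROW_TO_KUZU.get(arrow_type, arrow_type.upper())
        declared = ddl_types.get(col)
        if declared is None:
            raise ValueError(f"column {col!r} is missing from the DDL")
        if declared != expected:
            raise ValueError(
                f"column {col!r}: parquet has {expected}, DDL declares {declared}"
            )
```

With this in place, the first scenario above would fail fast with "column 'id': parquet has INT64, DDL declares STRING" rather than a segfault.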

Developer experience focus

These sorts of issues are fairly generic, and I suppose adding such edge cases to the test suite could help, but it's impossible for the Kùzu team to anticipate all the ways users may break the system.

In this case, a malformed data type (or the user simply making a mistake) is a common enough scenario that the test suite can be expanded, but the longer-term problem of C++ error propagation to Python still needs thought. What should be a very simple fix can cost minutes of unnecessary developer effort even when building the simplest graphs. This is a big source of user frustration in Python, and it's natural to expect a large portion of future Kùzu users to come from Python, so it's a serious enough issue to discuss -- I think it really makes sense to deep-dive into how to present better error messages to Python users. cc @andyfengHKU @semihsalihoglu-uw

andyfengHKU commented 9 months ago

Hi @prrao87, thanks for the bug report! I'll give some of my thoughts first.

Data loading bug

First, I have to admit the behaviour in both scenarios is unacceptable. A system should handle malformed input and error out properly. For this specific case, the right behaviour is to strictly follow the DDL and always try to cast the input data to the DDL type. If a cast cannot succeed, e.g. casting a STRING to INT64, a runtime error should be thrown.
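The cast-or-error behaviour described above could be sketched like this in Python (the real fix lives in the C++ loader; the function and type names here are purely illustrative):

```python
# Illustrative sketch of "cast to the DDL type, or raise a readable error".
# This mimics the desired loader behaviour; it is not Kùzu code.

CASTERS = {
    "INT64": int,
    "DOUBLE": float,
    "STRING": str,
}

def cast_or_error(raw: str, ddl_type: str):
    """Cast a raw input value to the declared DDL type, or raise RuntimeError."""
    try:
        return CASTERS[ddl_type](raw)
    except KeyError:
        raise RuntimeError(f"Unsupported DDL type {ddl_type!r}")
    except ValueError:
        raise RuntimeError(f"Cannot cast value {raw!r} to {ddl_type}")
```

For example, cast_or_error("42", "INT64") succeeds, while cast_or_error("abc", "INT64") raises a RuntimeError that names both the offending value and the target type.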

@acquamarin will send out a fix soon.

Developer experience

Yes, this is something we haven't done a good job on. I think the deeper problem is that we don't have a good error message/reporting mechanism on the C++ side, so the Python API doesn't have proper error messages to begin with. We've received multiple pieces of feedback related to this, and I believe it's something we should address with high priority. I'll make the following two proposals to the team this week.

Once we make a decision, we'll open an issue describing our roadmap and keep you updated on the solution.

semihsalihoglu-uw commented 9 months ago

This is one of the most important things we should focus on. The main problem is the first bullet point: "Our current testing framework primarily focus on the correctness of result and does not test too many malformed queries.". We need to prioritize test cases that should throw errors -- wrong inputs, malformed queries, etc. -- to catch these, and as our test cases catch them, we'll write the necessary code to improve our error messages.

The first thing to do is for code reviewers to think seriously about malformed queries that should error and ask the PR submitters to add those tests.
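The kind of negative test being proposed could look like the following sketch. The run_copy stand-in is hypothetical (the real test would call conn.execute with a COPY statement); the point is asserting that a mismatch produces a readable Python exception rather than a crash.

```python
# Sketch of a negative test (names hypothetical, not Kùzu test-suite code):
# a type mismatch must surface as a Python exception with a readable
# message, never a segfault or an opaque internal symbol.

def run_copy(ddl_type: str, parquet_type: str) -> None:
    """Stand-in for conn.execute('COPY ...') with the desired error behaviour."""
    if ddl_type != parquet_type:
        raise RuntimeError(
            f"Cannot copy parquet column of type {parquet_type} "
            f"into table column of type {ddl_type}"
        )

def test_mismatch_raises_readable_error() -> None:
    try:
        run_copy("INT64", "STRING")
    except RuntimeError as e:
        # The message should name both types so the fix is obvious.
        assert "INT64" in str(e) and "STRING" in str(e)
    else:
        raise AssertionError("expected a RuntimeError, got silent success")
```

Reviewers could then ask for one such test per DDL/parquet type pairing touched by a PR.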

prrao87 commented 9 months ago

Thank you! What you both said makes a lot of sense 👍🏽

prrao87 commented 8 months ago

@andyfengHKU and @acquamarin I think the new parquet reader test suite definitely needs another look. Version 0.0.10 produces this bug, whose behaviour differs from the previous version.

Looking at things from a larger perspective, most of my parquet-driven workflows involve polars in some shape or form, but I believe the tests on your end assume the parquet comes from pyarrow? I wonder if it makes sense to increase the diversity of parquet files being tested -- taking files produced by polars, pyarrow, duckdb, etc. -- so that upcoming versions don't show regressions. Considering that your parquet reader is built from scratch, it's expected to have rough edges, so I hope this request will be considered for the upcoming release!
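One way to structure such a matrix is to probe at test-collection time which writer libraries are installed and parametrize over them; files from writers that aren't available are simply skipped. The library names below are real, but the harness shape is purely illustrative.

```python
# Hypothetical writer matrix for a parquet regression suite: detect which
# parquet-producing libraries are importable, then (in the real suite)
# write the same small table with each and load every file into Kùzu.
import importlib

CANDIDATE_WRITERS = ["pyarrow", "polars", "duckdb", "pandas"]

def available_writers(candidates=CANDIDATE_WRITERS):
    """Return the subset of candidate writer libraries importable here."""
    found = []
    for name in candidates:
        try:
            importlib.import_module(name)
            found.append(name)
        except ImportError:
            pass
    return found
```

A parametrized test would then iterate over available_writers(), write the same rows with each library, and assert that COPY produces identical results for every file.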

semihsalihoglu-uw commented 5 months ago

I assume this issue is still not resolved, right?

andyfengHKU commented 2 days ago

We now throw an exception if the column type doesn't match the parquet type, though we should go one step further and perform casting for numerical cases.