sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License
1.85k stars 146 forks source link

implement Trino source #585

Closed domnikl closed 2 months ago

domnikl commented 3 months ago

I'd really like to see Trino being implemented as a source in ConnectorX and thus have begun working on it. It uses prusto as a client and currently supports all basic types. Support for tuple, array, row and uuid isn't implemented yet and there's a bug with time columns not being mapped correctly. Also I want to look into fetching results more efficiently and not everything at once.

What do you think, is it worth continuing work on it? Is there anything else I need to consider eg. regarding mapping the results?

wangxiaoying commented 3 months ago

Thanks @domnikl for the PR. I think it in general looks great and complete!

Also I want to look into fetching results more efficiently and not everything at once.

Yes, I think this might be an issue in terms of performance. I don't have experience of using prusto before, but it seems like the get and get_next APIs are working for gradually fetching the results. Not sure whether it is easy to switch from get_all to these two.

time columns not being mapped correctly

I'm not sure what exactly the bug is. Currently we are converting the NaiveTime type in rust to String type in pandas in general (and also I see it in your pr), it is done by the function here. If you want to customize the conversion or mapping it to another type in pandas, it can be done by modifying the mapping in the macro as well as this function.

It would be great to also submit the corresponding seed database script here so others can run the python test_trino locally.

domnikl commented 3 months ago

@wangxiaoying thanks for looking into it!

I'm not sure what exactly the bug is

The problem is with reading timestamps with time zone types, prusto does not support it (yet) by the looks of it. Instead it returns EmptyData and it appears the result is just empty. But its not that important anyway I think to get starting. I'll create an issue in their project.

It would be great to also submit the corresponding seed database script

Will do in the next days and also rework fetching from the database with the get/get_next calls if possible.

domnikl commented 2 months ago

@wangxiaoying I implemented the missing partitioning as well as switched to get/get_next for prusto and fixed the tests and added test data and squashed some bugs along the way. Could you have a look at it again please?

wangxiaoying commented 2 months ago

Hi @domnikl , thanks so much for completing the PR!

I've tested the test cases locally and it seems works well. I have two minor comments above but in general I think the code looks good and complete. We probably also need to add a documentation page for trino here.

I think after fixing the above minor issues, we can merge it and make it into our next release!

domnikl commented 2 months ago

@wangxiaoying thanks for having a look! Added the missing pieces and a bit of documentation for it.