pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Suggest alternative setting of "data = BytesIO(table)", also support "data = BytesIO(columnname, datarow)" #10308

Closed hkpeaks closed 9 months ago

hkpeaks commented 1 year ago

Problem description

I have tested that BytesIO can accept a variable that is a bytearray. My new Python ETL is responsible for reading a particular file partition as a bytearray using the following two Python calls (implemented in C inside CPython):

file.seek(start_address_of_partition)
byte_array = file.read(length_of_partition)

Note: I have tested the above two functions to build this big-file preview app https://github.com/hkpeaks/pypeaks/blob/main/Peaks.py; it proved that Python runs exceptionally fast through these C-level functions.
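A minimal sketch of this partitioned read, assuming an illustrative file name and hypothetical partition offsets (neither is from the original post):

# Read one partition of a large CSV as raw bytes.
start_address_of_partition = 0      # hypothetical offset, aligned to a row boundary
length_of_partition = 1 << 20       # hypothetical partition size (1 MiB)

with open("10BillionRows.csv", "rb") as file:    # illustrative file name
    file.seek(start_address_of_partition)        # jump to the partition start
    byte_array = file.read(length_of_partition)  # read only that slice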

The end result is a bytearray of column names and many datarow partitions. If I join the 2 bytearray of column names and one of the datarow partitions, this means Python needs to rebuild the new bytearray by either appending or inserting, which takes extra processing time.

If BytesIO supports 2 variables, where the first variable is the column name and the second variable is the datarow, it may avoid spending extra time combining 2 bytearrays into one. In fact, my Golang ETL is implemented in this manner.

data = BytesIO(columnname, datarow)
df = pl.read_csv(data)
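For comparison, what works with today's API is a single bytes object, which forces exactly the concatenation this proposal wants to avoid. A minimal sketch, with hypothetical header and partition bytes:

from io import BytesIO
import polars as pl

columnname = b"Saleman,Amount\n"        # hypothetical header bytes
datarow = b"Mary,10000\nPeter,25000\n"  # hypothetical data-row partition

data = BytesIO(columnname + datarow)    # the extra copy this proposal avoids
df = pl.read_csv(data)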

My purpose for this arrangement is to build a custom streaming engine for Polars (or other Python libraries) to overcome datasets greater than memory. I have tested Polars on several large-dataset query scenarios, and it handles scenarios that only run Distinct/GroupBy very well. For ETL scripts that involve Filter, JoinTable, SplitFile, Sorting, etc., the result is either very slow or out of memory. The author has told me several times that their streaming engine is in an alpha stage, but I may not want to wait for it to mature, as I understand it is not simple to refactor the streaming engine inside the Polars project. My Golang ETL also faces similar problems, but at least it can run different kinds of queries on larger-than-memory data. Its main problem is that the result of every query command (except GroupBy/Distinct) must be written to a disk file; there is no way to keep it in memory for the next command.

Now I use Python to separate the streaming process from the query/transformation engine. The Python ETL is responsible for identifying partitions that fit into memory, distributing each partition to Polars to run a list of queries, and collecting the returned datasets. If the query involves Distinct/GroupBy, the streaming engine sends the collection of returned datasets to Polars once again. If the query involves Filter or JoinTable on a very large dataset, the Python streaming engine appends the returned dataset to a disk file using a non-blocking method. If the query involves sorting a very large dataset, that is a very complicated process that demands a high-performance programming language; I may convert my Golang code to Rust and provide it to you, so you can give me a new Python API for large-dataset sorting. A sketch of the partition loop follows.
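A minimal sketch of this partition loop for the GroupBy case, with hypothetical offsets and column names; it illustrates the two-pass idea, not the actual engine:

import polars as pl
from io import BytesIO

header = b"Saleman,Amount\n"  # hypothetical header bytes
# hypothetical byte ranges, aligned to row boundaries (header assumed stripped)
partition_offsets = [(0, 1 << 20), (1 << 20, 1 << 20)]

partials = []
with open("10BillionRows.csv", "rb") as f:  # file name from the example below
    for start, length in partition_offsets:
        f.seek(start)
        chunk = f.read(length)
        df = pl.read_csv(BytesIO(header + chunk))
        # first pass: aggregate each in-memory partition
        # (group_by() is spelled groupby() in older Polars versions)
        partials.append(df.group_by("Saleman").agg(pl.col("Amount").sum()))

# second pass: re-aggregate the partial results into the final answer
result = pl.concat(partials).group_by("Saleman").agg(pl.col("Amount").sum())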

Below is an example use case, deployed in a web browser via a Python WebSocket server:

User's Python variables for a custom web GUI (JavaScript collects the variables below):

Script_Name = ExampleScript
Data_File = 10BillionRows.csv (read from the local machine, not sent over the internet)
Master_File = Master.csv
Staff_Filter = Mary,Peter,Join,King
Amount_Filter = "10000..99999"

Once the user selects an ETL script and confirms the selection criteria, the web app returns an HTML table similar to the one in this video https://youtu.be/6hwbQmTXzMc (my old C# web ETL project).

Script of the new ETL Framework for Polars

The Python ETL is responsible for scanning the following query and generating an optimized Py-Polars script to run against Polars. (I built similar optimized script generation from one system to another when I worked for my ex-employer.)

ExampleScript = """ $Data_File ~ Summary.html .Filter: Saleman(%Staff_filter) .JoinTable: InnerJoin($Master_File) .AddColumn: Quantity, Unit_Price => Multiply(Amount) .Filter: Amount(%Amount_Filter) .GroupBy: Saleman, Shop, Product => Sum(Quantity) Sum(Amount) .OrderBy: Saleman(A) Product(A) Date(D)"""

Since the resulting HTML file can be very large, the Python ETL web layer is responsible for working with client-side JavaScript events to send a particular page (or the next page) of the report. The actual dataset returned from Polars for this scenario is not HTML; the Python ETL will request that Polars return the dataset as Apache Arrow. The Python ETL web layer then renders the HTML in response to JavaScript events.

Parquet is another story. It depends on whether Polars can offer a Python API that allows getting the metadata and selecting a particular row group and column. For a JSON table, the column name is repeated with every data row, so the above issue does not arise. For Excel, it must be a very, very small dataset, as the upper limit is only about one million rows.
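For what it's worth, this kind of metadata and row-group access already exists in pyarrow, which Polars can convert from; a minimal sketch, with an illustrative path and column names:

import polars as pl
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")  # illustrative path
print(pf.metadata.num_row_groups, pf.metadata.num_rows)

# read a single row group, selected columns only
table = pf.read_row_group(0, columns=["Saleman", "Amount"])
df = pl.from_arrow(table)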

mcrumiller commented 1 year ago

Sorry, this is quite a lot of text, and even after reading through all of it I still don't really know what you're asking. This is an issue tracker, where issues are either 1) reports of bugs found in polars, or 2) feature requests.

It's hard to determine whether what you're asking is some sort of enhancement to the polars API, or a business issue specific to you. It appears you have some ETL process that may or may not involve polars. What exactly do you want?

hkpeaks commented 1 year ago

Currently Polars supports data = BytesIO(byte_of_file). ## I use the Python seek() function to get the bytes of a specific file partition of a giant CSV file

I want to pass the bytes of the CSV file as two variables from the Python streaming engine.

data = BytesIO(byte_of_columnname, byte_of_datarow)
df = pl.scan_csv(data)
# plus other query functions, e.g. distinct, groupby

Currently my ETL is built in Golang. I plan to migrate to Python, with Polars and/or Rust supporting the query/transformation functions. I have written a sample Python program to verify that Python can support the new streaming engine. For query/data transformation, using Polars can save development time, but I am concerned about performance, as my research focuses on high-performance big-data processing, currently at 10 billion rows (with a plan to scale to 100 billion rows). So I will code certain query functions, e.g. sorting billions of rows, directly in Rust. The Databricks version of Spark can handle over a trillion rows for machine learning and business analytics, which explains why Databricks is very popular. But I cannot create a Databricks account, as I am retired (the company field is mandatory).

mcrumiller commented 1 year ago

Hi @hkpeaks--I would love to help you but your request is still not well-defined. I think there may be a language barrier issue here which is unfortunate.

From what I can gather, you have a stream of bytes coming in and you want polars to be able to process them somehow? Without a more descriptive scenario it's hard to diagnose what you want. Can you do a little bit of work and create a minimal example?

hkpeaks commented 1 year ago

I previously asked Bing Chat for an example of how to use Polars to read two sets of bytes, "column" and "data_row", instead of a file.csv and generate distinct results. Bing replied:

# create a byte object
csv_byte = BytesIO(column_names + '\n' + data_row)  # if these are bytes objects, the separator must be b'\n'

# read the bytes into a Polars DataFrame
df = pl.read_csv(csv_byte)

# get the distinct values in the DataFrame
distinct_values = df.distinct()  # distinct() is named unique() in current Polars

Setting performance aside, Polars gave me the exact results.

But looking at the statement csv_byte = BytesIO(column_names + '\n' + data_row), it requires Python to combine two sets of bytes, which demands extra time. So I suggest csv_file = StringIO(column_names, data_row); this would prevent Python from combining the two byte sequences into one, but it would need Polars to amend its API.

Since Go is not popular to support Python bindings, I will use Rust instead. If Polars fit for purpose, I will use Polars. Otherwise, I will convert my Go code to Rust and to build Python bindings.

mcrumiller commented 1 year ago

I'm starting to think this is a bot. @ritchie46 @stinodego can you guys make any sense of what's going on here?

hkpeaks commented 1 year ago

Ritchie Vink knows me well; you can ask him whether I am a bot. If you are not an experienced programmer, you may not be able to understand what I need. I am aware that many programmers are moving to tools that make them look like end users, so I find it difficult to find a real programmer like Ritchie Vink. Or perhaps you are not the programmer responsible for Polars' streaming engine. And I agree that all websites like GitHub and Reddit should ban robots; every confirmation of input should use a special GUI or an SMS one-time passcode. Polars is an excellent dataframe library, except for its streaming engine.

mcrumiller commented 1 year ago

@hkpeaks, I'm the only one here who has taken the time to read through your ridiculously obtuse problem description, which is incomprehensible, and you aren't helping at all by being condescending. I am working in good faith here and asked for more information. When you respond with:

If you are not an experienced programmer, you may not be able to understand what I need.

the problem is your complete inability to communicate your need, not my inability to understand it. I'll let someone else address your issue, if they can muster the courage.

hkpeaks commented 1 year ago

My suggested enhancement is very clear: it is to go from one byte object to two byte objects in BytesIO. Since you suspected I was a bot, I needed to reply to you impolitely to change your suspicion and show I am not a bot.

mcrumiller commented 1 year ago

My suggested enhancement is very clear: it is to go from one byte object to two byte objects in BytesIO.

I'm not trying to be antagonistic here; I am trying to make life easier for the developers who have to read and interpret these issues. Can you maybe update the title of your post to something more concise, like "from one byte object to two byte objects in BytesIO", which still isn't very clear?

hkpeaks commented 1 year ago

Done; the title is updated. Now you may believe I am not a bot.

cmdlineluser commented 1 year ago

It looks like they are asking for io.BytesIO() to accept multiple arguments and "chain them together" to avoid having to use + to concatenate on the Python side.

io.BytesIO is part of the Python standard library though, so it's somewhat confusing as to why you are asking the Polars team?

https://docs.python.org/3/library/io.html#io.BytesIO

There's also a note in the docs about using bytearray() instead for efficient concatenation of byte sequences:

if concatenating bytes objects, you can similarly use bytes.join() or io.BytesIO, or you can do in-place concatenation with a bytearray object. bytearray objects are mutable and have an efficient overallocation mechanism
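A minimal sketch of the three alternatives the docs mention, with illustrative byte values:

from io import BytesIO

header = b"Saleman,Amount\n"  # illustrative
chunk = b"Mary,10000\n"       # illustrative

joined = b"".join([header, chunk])  # bytes.join(): single allocation

buf = bytearray(header)             # bytearray: efficient in-place append
buf += chunk

bio = BytesIO()                     # io.BytesIO: incremental writes
bio.write(header)
bio.write(chunk)
bio.seek(0)                         # rewind before handing to a reader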

hkpeaks commented 1 year ago

You are right. It is because I am not familiar with Python that I requested the enhancement in the wrong GitHub repository. Since I am doing research on high-performance big-data processing (> 10 billion rows), I need to avoid duplicate copying of data. Python offers seek() to support reading the bytes of the column names and a specific partition of data rows directly, so the next step is to feed them to a query function as two objects. Now I know Polars does not have such an API, so I will convert my query engine written in Golang to Rust directly. I will also explore reusable functions from the Polars project, but I will call your Rust code directly. This is my first app written in Python, to test whether Python can function as a streaming engine working with a query engine written in Rust: https://github.com/hkpeaks/pypeaks/blob/main/Peaks.py

hkpeaks commented 1 year ago

Today I have completed my first app written in Rust.

Source Code (single main.rs file) : https://github.com/hkpeaks/peaks-consolidation/blob/main/Documents/PreviewFile/main.rs

I use file.seek(SeekFrom::Start(start_byte as u64)).unwrap(); and file.read_exact(&mut byte_array).unwrap(); to read the first row of every file partition, so the performance is independent of file size. It extracts and validates the first row of every partition, where the default number of partitions is 100. The demo video, however, is the Python version: https://youtu.be/71GHzDnEYno

After this experience, I will spend time learning Rust through online training to ensure a deep understanding of the language. I am preparing to use the Polars Rust API instead of the Python API, e.g. Polars' Parquet library.

I believe your users will love this hyper-fast preview and validation function if it becomes part of Polars and is introduced by Luca Zanna on LinkedIn.

alexander-beedie commented 9 months ago

(Closing: not actionable from the polars side).