skyth540 closed this issue 1 month ago.
Which version of polars are you using? 1.3 or later?
If so the root cause would be https://github.com/innobi/pantab/issues/316 which needs some upstream changes before we can fix in pantab.
If you are on a polars version earlier than 1.3 then I will have to take a closer look.
I am using the latest version of polars-lts-cpu https://pypi.org/project/polars-lts-cpu/
Sounds good. Unfortunately for the time being you will have to downgrade polars or use a different dataframe library.
Hope to have this solved relatively soon, but realistically it may take a few months (we need a new nanoarrow release to happen first!)
I downgraded to polars-lts-cpu version 1.2.1, and it looks like I'm getting the same error:
RuntimeError                              Traceback (most recent call last)
Cell In[5], line 3
      1 path = r"G:...test.hyper"
----> 3 pt.frame_to_hyper(df, path, table = 'test')

File c:\Users\nicho\anaconda3\Lib\site-packages\pantab\_writer.py:62, in frame_to_hyper(df, database, table, table_mode, not_null_columns, json_columns, geo_columns)
     51 def frame_to_hyper(
     52     df,
     53     database: Union[str, pathlib.Path],
    (...)
     59     geo_columns: Optional[set[str]] = None,
     60 ) -> None:
     61     """See api.rst for documentation"""
---> 62     frames_to_hyper(
     63         {table: df},
     64         database,
     65         table_mode=table_mode,
     66         not_null_columns=not_null_columns,
     67         json_columns=json_columns,
     68         geo_columns=geo_columns,
     69     )

File c:\Users\nicho\anaconda3\Lib\site-packages\pantab\_writer.py:108, in frames_to_hyper(dict_of_frames, database, table_mode, not_null_columns, json_columns, geo_columns)
    101     return (table.schema_name.name.unescaped, table.name.unescaped)
...
    117     # In Python 3.9+ we can just pass the path object, but due to bpo 32689
    118     # and subsequent typeshed changes it is easier to just pass as str for now
    119     shutil.move(str(tmp_db), database)

RuntimeError: Could not init schema view from child schema 0: Error parsing schema->format: Unknown format: 'vu'
Hmm strange. Maybe it is something with the lts version of polars that was backported then? I am not sure how that versioning is handled relative to the normal releases.
Our CI is pinned to the 1.2 normal version
https://github.com/innobi/pantab/blob/6dd61018beac3eef5763cdf6a0020d42d06a9401/pyproject.toml#L71
It looks like it's the same versions and release schedule as the regular polars. I tried 1.0 and still nothing
Hmm strange. I'm not sure of the steps that lead to the issue across different versions then - that may be more of a question for the polars team. The root cause of the "issue" is that polars moved away from using Arrow strings and towards Hyper strings (which Arrow calls "String View"):
https://pola.rs/posts/polars-string-type/
We currently support the default Arrow strings, but not yet the "String View" that polars uses, as it is not yet available in the Arrow C Data interface helper library that we use (i.e. nanoarrow)
If you cared to follow that development upstream, it is happening in https://github.com/apache/arrow-nanoarrow/pull/596
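To make the root cause concrete, here is a minimal sketch of my own (not from the thread; it assumes polars >= 1.3 and pyarrow >= 14, and the frame contents are made up) showing how the string-view layout surfaces when a polars frame crosses the Arrow boundary the same way pantab consumes it:

import polars as pl
import pyarrow as pa

df = pl.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

# pa.table() consumes the same Arrow PyCapsule stream that pantab reads.
# On the polars versions that hit this issue, the "name" column arrives
# as string_view (Arrow C format "vu"), which nanoarrow cannot parse yet.
table = pa.table(df)
print(table.schema)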
Any recommendations on how to stay in polars and make it work? My data frame takes about 70 GB of my 100 GB to hold in memory, so I can't call to_arrow() directly. Change my data types?
I am totally guessing but maybe you can use pyarrow to read the file then convert that to a polars dataframe? Perhaps in that transition it will retain the old Arrow string layout format? (at least, that is how the test suite works currently)
Are you actually seeing a huge increase in memory usage when calling .to_arrow()? I assume (outside of strings) that many things are zero-copy. It's possible that the strings are the problem, but worth double checking.
One final alternative I can think of is to try another library like duckdb - while we don't give first class support to reading to a duckdb table, it does use the Arrow PyCapsule interface in newer versions, so you should be able to write data from it.
I'm traveling this week but hope to take another pass at the nanoarrow root issue and get that moving upstream
One other option - the Arrow C Data stream interface does not require all of your data to be in memory. If you can stream into pantab, e.g. through a memory-mapped file, you can work with datasets that are in theory infinitely large.
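A rough sketch of that idea (my own illustration, not from the thread; it assumes your data already sits in an Arrow IPC file named data.arrow):

import pyarrow as pa
import pyarrow.ipc as ipc
import pantab as pt

# Memory-map the IPC file so record batches are paged in lazily instead
# of being loaded into RAM up front.
with pa.memory_map("data.arrow", "r") as source:
    file_reader = ipc.open_file(source)
    batches = (
        file_reader.get_batch(i) for i in range(file_reader.num_record_batches)
    )
    stream = pa.RecordBatchReader.from_batches(file_reader.schema, batches)
    # pantab accepts any object exposing the Arrow PyCapsule stream interface
    pt.frame_to_hyper(stream, "example.hyper", table="test")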
Running the following:

import pyarrow.parquet as pq
import polars as pl

folder_path = r"G:\...\parquet_bak"
table = pq.read_table(folder_path)
df = pl.from_arrow(table)

fills 80/100 GB of memory from the Arrow table, then fills up the rest and crashes in from_arrow().
Not an issue pertaining to pantab, but I can't get duckdb to work with polars-lts-cpu very well... so I am unable to try that last suggestion for now.
Are you doing anything complicated in polars? Writing via pyarrow directly is also an option if you can skip polars altogether.
Seems like the cost of going from Arrow strings to hyper/umbra strings is the likely culprit for exhausting memory when converting to polars.
Ultimately the string view type isn't roundtrippable through the Hyper DB anyway, so you will end up back with standard Arrow strings on a subsequent read.
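For example (a hedged sketch of my own; it assumes a pantab version that supports the return_type keyword and a table written as in the earlier examples):

import pantab as pt

# Reading the table back yields plain Arrow strings rather than
# string_view, whatever layout the writer started from.
table = pt.frame_from_hyper("example.hyper", table="test", return_type="pyarrow")
print(table.schema)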
It's just these 2 operations filling up the RAM:

table = pq.read_table(folder_path)  # Fills RAM to 80
df = pl.from_arrow(table)  # Fills up the rest and crashes
Unless you need polars for augmenting the table (it is unclear from your example if you do) you can skip the conversion to polars altogether and just write the pyarrow table, which may help with your memory issue:
table = pq.read_table(folder_path)  # Fills RAM to 80
pt.frame_to_hyper(table, "example.hyper", table="table")
Also since you are reading a directory of parquet files, you can avoid having to load them all into memory and read/write as a stream, which will greatly reduce your overall memory footprint. Something like this should work:
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pantab as pt
schema = pq.read_schema("folder_path/sample_file.parquet")  # assumes all schemas are identical
dataset = ds.dataset(folder_path, format="parquet")
batches = dataset.to_batches()
reader = pa.RecordBatchReader.from_batches(schema, batches)
pt.frame_to_hyper(reader, "example.hyper", table="table")
There may be a more idiomatic way to get the schema and assert it is consistent across all files than what I have done
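One such variant (a sketch on my part, not tested in the thread; it reuses folder_path from above): pyarrow datasets infer a unified schema themselves, so the manual read_schema call can be dropped:

import pyarrow.dataset as ds
import pantab as pt

dataset = ds.dataset(folder_path, format="parquet")
# The dataset infers a unified schema from the files; scanning raises if
# a file does not match, which doubles as the consistency check.
reader = dataset.scanner().to_reader()
pt.frame_to_hyper(reader, "example.hyper", table="table")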
I didn't make it clear, but I have lots of manipulation to do that I'd like polars for. I was just testing input/output for reading in and saving to a .hyper to see how viable it is right now.
It's not a pressing matter to get my project done very soon, so I can wait on pantab development about my open issues for now. I only needed to convert so that my polars frame would work with writing to a hyper. In a perfect world I could just read in my files, manipulate, and write to a hyper compatible with my version of tableau
I was able to get frame_to_hyper working for strings using duckdb:
import duckdb as db
import pantab as pt
df = db.read_csv(
    'stores.csv',
    delimiter=',',
    header=True,
    columns={
        'STORE_DESC': 'VARCHAR',
        'PERIOD_DESC': 'VARCHAR',
        'PRIMARY_DEPARTMENT': 'VARCHAR',
        'RECAP_DEPARTMENT': 'INT',
        'DEPARTMENT': 'INT',
    },
)
print(df)
pt.frame_to_hyper(df, 'stores.hyper', table='stores')
(python=3.9.6, Mac, pantab=5.0.0, duckdb=1.1.1)
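Building on that, a hedged sketch of my own (untested here; the frame contents are made up) of routing an existing polars frame through duckdb so pantab receives plain Arrow strings:

import duckdb as db
import pantab as pt
import polars as pl

df = pl.DataFrame({"STORE_DESC": ["a", "b"], "DEPARTMENT": [1, 2]})

# duckdb's replacement scans find the in-scope polars frame by name; the
# resulting relation exports standard Arrow strings over the PyCapsule
# interface, sidestepping the string_view problem.
rel = db.sql("SELECT * FROM df")
pt.frame_to_hyper(rel, "stores.hyper", table="stores")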
Describe the bug
frame_to_hyper cannot recognize my schema.

To Reproduce
Steps to reproduce the behavior: convert .csv files to parquet with a pl.Utf8 and pl.Float32 schema, read in the parquets, and write to hyper. Using pl.String instead still doesn't change anything.

Expected behavior
Writes my dataframe to a hyper file.