innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
BSD 3-Clause "New" or "Revised" License

RuntimeError: Could not init schema view from child schema 0: Error parsing schema->format: Unknown format: 'vu' #333

Closed: skyth540 closed this issue 1 month ago

skyth540 commented 1 month ago

Describe the bug: frame_to_hyper cannot recognize my schema

To Reproduce: convert .csv files to parquet with a pl.Utf8 and pl.Float32 schema, read the parquets back in, then write to a hyper:


import polars as pl
import pantab as pt

schema = {
    "STORE_DESC": pl.Utf8,
    "PERIOD_DESC": pl.Utf8,
    "PRIMARY_DEPARTMENT": pl.Utf8,
    "RECAP_DEPARTMENT": pl.Float32,
    "DEPARTMENT": pl.Float32,
}

df = pl.read_csv(csv_file_path, schema=schema)
df.write_parquet(output_parquet_file_path)
df = pl.read_parquet(output_parquet_file_path)
df  # returns the expected frame; datatypes are str and f32

pt.frame_to_hyper(df, path_to_hyper, table="test")

Using pl.String instead of pl.Utf8 still doesn't change anything (pl.Utf8 is just an alias of pl.String).

Expected behavior: writes my dataframe to a hyper file.


Complete error:

{
    "name": "RuntimeError",
    "message": "Could not init schema view from child schema 0: Error parsing schema->format: Unknown format: 'vu'",
    "stack": "---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 2
      1 path = path
----> 2 pt.frame_to_hyper(df, path, table = "test")

File c:\Users\nicho\anaconda3\Lib\site-packages\pantab\_writer.py:62, in frame_to_hyper(df, database, table, table_mode, not_null_columns, json_columns, geo_columns)
     51 def frame_to_hyper(
     52     df,
     53     database: Union[str, pathlib.Path],
   (...)
     59     geo_columns: Optional[set[str]] = None,
     60 ) -> None:
     61     \"\"\"See api.rst for documentation\"\"\"
---> 62     frames_to_hyper(
     63         {table: df},
     64         database,
     65         table_mode=table_mode,
     66         not_null_columns=not_null_columns,
     67         json_columns=json_columns,
     68         geo_columns=geo_columns,
     69     )

File c:\Users\nicho\anaconda3\Lib\site-packages\pantab\_writer.py:108, in frames_to_hyper(dict_of_frames, database, table_mode, not_null_columns, json_columns, geo_columns)
    101     return (table.schema_name.name.unescaped, table.name.unescaped)
    103 data = {
    104     convert_to_table_name(key): _get_capsule_from_obj(val)
    105     for key, val in dict_of_frames.items()
    106 }
--> 108 libpantab.write_to_hyper(
    109     data,
    110     path=str(tmp_db),
    111     table_mode=table_mode,
    112     not_null_columns=not_null_columns,
    113     json_columns=json_columns,
    114     geo_columns=geo_columns,
    115 )
    117 # In Python 3.9+ we can just pass the path object, but due to bpo 32689
    118 # and subsequent typeshed changes it is easier to just pass as str for now
    119 shutil.move(str(tmp_db), database)

RuntimeError: Could not init schema view from child schema 0: Error parsing schema->format: Unknown format: 'vu'"
}
WillAyd commented 1 month ago

Which version of polars are you using? 1.3 or later?

If so, the root cause would be https://github.com/innobi/pantab/issues/316, which needs some upstream changes before we can fix it in pantab.

If you are on a polars version earlier than 1.3, then I will have to take a closer look.

skyth540 commented 1 month ago

I am using the latest version of polars-lts-cpu: https://pypi.org/project/polars-lts-cpu/


WillAyd commented 1 month ago

Sounds good. Unfortunately, for the time being you will have to downgrade polars or use a different dataframe library.

I hope to have this solved relatively soon, but realistically it may take a few months (we need a new nanoarrow release to happen first!).

skyth540 commented 1 month ago

I downgraded to polars-lts-cpu version 1.2.1, and it looks like I'm getting the same error:


RuntimeError                              Traceback (most recent call last)
Cell In[5], line 3
      1 path = r"G:\...\test.hyper"
----> 3 pt.frame_to_hyper(df, path, table = 'test')

File c:\Users\nicho\anaconda3\Lib\site-packages\pantab\_writer.py:62, in frame_to_hyper(df, database, table, table_mode, not_null_columns, json_columns, geo_columns)
     51 def frame_to_hyper(
     52     df,
     53     database: Union[str, pathlib.Path],
   (...)
     59     geo_columns: Optional[set[str]] = None,
     60 ) -> None:
     61     """See api.rst for documentation"""
---> 62     frames_to_hyper(
     63         {table: df},
     64         database,
     65         table_mode=table_mode,
     66         not_null_columns=not_null_columns,
     67         json_columns=json_columns,
     68         geo_columns=geo_columns,
     69     )

File c:\Users\nicho\anaconda3\Lib\site-packages\pantab\_writer.py:108, in frames_to_hyper(dict_of_frames, database, table_mode, not_null_columns, json_columns, geo_columns)
    101     return (table.schema_name.name.unescaped, table.name.unescaped)
    ...
    117 # In Python 3.9+ we can just pass the path object, but due to bpo 32689
    118 # and subsequent typeshed changes it is easier to just pass as str for now
    119 shutil.move(str(tmp_db), database)

RuntimeError: Could not init schema view from child schema 0: Error parsing schema->format: Unknown format: 'vu'

WillAyd commented 1 month ago

Hmm, strange. Maybe it is something with the lts version of polars that was backported then? I am not sure how that versioning is handled relative to the normal releases.

Our CI is pinned to the normal 1.2 version:

https://github.com/innobi/pantab/blob/6dd61018beac3eef5763cdf6a0020d42d06a9401/pyproject.toml#L71

skyth540 commented 1 month ago

It looks like it has the same versions and release schedule as regular polars. I tried 1.0 and still no luck.


WillAyd commented 1 month ago

Hmm, strange. I'm not sure of the steps that lead to the issue across different versions then; that may be more of a question for the polars team. The root cause of the "issue" is that polars moved away from using Arrow strings and towards Hyper-style strings (which Arrow calls "String View"):

https://pola.rs/posts/polars-string-type/

We currently support the default Arrow strings, but not yet the "String View" type that polars uses, as it is not yet available in the Arrow C Data interface helper library that we use (i.e. nanoarrow).

If you care to follow that development upstream, it is happening in https://github.com/apache/arrow-nanoarrow/pull/596
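
In the meantime, one possible workaround (a sketch only, assuming pyarrow >= 16 with string_view support and enough memory for to_arrow(); none of this is pantab API) is to cast any string-view columns back to regular Arrow strings before writing:

import pyarrow as pa
import pantab as pt

# sketch: convert the polars frame to a pyarrow Table, then cast any
# string_view columns to standard Arrow strings before writing
tbl = df.to_arrow()
fields = [
    pa.field(f.name, pa.large_string()) if pa.types.is_string_view(f.type) else f
    for f in tbl.schema
]
tbl = tbl.cast(pa.schema(fields))
pt.frame_to_hyper(tbl, path_to_hyper, table="test")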

skyth540 commented 1 month ago

Any recommendations on how to stay in polars and make it work? My dataframe takes about 70 GB of my 100 GB to hold in memory, so I can't convert with to_arrow() directly. Should I change my data types?


WillAyd commented 1 month ago

I am totally guessing, but maybe you can use pyarrow to read the file and then convert that to a polars dataframe? Perhaps in that transition it will retain the old Arrow string layout? (At least, that is how the test suite currently works.)

Are you actually seeing a huge increase in memory usage when calling .to_arrow()? I assume that (outside of strings) many things are zero-copy. It's possible that the strings are the problem, but it's worth double checking.

One final alternative I can think of is to try another library like duckdb. While we don't give first-class support to reading into a duckdb table, it does use the Arrow PyCapsule interface in newer versions, so you should be able to write data from it.

I'm traveling this week but hope to take another pass at the nanoarrow root issue and get that moving upstream

WillAyd commented 1 month ago

One other option: the Arrow C Data stream interface does not require all of your data to be in memory. If you can stream into pantab, e.g. through a memory-mapped file, you can work with datasets that are in theory infinitely large. A rough sketch of that idea follows.
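
For instance (a sketch only; the Arrow IPC file and its name are assumptions, not from this thread), memory-mapping an IPC file lets record batches be paged in lazily instead of loaded all at once:

import pyarrow as pa
import pantab as pt

# sketch: stream a memory-mapped Arrow IPC file ("data.arrow" is an
# illustrative name) into pantab without materializing everything in RAM
with pa.memory_map("data.arrow", "r") as source:
    file_reader = pa.ipc.open_file(source)
    batches = (
        file_reader.get_batch(i) for i in range(file_reader.num_record_batches)
    )
    stream = pa.RecordBatchReader.from_batches(file_reader.schema, batches)
    pt.frame_to_hyper(stream, "example.hyper", table="test")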

skyth540 commented 1 month ago

> I am totally guessing, but maybe you can use pyarrow to read the file and then convert that to a polars dataframe? Perhaps in that transition it will retain the old Arrow string layout? (At least, that is how the test suite currently works.)
>
> Are you actually seeing a huge increase in memory usage when calling .to_arrow()? I assume that (outside of strings) many things are zero-copy. It's possible that the strings are the problem, but it's worth double checking.

Running the following:

import pyarrow.parquet as pq
import polars as pl

folder_path = r"G:\...\parquet_bak"

table = pq.read_table(folder_path)
df = pl.from_arrow(table)

fills 80 of my 100 GB of memory reading the arrow table, then fills the rest and crashes in from_arrow().

> One final alternative I can think of is to try another library like duckdb. While we don't give first-class support to reading into a duckdb table, it does use the Arrow PyCapsule interface in newer versions, so you should be able to write data from it.

Not an issue pertaining to pantab, but I can't get duckdb to work with polars-lts-cpu very well... so I am unable to try this last option for now.

WillAyd commented 1 month ago

Are you doing anything complicated in polars? Writing via pyarrow directly is also an option if you can skip polars altogether.

It seems like the cost of going from Arrow strings to hyper/umbra strings is the likely culprit for exhausting memory when converting to polars.

Ultimately the string view type isn't roundtrippable through the Hyper DB anyway, so you will end up back with standard Arrow strings on a subsequent read.
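
To illustrate the roundtrip point (a sketch; return_type follows pantab's documented reader API, and the file name is illustrative):

import pantab as pt

# sketch: whatever string layout was written, reading the .hyper back
# yields standard Arrow strings rather than string_view
tbl = pt.frame_from_hyper("example.hyper", table="test", return_type="pyarrow")
print(tbl.schema)  # string columns come back as plain Arrow strings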

skyth540 commented 1 month ago

It's just these 2 operations filling up the ram:

table = pq.read_table(folder_path)  # fills ram to 80
df = pl.from_arrow(table)           # fills up the rest and crashes

WillAyd commented 1 month ago

Unless you need polars for augmenting the table (it is unclear from your example whether you do), you can skip the conversion to polars altogether and just write the pyarrow table, which may help with your memory issue:

table = pq.read_table(folder_path) # Fills ram to 80
pt.frame_to_hyper(table, "example.hyper", table="table")

Also since you are reading a directory of parquet files, you can avoid having to load them all into memory and read/write as a stream, which will greatly reduce your overall memory footprint. Something like this should work:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pantab as pt

schema = pq.read_schema("folder_path/sample_file.parquet")  # assumes all schemas are identical
dataset = ds.dataset(folder_path, format="parquet")
batches = dataset.to_batches()
reader = pa.RecordBatchReader.from_batches(schema, batches)
pt.frame_to_hyper(reader, "example.hyper", table="table")

There may be a more idiomatic way to get the schema and assert it is consistent across all files than what I have done; one possibility is sketched below.
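
For example (a sketch, not from the thread: it relies on pyarrow.dataset computing a unified schema across fragments and on Scanner.to_reader() streaming batches lazily):

import pyarrow.dataset as ds
import pantab as pt

# sketch: let pyarrow.dataset unify the schema across all parquet files
dataset = ds.dataset(folder_path, format="parquet")
print(dataset.schema)  # unified schema inferred from every fragment

# Scanner.to_reader() returns a RecordBatchReader that streams lazily
reader = dataset.scanner().to_reader()
pt.frame_to_hyper(reader, "example.hyper", table="table")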

skyth540 commented 1 month ago

I didn't make it clear, but I have lots of manipulation to do that I'd like polars for. I was just testing input/output, reading in and saving to a .hyper, to see how viable it is right now.

It's not pressing to get my project done very soon, so I can wait on pantab development for my open issues. I only needed to convert so that my polars frame would work for writing to a hyper. In a perfect world I could just read in my files, manipulate them, and write to a hyper compatible with my version of Tableau.

MDP001 commented 1 month ago

I was able to get frame_to_hyper working for strings using duckdb:

import duckdb as db
import pantab as pt

df = db.read_csv('stores.csv', delimiter=',', header=True, columns=
    {
        'STORE_DESC': 'VARCHAR', 
        'PERIOD_DESC': 'VARCHAR', 
        'PRIMARY_DEPARTMENT': 'VARCHAR', 
        'RECAP_DEPARTMENT': 'INT', 
        'DEPARTMENT': 'INT'
    })
print(df)
pt.frame_to_hyper(df, 'stores.hyper', table = 'stores')

(python=3.9.6, Mac, pantab=5.0.0, duckdb=1.1.1)
