innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
BSD 3-Clause "New" or "Revised" License

Enhance support for multiple tables (publishing to tableau server) #173

Closed: tucheck24 closed this issue 7 months ago

tucheck24 commented 2 years ago

Pantab supports saving multiple DataFrames into one Hyper file. That works, and the resulting file opens fine in my local Tableau Desktop.

But if I want to publish the data source to my Tableau Server (via tableau_api_lib.TableauServerConnection.publish_data_source()), I receive an error message (Code 400011). (My server version is 2022.3, possibly combined with an older Tableau Bridge version.)

After googling, it seems that Tableau Server needs the data structure (e.g. primary and foreign keys) to be defined, for example via the Hyper API in Python or via an XML file that gets zipped together with the data itself before publishing.

Question/Feature Request: Could pantab wrap this, e.g. via optional arguments to the frames_to_hyper method that map columns to primary and foreign keys?

https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_publish.html

Example Python Script: https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/publish-multi-table-hyper/publish-multi-table-hyper.py
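
To make the request concrete, here is a purely hypothetical sketch of what such optional arguments could look like (primary_keys and foreign_keys are not real pantab arguments, just illustrations of the idea):

import pandas as pd
import pantab

data = {
    "customers": pd.DataFrame({"id": [1, 2], "name": ["A", "B"]}),
    "orders": pd.DataFrame({"customer_id": [1, 2], "price": [10, 20]}),
}

# Hypothetical arguments, not part of the pantab API; shown only to illustrate the request
pantab.frames_to_hyper(
    data,
    "example.hyper",
    primary_keys={"customers": ["id"]},
    foreign_keys={"orders": [("customer_id", "customers", "id")]},
)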

WillAyd commented 2 years ago

Thanks for the report. Reading through the documentation, Option 2 would seem the most straightforward. That section in particular suggests a model can be generated from the keys of the Hyper file, which I assume means it could support multiple tables in a Hyper file... but then the footnote at the bottom says Option 2 explicitly does not work with multiple tables.

Do you read that the same way I do? If so, we might want to reach out to Tableau and ask for clarification.

tucheck24 commented 2 years ago

I am also a bit confused: they mention multiple tables in Option 2, but the footnote then says it only works for single tables. Did you have a look at their multi-table example (https://github.com/tableau/hyper-api-samples/tree/main/Community-Supported/publish-multi-table-hyper)?

It would be great to integrate that feature into pantab! Has no one else here needed to publish multiple tables? For us it's quite essential, especially because Tableau doesn't allow joining several independent server data sources (which would have been a workaround: upload every DataFrame individually and then join them on the server or client).

WillAyd commented 2 years ago

From the example you linked, it looks like TSC should just be able to publish the Hyper file.

I'm somewhat neutral on adding this to pantab. On the one hand, it would be nice to have something like pantab.frame_to_hyper(<tableau_uri>). On the other hand, TSC already covers this, and I don't know of a good open source way to test that such an integration actually works, so I'm slightly inclined to let Tableau continue to manage that in TSC.

@vogelsgesang @jonas-eckhardt

tucheck24 commented 2 years ago

Is this the only code from their example that is needed to define the data model for Tableau?

connection.execute_command(f'ALTER TABLE {customer_table.table_name} ADD ASSUMED PRIMARY KEY ("Customer ID")')
connection.execute_command(f'ALTER TABLE {orders_table.table_name} ADD ASSUMED FOREIGN KEY ("Customer ID") REFERENCES {customer_table.table_name} ("Customer ID")')

And does that mean we would have to bypass pantab and use the TSC (with many additional lines of code) for everything involving multiple tables, or can we add the data structure afterwards with the code snippet above?

What I love about pantab is that it reduces the code to just one or two lines, wrapping all those low-level APIs in the background.

WillAyd commented 2 years ago

TSC and pantab do two different things; you wouldn't need to stop using pantab to use the former. pantab creates the Hyper file for you, and TSC can upload it to the Tableau Server.

This is essentially what the example you linked does: it has separate functions for creating (L133) versus uploading (L134) the Hyper file.

https://github.com/tableau/hyper-api-samples/blob/25d19c53f0d2d101dcba31153d85b5cc69f33768/Community-Supported/publish-multi-table-hyper/publish-multi-table-hyper.py#L134
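
Put together, a minimal sketch of that two-step flow might look like this (TSC is the tableauserverclient package; server URL, token values, and project ID are placeholders):

import pandas as pd
import pantab
import tableauserverclient as TSC

# Step 1: pantab builds the multi-table Hyper file locally
data = {
    "customers": pd.DataFrame({"id": [1, 2], "name": ["A", "B"]}),
    "orders": pd.DataFrame({"customer_id": [1, 2], "price": [10, 20]}),
}
pantab.frames_to_hyper(data, "example.hyper")

# Step 2: TSC publishes the file to Tableau Server
auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="my-site")
server = TSC.Server("https://my-tableau-server.example.com", use_server_version=True)
with server.auth.sign_in(auth):
    datasource = TSC.DatasourceItem(project_id="my-project-id")
    server.datasources.publish(datasource, "example.hyper", "CreateNew")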

tucheck24 commented 2 years ago

Thanks a lot for your quick responses! Let me know how your discussion turns out (if the process can somehow be supported or simplified using pantab).

vogelsgesang commented 2 years ago

Option 2 from https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_publish.html is the way to go to publish multi-table files. Our footnote is outdated, and we should fix it (@jonas-eckhardt FYI)

You can still use pantab to insert the data, but for the addition of your foreign keys, you would use HyperAPI directly. To do so, you need to get your own HyperAPI connection. See https://pantab.readthedocs.io/en/latest/examples.html#providing-your-own-hyper-connection for an example of how to do this.

Adjusted to your use case, it would look something like this:

import pandas as pd
import pantab
from tableauhyperapi import HyperProcess, Telemetry, Connection, CreateMode

path = "example.hyper"

with HyperProcess(Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    data = {
        "customers": pd.DataFrame([[1, "Will"], [2, "Adrian"], [3, "Jonas"]], columns=["id", "name"]),
        "orders": pd.DataFrame([[1, 32], [2, 1], [2, 12], [1, 43], [3, 21]], columns=["customer_id", "price"]),
    }
    # Write both frames into the Hyper file, reusing the already-running Hyper process
    pantab.frames_to_hyper(data, path, hyper_process=hyper)
    # Open the existing file (CreateMode.NONE) and add the key constraints
    with Connection(hyper.endpoint, path, CreateMode.NONE) as connection:
        connection.execute_command('ALTER TABLE "customers" ADD ASSUMED PRIMARY KEY ("id")')
        connection.execute_command('ALTER TABLE "orders" ADD ASSUMED FOREIGN KEY ("customer_id") REFERENCES "customers" ("id")')

tucheck24 commented 1 year ago

Thanks a lot!

spark131008 commented 1 year ago

In order to add a PK constraint to one of the tables in the .hyper file, the PK column(s) must be NOT NULLABLE. In some cases, users need to use a timestamp column as a PK column. The real problem arises here because the timestamp dtype is always written as a NULLABLE column when using pantab. Thus, adding a PK constraint fails, which makes publishing a multi-table .hyper file as a data source fail as well.

It would be nice if pantab allowed users to customize the table definition of the .hyper file it creates, maybe via an extra parameter alongside dict_of_frames when calling frames_to_hyper(). I'm not sure what the best approach is, but this needs to be addressed to fully utilize frames_to_hyper() and multi-table .hyper file publishing.

WillAyd commented 1 year ago

I think there is some overlap between this request and https://github.com/innobi/pantab/issues/138. pantab provides a simple mapping of dtypes from pandas to tableauhyperapi, and since there is no non-nullable datetime in pandas, we don't get one on write.

I think a PR to add something like schema_overrides could make sense
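
For illustration, a hypothetical sketch of how a schema_overrides argument could be used (the argument and the shape of its mapping are invented, not part of pantab):

import pandas as pd
import pantab

df = pd.DataFrame({"ts": pd.to_datetime(["2023-03-07"]), "value": [1.0]})

# Hypothetical argument: mark "ts" as non-nullable so a PK constraint can be added later
pantab.frame_to_hyper(
    df,
    "example.hyper",
    table="events",
    schema_overrides={"ts": {"nullable": False}},
)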

spark131008 commented 1 year ago

Correct, there's no way to use a timestamp column as a PK column because pandas has no non-nullable datetime. I considered converting the timestamp column to a string column, but that hits the same issue.

In any case, a schema_overrides option would be very useful for achieving multi-table .hyper file publishing.

WillAyd commented 1 year ago

Great to hear - would you be interested in contributing that functionality?

spark131008 commented 1 year ago

Sure. I will start putting things together!

haarburger commented 1 year ago

Hi @vogelsgesang, I am facing a similar problem (combining several pandas DataFrames in one .hyper file and trying to publish that to Tableau Cloud). I tried to replicate your solution above, but execute_command does not recognize the table name correctly:


import pantab
from tableauhyperapi import HyperProcess, Telemetry, Connection, CreateMode, TableName

# df_order, df_store and df_customer are DataFrames defined earlier
with HyperProcess(Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    data = {
        TableName('orders'): df_order,
        TableName('stores'): df_store,
        TableName('customers'): df_customer,
    }
    pantab.frames_to_hyper(data, database='MyHyper.hyper', hyper_process=hyper)
    with Connection(endpoint=hyper.endpoint, database='MyHyper.hyper', create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        connection.execute_command('ALTER TABLE "orders" ADD ASSUMED PRIMARY KEY ("id")')

Gives me HyperException: table "orders" does not exist: line 1, column 13: ALTER TABLE "orders" ADD ASSUMED PRIMARY KEY ("id")

What am I doing wrong?

vogelsgesang commented 1 year ago

Hi @haarburger,

The problem is this line:

with Connection(endpoint=hyper.endpoint, database='MyHyper.hyper', create_mode=CreateMode.CREATE_AND_REPLACE) as connection:

Due to CreateMode.CREATE_AND_REPLACE, this line instructs Hyper to delete the file if it already exists. The correct choice is CreateMode.NONE, which means: don't try to create the file; just open it, and raise an error if it doesn't exist.
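
Concretely, the Connection line from the snippet above would become:

with Connection(endpoint=hyper.endpoint, database='MyHyper.hyper', create_mode=CreateMode.NONE) as connection:
    connection.execute_command('ALTER TABLE "orders" ADD ASSUMED PRIMARY KEY ("id")')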

haarburger commented 1 year ago

Thanks a lot!

WillAyd commented 7 months ago

> In order to add a PK constraint to one of the tables in the .hyper file, the PK column(s) must be NOT NULLABLE.

With pantab 4.0 all columns are nullable by default (as is the case with Arrow), but there is now a not_null_columns argument in frame_to_hyper that lets you override this behavior.
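
For example, a minimal sketch (assuming not_null_columns accepts a set of column names):

import pandas as pd
import pantab

df = pd.DataFrame({"id": [1, 2, 3], "name": ["Will", "Adrian", "Jonas"]})

# Write "id" as NOT NULL so an assumed primary key can be added afterwards
pantab.frame_to_hyper(df, "example.hyper", table="customers", not_null_columns={"id"})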

WillAyd commented 7 months ago

I think this issue is closeable given the 4.0 feature described above, but ping back if you disagree and we can reopen.

tucheck24 commented 7 months ago

Thanks, I will try out pantab 4.0 then :-)