Closed: tucheck24 closed this issue 7 months ago.
Thanks for the report. Reading through the documentation, Option 2 seems the most straightforward. That section suggests a model can be generated from the keys of the Hyper file, which I assume means it could support multiple tables in a Hyper file... but then the footnote at the bottom says Option 2 explicitly does not work with multiple tables.
Do you have the same read on that as I do? If so, we might want to reach out to Tableau and ask for clarification.
I am also a bit confused: in Option 2 they mention multiple tables, but then in the footnote they say it only works for simple tables. Did you have a look at their multi-table example (https://github.com/tableau/hyper-api-samples/tree/main/Community-Supported/publish-multi-table-hyper)?
It would be great to integrate that feature into pantab! Has no one else here in the forum had the need to publish multiple tables? For us it's quite essential, especially because Tableau doesn't allow joining several independent server data sources (which would have been a workaround: upload every dataframe individually and then join them on the server or client).
From the example you linked it looks like the TSC should just be able to publish the hyper file.
I'm somewhat neutral on adding this into pantab. On the one hand, I think it would be nice to have something like pantab.frame_to_hyper(<tableau_uri>). On the other hand, it seems like TSC already covers this. I don't know if there is a good open source solution for testing that things are actually working, so I might be slightly inclined to let Tableau continue to manage that in TSC.
@vogelsgesang @jonas-eckhardt
Is this the only code block from their example needed to define the data model for Tableau?
```python
connection.execute_command(f'ALTER TABLE {customer_table.table_name} ADD ASSUMED PRIMARY KEY ("Customer ID")')
connection.execute_command(f'''ALTER TABLE {orders_table.table_name} ADD ASSUMED FOREIGN KEY ("Customer ID") REFERENCES {customer_table.table_name} ( "Customer ID" )''')
```
And does that mean we would have to bypass pantab and use the TSC (with many additional lines of code) for everything regarding multiple tables, or can we add the data structure later on with the code snippet above?
What I love about pantab is that it reduces the code to just one or two lines, wrapping all those low-level APIs in the background.
TSC and pantab do two different things. You wouldn't need to stop using pantab to use the former. Pantab creates the hyper file for you, TSC can upload it to the tableau server.
This is essentially what the example you linked does. It has separate functions for creating (L133) versus uploading (L134) the hyper file.
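If it helps, the TSC side is only a few lines. A rough sketch (untested here; the server URL, token name/value, site, and project id are placeholders you would fill in):

```python
import tableauserverclient as TSC

# placeholders -- substitute your own server, credentials, and project
auth = TSC.PersonalAccessTokenAuth("my-token-name", "my-token-value", site_id="my-site")
server = TSC.Server("https://my-tableau-server", use_server_version=True)

with server.auth.sign_in(auth):
    datasource = TSC.DatasourceItem(project_id="my-project-id")
    # publish the hyper file pantab created; Overwrite replaces an existing datasource
    server.datasources.publish(datasource, "example.hyper", TSC.Server.PublishMode.Overwrite)
```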
Thanks a lot for your quick responses! Let me know how your discussion turns out (if the process can somehow be supported or simplified using pantab).
Option 2 from https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_publish.html is the way to go to publish multi-table files. Our footnote is outdated, and we should fix it (@jonas-eckhardt FYI)
You can still use pantab to insert the data, but for the addition of your foreign keys, you would use HyperAPI directly. To do so, you need to get your own HyperAPI connection. See https://pantab.readthedocs.io/en/latest/examples.html#providing-your-own-hyper-connection for an example of how to do this.
Adjusted to your use case, this would look something like this:

```python
import pandas as pd
import pantab
from tableauhyperapi import Connection, CreateMode, HyperProcess, Telemetry

path = "example.hyper"

with HyperProcess(Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    # CREATE_AND_REPLACE so the connection creates a fresh file for pantab to write into
    with Connection(hyper.endpoint, path, CreateMode.CREATE_AND_REPLACE) as connection:
        data = {
            "customers": pd.DataFrame(
                [[1, "Will"], [2, "Adrian"], [3, "Jonas"]], columns=["id", "name"]
            ),
            "orders": pd.DataFrame(
                [[1, 32], [2, 1], [2, 12], [1, 43], [3, 21]],
                columns=["customer_id", "price"],
            ),
        }
        # write both frames through the existing connection
        pantab.frames_to_hyper(data, connection)
        # the assumed keys define the data model Tableau Server expects
        connection.execute_command(
            'ALTER TABLE "customers" ADD ASSUMED PRIMARY KEY ("id")'
        )
        connection.execute_command(
            'ALTER TABLE "orders" ADD ASSUMED FOREIGN KEY ("customer_id") '
            'REFERENCES "customers" ("id")'
        )
```
Thanks a lot!
In order to add a PK constraint to one of the tables in the .hyper file, the PK column(s) must be NOT NULLABLE. In some cases, users need to use a timestamp column as a PK column. The real problem arises here, because the timestamp dtype is always converted to a NULLABLE column when using pantab. Thus, adding a PK constraint fails, which makes publishing a multi-table .hyper file as a datasource fail as well.
It would be nice if pantab allowed users to customize the table definition of a .hyper file it creates. Maybe pass it as a dict (like the dict_of_frames parameter) when calling frames_to_hyper()? I'm not sure what the best approach is, but this needs to be addressed to fully utilize frames_to_hyper() and multi-table .hyper file publishing.
I think there is some overlap between this request and https://github.com/innobi/pantab/issues/138 . pantab provides a simple mapping of dtypes from pandas to tableauhyperapi, and since there is no non-nullable datetime in pandas we don't get the same on write.
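To illustrate the pandas side (a quick standalone check, not from the thread): every datetime64 column can hold NaT, so there is nothing non-nullable to map to:

```python
import pandas as pd

# NaT is representable in any datetime64 column, so the dtype is inherently nullable
s = pd.Series(pd.to_datetime(["2024-01-01", None]))
print(s.dtype)         # datetime64[ns]
print(s.isna().any())  # True -- the missing value became NaT
```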
I think a PR to add something like schema_overrides could make sense.
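Purely as a sketch of what such an argument could look like (hypothetical; no schema_overrides argument exists in pantab as of this discussion):

```python
import pantab

# hypothetical sketch -- schema_overrides is a proposed argument, not real pantab API
pantab.frames_to_hyper(
    dict_of_frames,  # your existing dict of table name -> DataFrame
    "example.hyper",
    schema_overrides={
        # per table, per column overrides, e.g. forcing NOT NULL for a PK candidate
        "orders": {"order_ts": {"nullable": False}},
    },
)
```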
Correct, there's no way to use a timestamp column as a PK column because there is no non-nullable datetime in pandas. I was thinking of converting the timestamp column to a string column, but that has the same issue.
In any case, something like schema_overrides would be very useful to achieve multi-table .hyper file publishing.
Great to hear - would you be interested in contributing that functionality?
Sure. I will start putting things together!
Hi @vogelsgesang, I am facing a similar problem (combining several pandas DataFrames in one .hyper file and trying to publish that to Tableau Cloud). I tried to replicate your solution above, but execute_command does not recognize the table name correctly:
```python
import pantab
from tableauhyperapi import Connection, CreateMode, HyperProcess, TableName, Telemetry

# df_order, df_store, df_customer are my DataFrames
with HyperProcess(Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    data = {
        TableName("orders"): df_order,
        TableName("stores"): df_store,
        TableName("customers"): df_customer,
    }
    pantab.frames_to_hyper(data, database="MyHyper.hyper", hyper_process=hyper)

    with Connection(
        endpoint=hyper.endpoint,
        database="MyHyper.hyper",
        create_mode=CreateMode.CREATE_AND_REPLACE,
    ) as connection:
        connection.execute_command('ALTER TABLE "orders" ADD ASSUMED PRIMARY KEY ("id")')
```
That gives me `HyperException: table "orders" does not exist: line 1, column 13: ALTER TABLE "orders" ADD ASSUMED PRIMARY KEY ("id")`
What am I doing wrong?
Hi @haarburger,
The problem is this line:

```python
with Connection(endpoint=hyper.endpoint, database='MyHyper.hyper', create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
```

Because of CreateMode.CREATE_AND_REPLACE, it instructs Hyper to delete the file if it already exists, so the tables pantab just wrote are gone by the time execute_command runs. The correct choice is CreateMode.NONE, which means: don't try to create the file; just open it, and raise an error if it doesn't exist.
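With that change, the second block from your snippet would read:

```python
# CreateMode.NONE opens the file pantab just wrote instead of deleting it first
with Connection(
    endpoint=hyper.endpoint,
    database="MyHyper.hyper",
    create_mode=CreateMode.NONE,
) as connection:
    connection.execute_command('ALTER TABLE "orders" ADD ASSUMED PRIMARY KEY ("id")')
```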
Thanks a lot!
> In order to add a PK constraint to one of the tables in the .hyper file, the PK column(s) must be NOT NULLABLE.
With pantab 4.0 all columns are nullable by default (as is the case with Arrow), but we now have a not_null_columns argument in frame_to_hyper that allows you to override this behavior.
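For example (a minimal sketch; the frame, table name, and columns here are made up):

```python
import pandas as pd
import pantab

df = pd.DataFrame(
    {"id": [1, 2], "ts": pd.to_datetime(["2024-01-01", "2024-01-02"])}
)

# mark both columns NOT NULL so they can later back a primary key constraint
pantab.frame_to_hyper(
    df, "example.hyper", table="events", not_null_columns={"id", "ts"}
)
```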
I think this issue is closeable given the 4.0 feature described above, but ping back if you disagree and we can reopen.
Thanks, I will try out pantab 4.0 then :-)
pantab supports saving multiple DataFrames into one .hyper file. That works, and the file can be opened by my local Tableau Desktop.
But if I want to publish the data source to my Tableau Server (via tableau_api_lib.TableauServerConnection.publish_data_source()), I receive an error message (Code 400011). (My server version is 2022.3, possibly together with an older Tableau Bridge version.)
After googling, it seems that Tableau Server needs the data structure (e.g. primary and foreign keys) defined, for example via the Hyper API in Python or via an XML file that gets zipped together with the data itself before publishing.
Question/Feature Request: Can that be wrapped by pantab? E.g. by optional arguments to the frames_to_hyper method with a mapping of columns to primary and foreign keys; see the sketch after the links below.
https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_publish.html
Example Python Script: https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/publish-multi-table-hyper/publish-multi-table-hyper.py
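To make the request concrete, a purely hypothetical call could look like the sketch below; none of these arguments exist in pantab, and the names are made up only to illustrate the shape of the feature:

```python
import pantab

# hypothetical sketch -- primary_keys/foreign_keys are NOT real pantab arguments
pantab.frames_to_hyper(
    {"customers": df_customers, "orders": df_orders},  # df_* are your DataFrames
    "example.hyper",
    primary_keys={"customers": ["id"]},
    # child table -> list of (child column, parent table, parent column)
    foreign_keys={"orders": [("customer_id", "customers", "id")]},
)
```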