tableau / TabPy

Execute Python code on the fly and display results in Tableau visualizations:
https://tableau.github.io/TabPy/
MIT License
1.56k stars 598 forks source link

ATTR([dimmension]) returning None #486

Closed cwright98 closed 3 years ago

cwright98 commented 3 years ago

Environment information:

Describe the issue I have a Python script that returns the number of null values from a list of columns, for each row in a dataset. I am not able to query the client/endpoint, although the code runs fine when I run it in JupyterLab, and returns a list (maybe a series?) as expected. When I query the client in JupyterLab, it throws an error saying that a Series is not JSON serializable.

Additionally, when I try to make the calculated field in Tableau, it returns an empty list. I think this has to do with the fact that I am using ATTR([Dimmension]). When I use COUNT() or MIN() or MAX(), it at least returns a number, but that isn't helpful for what I am trying to accomplish. It appears the query is running as expected in a sense, because it outputs a single integer, 5, which is the expected output of the function I am querying.

I tried editing the table calculation to compute using the cell rather than the table- I saw that this worked for another user. It did not work, but it is possible it didn't save the change properly, I couldn't really tell, as the only option it gave me was to exit out after I changed it, rather than letting me click a "save" or "apply" button.

To Reproduce Steps to reproduce the behavior: what commands to run, what files to modify, where to look for an error.

Here is the code from JupyterLab:

from tabpy.tabpy_tools.client import Client import pandas as pd import numpy as np client = Client('http://localhost:9004/')

def get_null_sum(fico_column, flips_last_3_years_column, flips_last_24_months_column, new_construction_column, num_rental_units_owned_column): print('---'*40) print(type(fico_column)) print(fico_column) import pandas as pd df=pd.DataFrame({'Fico' : fico_column, 'flips_last_3_years': flips_last_3_years_column, 'last_24_months': flips_last_24_months_column, 'new_construction' : new_construction_column, 'num_rentals_owned' : num_rental_units_owned_column}) print(df.head()) null_counts=df.iloc[::, 1:5].isnull().sum(axis=1).tolist() return null_counts

client.deploy('get_null_sum', get_null_sum, 'Counts number of null values in row', override=True)

client.query('get_null_sum', fico_column, flips_last_3_years_column, flips_last_24_months_column, new_construction_column, num_rental_units_owned_column)['response']

Expected behavior

A clear and concise description of what you expected to happen.

I was hoping to use this script to return a column in Tableau that has the number, as an integer, of the count of Null values listed in the args. Ie if only two of them are Null, it would return 2, if all of them are Null, it would return 5.

Screenshots

Ouput from my Gitbash terminal: TabPy output

setup/deployment in JupyterLab: JupyterLab TabPy

Zoomed in on the error message in particular: Jupyter Lab TabPy error message

Additional context Add any other context about the problem here.

FYI I have all of the arguments defined in JupyterLab. (Fico_column, flips_last_3_years_column...etc)

Please let me know if there is anything I can clarify or further information I can provide that would be helpful.

Thanks!

cwright98 commented 3 years ago

If we could start a dialogue on this before the weekend starts, that would be fantastic, but I understand if the contributors are busy! I have done a fair amount of research on here but have been unable to successfully implement any solutions for similar problems by past users.

cwright98 commented 3 years ago

Adding onto this- I was able to successfully query the client in JupyterLab. I had some variables stored as a series, but I think it wanted a list instead. I added “.tolist()” to the end of fico_column=df[‘fico_column’]

cwright98 commented 3 years ago

@nmannheimer Do you think you'd have time to look into this today? If not, would you be willing to provide a rough estimate of when you think you or someone else could? Thank you and have a nice day!

nmannheimer commented 3 years ago

Hey @cwright98 for your first issue, you're correct: TabPy needs results returned as a Python list type, so other types like a numpy array or pandas series will throw an issue, but as you've seen it's easy to convert those to a list.

Regarding aggregations in Tableau, ATTR() can be used as an aggregation for dimensions, however, it will return a when there are multiple different values being aggregated. You can think of it as essentially testing: IF MIN(X) == MAX(X) return X ELSE . Computing for each cell would run the calculation separately, which I think is not what you want here, since I think you want to see all the marks together and count the nulls? I think the trick here is to create a visualization in Tableau that is at your expected level of detail for the function, essentially getting to the point where MIN() and ATTR() return the same result. You'll want to make sure your table calculation is addressing all dimensions, ie all boxes are checked if you manually compute the table calc.

cwright98 commented 3 years ago

Thank you! I moved all of the relevant dimensions to the detail box, and that allowed me to include those in the table calculation. I might need to raise a separate question thread for this, but I noticed in the command prompt I'm seeing what looks like my function getting called over and over for each row. Is this expected? I thought of Tableau as sending all of the data at once, performing the relevant calculations etc, then sending it back. Is that not the case?

nmannheimer commented 3 years ago

Hey @cwright98, you should be able to resolve this by going right clicking on the calc, selecting edit table calc.. and then selecting Specific Dimensions, then checking the boxes next to each dimension you've included. This will cause the calc to address everything on the view instead of partitioning on each combination of dimensions. The calc will make a separate call for each partition, so if that is each mark it can lead to many calls being made. You'll just want to make sure your code is designed to work with many rows of data being sent at once.

cwright98 commented 3 years ago

@nmannheimer I did what you said- but am still having issues with it in some instances.

If there are dimmensions being used in the dashboard, but not in use as part of my arguments for the function call, do they still need to be "ticked" on the Specific Dimmensions selection? If I don't, will TabPy still make a separate call for each partition?

There are also other filters in use on the dashboard, would that affect it?

My code is definitely designed to take a list as an argument. The calculated field gets computed very quickly when I have a test worksheet with only the dimmensions being used as arguments.

0golovatyi commented 3 years ago

@cwright98 Some additional information - http://tabscifi.com/2020/12/why-calculations-for-my-python-r-script_x-field-are-per-row/

cwright98 commented 3 years ago

Thank you! That link was helpful. Explaining resolved issue for anyone who comes across this:

I was using the dimension on Rows, and as a filter. In the Edit Table Calculation interface, I was selecting all of the relevant fields, but only for the dimension on Rows. After doing the exact same process for the same dimension on the Filter, it started working as I had hoped. Not sure why, but it appears the changes do not automatically apply across areas (Rows vs Filters in this case). Thanks again.