innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
BSD 3-Clause "New" or "Revised" License

frame_from_hyper causing python app crash when reading .hyper extracts from tableau server #77

Closed chillerno1 closed 4 years ago

chillerno1 commented 4 years ago

Looking for some assistance debugging the following issue.

Describe the bug

I've got multiple .hyper files that have been downloaded from a Tableau Server using TSC. When using pantab.frame_from_hyper to read any of them in, the python application crashes.

Trace shows this as final exec before crash: _reader.py(45): df = pd.DataFrame(libreader.read_hyper_query(address, query, dtype_strs))

To Reproduce

Unfortunately I can't share the datasources this is occurring with, I'm happy to debug locally and provide as much detail as possible. This is a snippet of what I'm running.

import pantab
from tableauhyperapi import TableName

hyper = 'Data/Extracts/Job _Datasource.hyper'

df = pantab.frame_from_hyper(hyper, table=TableName("Extract", "Extract"))

Expected behavior

The dataframe should be read in without crashing the application.

Screenshots image

Desktop (please complete the following information):

Additional context

From what I've tested so far, .hyper files created using pantab are not affected. They can be read and do not cause the application to crash (even if I set a schema of TableName("Extract", "Extract")).

This only occurs when using pantab and I'm currently using tableauhyperapi as a temp workaround without issue:

import pandas as pd
from tableauhyperapi import HyperProcess, Telemetry, Connection, TableName

def tabapi_frame_from_hyper(db):

    table = TableName("Extract", "Extract")

    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(endpoint=hyper.endpoint, database=db) as connection:

            table_definition = connection.catalog.get_table_definition(table)

            with connection.execute_query(query=f"SELECT * FROM {table}") as result:

                rows = list(result)  
                columns = [column.name.unescaped for column in table_definition.columns]

                return pd.DataFrame(rows, columns=columns)

df = tabapi_frame_from_hyper(db="Data/Extracts/Job _Datasource.hyper")

WillAyd commented 4 years ago

Can you try master? I wonder if this is resolved by #76

WillAyd commented 4 years ago

Though maybe not if this was an issue in 0.1 . There was no C extension for reading in 0.1 so surprised it would just crash without a traceback of any kind
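When a crash inside a C extension produces no Python traceback, the standard-library faulthandler module can sometimes recover one. The following is a minimal sketch, not something used in this thread: the helper name enable_crash_traceback and the log file name are assumptions made for illustration.

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(log_path):
    """Ask Python to dump the Python-level stack to log_path on a fatal signal."""
    # The file must stay open for as long as the handler is active.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location, chosen only for this example.
log_path = os.path.join(tempfile.gettempdir(), "pantab_crash.log")
log_file = enable_crash_traceback(log_path)

# The crashing call would go here, for example the frame_from_hyper call
# from the report. If the interpreter segfaults, the log should show which
# Python line was executing at the time.
```

Running the script with python -X faulthandler has a similar effect without code changes, writing the dump to stderr.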

chillerno1 commented 4 years ago

Good point, 0.1.0 is throwing this error and not crashing.

Traceback (most recent call last):
  File ".\pantaber.py", line 6, in <module>
    df = pantab.frame_from_hyper(hyper, table=TableName("Extract", "Extract"))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\pantab\_reader.py", line 71, in frame_from_hyper
    return _read_table(connection=connection, table=table)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\pantab\_reader.py", line 43, in _read_table
    df = pd.DataFrame(result)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 422, in __init__
    raise ValueError('DataFrame constructor not properly called!')

Will test the latest version and report back.

WillAyd commented 4 years ago

Hmm ok. Can you see what is getting passed to the DataFrame constructor? (The result variable).

It should be a Hyper Result object but might not be in your particular case given the error

chillerno1 commented 4 years ago

Definitely getting a hyper result -> <tableauhyperapi.result.Result object at 0x000000000B63DE48>

WillAyd commented 4 years ago

Does list(result) work?

chillerno1 commented 4 years ago

0.1.0 -> does list(result) work? Data type error, maybe this is where the reader is getting tripped up?

The column it's failing on contains a combination of datetime and null values.

Traceback (most recent call last):
  File "pantaber.py", line 6, in <module>
    df = pantab.frame_from_hyper(hyper, table=TableName("Extract", "Extract"))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\pantab\_reader.py", line 71, in frame_from_hyper
    return _read_table(connection=connection, table=table)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\pantab\_reader.py", line 50, in _read_table
    df[key] = df[key].apply(lambda x: x._to_datetime())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\pantab\_reader.py", line 50, in <lambda>
    df[key] = df[key].apply(lambda x: x._to_datetime())
AttributeError: 'NoneType' object has no attribute '_to_datetime'
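The AttributeError above is what happens when a conversion is applied to every cell of a column that also contains NULLs. A minimal sketch of a null-safe version follows. FakeHyperTimestamp is a hypothetical stand-in for the Hyper timestamp objects (only its _to_datetime method is taken from the traceback), and to_datetime_or_nat is an illustrative helper, not pantab's actual fix.

```python
import datetime as dt

import pandas as pd


class FakeHyperTimestamp:
    """Hypothetical stand-in for the objects that expose _to_datetime()."""

    def __init__(self, value):
        self._value = value

    def _to_datetime(self):
        return self._value


def to_datetime_or_nat(cell):
    # NULL cells arrive as None, so map them to NaT instead of calling a method on them.
    return pd.NaT if cell is None else cell._to_datetime()


col = pd.Series([FakeHyperTimestamp(dt.datetime(2020, 1, 1)), None], dtype=object)
converted = pd.to_datetime(col.apply(to_datetime_or_nat))
```

The unguarded lambda from the traceback would fail on the second element of col; with the guard, that element becomes NaT.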

Few extra notes

  1. Built latest master and installed -> crash still occurring.
  2. Tested on a second clean VM just to make sure it wasn't env specific -> no dice.
  3. Tested with fresh Python 3.7 install -> no dice.

WillAyd commented 4 years ago

@chillerno1 can you provide a reproducible example somehow?

mathphysmx commented 4 years ago

It happened to me as well when trying to use it on Ubuntu 18.04 on AWS EC2. The examples in tableauhyperapi work without any issues. I tried to read the mydb.hyper generated using the code in tableauhyperapi and I get the following:

>>> import pantab
>>> pantab.frame_from_hyper("mydb.hyper", table="foo")
Segmentation fault (core dumped)
(py374) ubuntu1804@ec2:~/

I also tried to run https://github.com/innobi/pantab/blob/master/manylinux_build.sh but it throws the following error:

(py374) ubuntu1804@ec2:~/pantab$ sh manylinux_build.sh
manylinux_build.sh: 4: manylinux_build.sh: Bad substitution

WillAyd commented 4 years ago

Can you share the hyper file that is causing the segfault?

mathphysmx commented 4 years ago

Hi, thanks for your quick reply.

Generated using the code here https://help.tableau.com/current/api/hyper_api/en-us/reference/py/tableauhyperapi.html#sample-usage. In Windows 10 pantab.frame_from_hyper() works perfectly! Thanks for your great package.

WillAyd commented 4 years ago

I've tried to reproduce using:

  • Ubuntu 18.04
  • Python 3.8
  • Tableauhyperapi 0.0.10309
  • pantab master

And the example was read without error:

>>> pantab.frame_from_hyper("mydb.hyper", table="foo")
   a  b
0  x  1
1  y  2

@mathphysmx can you verify the versions of items being used? Also can you check for a core dump?
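One way to collect the requested version information in a single step is sketched below. It assumes Python 3.8 or later (for importlib.metadata), and the helper name installed_versions is made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The packages discussed in this thread; any that are missing show up as None.
print(installed_versions(["pantab", "tableauhyperapi", "pandas"]))
```

Pasting that dictionary into a report avoids guessing which versions were actually imported from the active environment.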

mathphysmx commented 4 years ago

Hi!

I have the same versions. Maybe the problem is the Amazon EC2 Linux machine; I have had other issues with it that do not happen on a desktop Ubuntu 18.04 installation. Versions checked using

conda create -n py38 python=3.8 pandas xlrd lxml openpyxl seaborn plotly pyodbc sqlalchemy pymysql matplotlib flask bs4 selenium unidecode tableauserverclient pantab
pip install tableauhyperapi

pip show tableauhyperapi

With no luck, I also tried installing pantab from source:

(py38) ec2@ip:~/pantab$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    from tableauhyperapi.impl.util import find_hyper_api_dll
ImportError: cannot import name 'find_hyper_api_dll' from 'tableauhyperapi.impl.util' (/home/ubuntu/miniconda3/envs/py38/lib/python3.8/site-packages/tableauhyperapi/impl/util.py)

WillAyd commented 4 years ago

Yea the building from source error looks to be from changes in the latest version of the Tableau hyper api. See https://github.com/innobi/pantab/issues/88 for description and a resolution

mathphysmx commented 4 years ago

Thanks for pantab!

mathphysmx commented 4 years ago

Hi, William!

Trying to re-use some of the functions defined in pantab, my code crashes in

https://github.com/innobi/pantab/blob/master/pantab/_reader.py

libreader.read_hyper_query(connection._cdata, query, dtype_strs)

which seems to be part of pantab/_readermodule.c (https://github.com/innobi/pantab/blob/master/pantab/_readermodule.c). I don't know the C language, so I couldn't continue. But libreader shows up in

https://github.com/innobi/pantab/blob/master/pantab/_readermodule.c

static struct PyModuleDef readermodule = {
    .m_base = PyModuleDef_HEAD_INIT,
    .m_name = "libreader",
    .m_methods = ReaderMethods};

PyMODINIT_FUNC PyInit_libreader(void) {
    PyDateTime_IMPORT;
    return PyModule_Create(&readermodule);
}

Have a good day!

WillAyd commented 4 years ago

@mathphysmx I created an EC2 image with 18.04 LTS x86 architecture and still was unable to produce a segfault. Is that the same platform as what you are using? Does the segfault happen all of the time or is it intermittent?

mathphysmx commented 4 years ago

Hi William!

I'm new to AWS, I'm using the one in the Amazon Machine Image below. Which image would you recommend?

[image: image.png]

Thanks in advance!

WillAyd commented 4 years ago

@mathphysmx I don't see an image; I'm not sure attachments come through via email, so you might need to upload it here on GitHub for it to appear

mathphysmx commented 4 years ago

Yes via email.

mathphysmx commented 4 years ago

Ubuntu Server 18.04 LTS (HVM), SSD Volume Type - ami-07ebfd5b3428b6f4d (64-bit x86) / ami-0400a1104d5b9caa1 (64-bit Arm)

WillAyd commented 4 years ago

Hmm so I suspect then there might be an issue running on the ARM architecture; I'm generally not sure how well ARM is supported in the scientific computing stack

Unfortunately there doesn't appear to be a free tier to AWS for the ARM ubuntu architecture, so not something I think I can look into at the moment. But if you can isolate where the segfault is occurring that would be great (printf is very helpful for this)

Otherwise, if you are not tied to the ARM architecture in your environment, I think it should work on x86
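Since the suspicion here hinges on the CPU architecture, it can help to confirm what the interpreter is actually running on. A small sketch using only the standard library follows; the helper names describe_platform and is_arm are illustrative, and the prefix check is a rough heuristic rather than an exhaustive list of ARM identifiers.

```python
import platform


def describe_platform():
    """Collect the basic platform facts worth including in a bug report."""
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "python": platform.python_version(),
    }


def is_arm(machine):
    # Heuristic: common ARM identifiers start with "arm" or "aarch64".
    return machine.lower().startswith(("arm", "aarch64"))


info = describe_platform()
print(info, "ARM" if is_arm(info["machine"]) else "not ARM")
```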

mathphysmx commented 4 years ago

Thanks William,

As I said previously, I think that is a problem with AWS. What EC2 instance do you have?

Thanks for your help!

WillAyd commented 4 years ago

I tried on the AWS EC2 18.04 Ubuntu LTS with x86 architecture and couldn't reproduce any issue. I think it might be specific to the ARM architecture that you are using, but there isn't a free tier to investigate that on

ShashankBharadwaj25 commented 4 years ago

I have a Windows 10 system and I'm using the latest version of pandas, pantab and tableauhyperapi. I have run into the exact same issue and I'm using tableauhyperapi temporarily. Any idea what is causing the problem?

WillAyd commented 4 years ago

Can you provide a file to reproduce the issue?

ShashankBharadwaj25 commented 4 years ago

I'm not sure if I'm allowed to share the file since it's confidential information, let me see if I can cut the data source down to just a few records and then share. Thank you for the quick response! :D

WillAyd commented 4 years ago

Sounds good. Yea if you can do that it would be super helpful!

ShashankBharadwaj25 commented 4 years ago

Hi, I did try to recreate it but am having trouble doing so. I exported the original data into an Excel sheet, loaded the new copy of the data source (after removing most of the data) into Tableau Desktop, and published the data source. Now when I download the tdsx and unzip it, there is no .hyper file there, only the Excel sheet that I provided. I'm not sure how to work around this. In what scenarios exactly is the hyper file created? What should the source of data be so that it is a .hyper file and not an Excel or CSV file or whatever I provide?
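Because the tdsx is unzipped by hand in the step above, the same check can be scripted. This is a minimal sketch assuming the download is a regular zip archive; find_hyper_members is a hypothetical helper, and the demo archive and its member names are made up for illustration.

```python
import os
import tempfile
import zipfile


def find_hyper_members(archive_path):
    """Return the names of all .hyper files stored inside a zip-based archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return [name for name in archive.namelist() if name.lower().endswith(".hyper")]


# Demo with a throwaway archive so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tdsx")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("demo.tds", "<datasource/>")
    archive.writestr("Data/Extracts/demo.hyper", b"")

print(find_hyper_members(demo_path))
```

An empty list would match the situation described here: the archive carries the original file rather than an extract.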

WillAyd commented 4 years ago

Can you save as a packaged workbook locally? Not sure but depending on your server version it might not use hyper there

ShashankBharadwaj25 commented 4 years ago

I'm using 2019.4.5. I have received a packaged workbook + datasource (.twbx), on which I have tried the steps mentioned above.

vogelsgesang commented 4 years ago

A short update on this defect:

@ShashankBharadwaj25 provided me a repro of this issue via email and thereby saved the day! 🥇

The segfault comes from a combination of unfortunate conditions all fulfilled at the same time:

I see 3 follow-up items here:

  1. Fix error propagation inside the C module. This is tackled in #91
  2. Report errors on unsupported column type, instead of silently assuming they would be strings (#92)
  3. Adding support for the Date datatype. This is still open for grabs 🙂
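The second follow-up item, failing loudly on unsupported column types instead of treating them as strings, can be sketched in plain Python. Everything below is hypothetical: the type names, the SUPPORTED_TYPES table, and resolve_dtypes are illustrations of the idea and not pantab's actual mapping or API.

```python
# Hypothetical mapping from column type names to pandas dtypes.
# Illustrative only; this is not pantab's real table.
SUPPORTED_TYPES = {
    "BIG_INT": "int64",
    "DOUBLE": "float64",
    "BOOL": "bool",
    "TEXT": "object",
    "TIMESTAMP": "datetime64[ns]",
}


def resolve_dtypes(column_types):
    """Translate column types to dtypes, raising on anything not in the table."""
    unsupported = {
        name: type_name
        for name, type_name in column_types.items()
        if type_name not in SUPPORTED_TYPES
    }
    if unsupported:
        # Report every offending column at once rather than guessing a fallback.
        raise TypeError(f"Unsupported column types: {unsupported}")
    return {name: SUPPORTED_TYPES[type_name] for name, type_name in column_types.items()}
```

With a check like this placed before any data is read, an unsupported column surfaces as an ordinary Python exception naming the column.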

ShashankBharadwaj25 commented 4 years ago

Great to see you guys working on this! Thank you so much for pantab, a really simple package to use which does wonders.

WillAyd commented 4 years ago

This should have been fixed with the 1.1.0 release of pantab, which is now available on PyPI. Be sure to run python -m pip install --upgrade pantab to get the latest and greatest