innobi / pantab

Read/Write pandas DataFrames with Tableau Hyper Extracts
BSD 3-Clause "New" or "Revised" License

Segfault starting Nov 9 (coincides with latest tableauhyperapi library release) #116

Closed jla2w closed 3 years ago

jla2w commented 4 years ago

When creating large .hyper files from pandas DataFrames, my code started segfaulting on Nov 9. "Large" in this case is a few million rows. When I pin either the pantab version or the tableauhyperapi version to the previous release, the segfault goes away. Here's the error I get from the JRE on my production platform (I can try to reproduce it locally, which may yield better error info):

A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007fccf6b852c0, pid=1, tid=0x00007fcd08d45740
JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-b08)
Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 compressed oops)
Problematic frame: C [libtableauhyperapi.so+0x14c2c0]

To Reproduce: I create a large pandas DataFrame (>1MM rows by 50 fields) and call

import pantab
from tableauhyperapi import TableName

my_table = TableName("Extract", "Extract")
hyper_file = "./" + data_source_name + ".hyper"
pantab.frame_to_hyper(pd_dataframe, hyper_file, table=my_table)

Expected behavior: This only started segfaulting on Nov 9, with no code changes. My expectation is that the .hyper file gets created without a segfault.

Additional context: A couple of other notes. I have published 4 DataFrames to Tableau hourly for the last few months. Two of the four continued to succeed after Nov 9; the two much larger DataFrames started failing. The data in the two that started to segfault is completely different, although it's possible that this is not a size issue and is instead something in the data.

WillAyd commented 4 years ago

The problematic frame actually comes from Tableau's Hyper API. Have you tried using that directly without pantab to see if you get the same issue? If so we may need to report that upstream
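
For anyone reading along, here is a minimal sketch of how the "Problematic frame" line in a JVM crash report can be read programmatically to see which native library crashed. The helper name `native_frame` and the regular expression are illustrative assumptions, not part of pantab or the Hyper API; the sample text is the line from the report above.

```python
import re

# Matches a native (C) frame such as:
#   Problematic frame: C [libtableauhyperapi.so+0x14c2c0]
FRAME_RE = re.compile(
    r"Problematic frame:\s*C\s+\[(?P<library>[^\]]+?)\+(?P<offset>0x[0-9a-fA-F]+)\]"
)


def native_frame(report):
    """Return (library, offset) for the crashing native frame, or None if absent."""
    match = FRAME_RE.search(report)
    if match is None:
        return None
    return match.group("library"), int(match.group("offset"), 16)


sample = "Problematic frame: C [libtableauhyperapi.so+0x14c2c0]"
library, offset = native_frame(sample)
print(library, hex(offset))
```

The library name here is the Hyper API shared object rather than anything shipped by pantab itself, which is why testing the Hyper API directly is a useful next step.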

jla2w commented 4 years ago

I have not. I suspect it is tableauhyperapi and not pantab; for one thing, it started on the day of the most recent release. I also tried the following, in case it helps narrow it down.

I pinned the pantab version and NOT the tableauhyperapi version, and the code succeeded.
I pinned the tableauhyperapi version and NOT the pantab version, and it succeeded.

So if I pin pantab, does tableauhyperapi automatically get pinned to a prior version? If so, and assuming pantab depends on tableauhyperapi and not the other way around, I think it's upstream. Honestly, I just didn't see where that repo was to report it.
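
One way to see which combination actually ended up installed after pinning either package is to query the package metadata at runtime. This is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pantab", "tableauhyperapi")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print the versions in the same form pip uses for pins.
for name, ver in installed_versions().items():
    print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Running this in both the passing and the failing environment should show whether pinning one package changed the resolved version of the other.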

WillAyd commented 4 years ago

Currently pantab only enforces a minimum version of the tableauhyperapi but not an upper bound. I don't believe the tableau hyper api has a stable ABI yet, so it is possible that newer versions can cause binary incompatibility

@vogelsgesang

vogelsgesang commented 4 years ago

Hi @jla2w,

I think @WillAyd might be correct that this might be due to a binary incompatibility. I don't want to rule out a bug inside HyperAPI, yet, though...

The crash can be reproduced with

import numpy as np
import pandas as pd
import pantab

# 1,000,000 rows x 50 integer columns (every column shares the same random data)
N = 1000000
rand = np.random.randint(0, 100, size=N)
pd_dataframe = pd.DataFrame({f"col{i}": rand for i in range(50)})
pantab.frame_to_hyper(pd_dataframe, "./test.hyper", table="Extract")

using tableauhyperapi == 0.0.11691 and pantab == 1.1.1

jla2w commented 4 years ago

Thanks @WillAyd and @vogelsgesang. Let me know if this is better off in another repo/forum. This seemed the best place to report it, although I agree it seems more likely to reside in the Hyper API.

vogelsgesang commented 4 years ago

No worries, I already created a defect for this in our internal bug tracker here at Tableau/Hyper. Reporting here is fine :)

I verified that downgrading to the previous HyperAPI version fixes the issue, i.e.

pip install tableauhyperapi==0.0.11556

should unblock you for now.

Looking through which changes went into the latest release, I think this is indeed a binary incompatibility :/ As such, rebuilding pantab against the latest HyperAPI should resolve the issue. We should really find a long-term solution to move pantab over to a stable and officially supported interface...

WillAyd commented 4 years ago

Thanks for confirming! I think for now we should add an upper limit to the hyper api being used and later can bump the minimum - @jla2w would you be interested in contributing that change?
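
To illustrate what an upper limit would mean in practice, here is a rough sketch of a bounds check on the installed Hyper API version. The bounds are assumptions taken from the two versions mentioned in this thread (0.0.11556 works, 0.0.11691 crashes with pantab 1.1.1); they are not pantab's actual requirements, and the helper names are hypothetical. The packaging equivalent would be a requirement specifier of the form `tableauhyperapi>=0.0.11556,<0.0.11691`.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '0.0.11556' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_supported(installed, minimum, upper):
    """True when minimum <= installed < upper (upper bound is exclusive)."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(upper)


# Illustrative bounds only; the minimum is a placeholder, not pantab's real lower bound.
MINIMUM = "0.0.11556"
UPPER = "0.0.11691"

print(is_supported("0.0.11556", MINIMUM, UPPER))  # True
print(is_supported("0.0.11691", MINIMUM, UPPER))  # False
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "0.0.9" would sort after "0.0.11556".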


WillAyd commented 3 years ago

This should be resolved by #128, which was released in pantab 2.0.