Further Performance Optimization? #41

Closed cedricyau closed 7 months ago

cedricyau commented 4 years ago

Amazing work on the C module for enhanced performance. A question about large table conversions, since I've noticed that the Hyper API is able to COPY from a CSV faster than pantab can write a pandas DataFrame that is already in memory:

Is there room for even more improvement by passing the underlying NumPy arrays instead of iterating through the tuples, which requires wrapping and unwrapping each value in a PyObject?

https://github.com/innobi/pantab/blob/5fc6a6471e521c94dfc82d10239051997ef175f1/pantab/_writermodule.c#L273

(I don't know enough about how Pandas String Series are handled to know if this would help or not.)

Another option might be to convert the pandas DataFrame to a pyarrow Table and then pass that along, since pyarrow has nullable ints and strings. This would allow for a pyarrow.parquet.read_table().to_hyper() data path.
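For illustration, a minimal sketch of that pandas -> Arrow hop; the to_hyper entry point at the end is hypothetical and does not exist in pantab:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"ints": pd.array([1, None, 3], dtype="Int64"),
                       "text": ["a", None, "c"]})

    # Arrow preserves nullability for both the integer and string columns
    table = pa.Table.from_pandas(df)

    # hypothetical entry point, not part of pantab today:
    # pantab.table_to_hyper(table, "example.hyper", table="test")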

WillAyd commented 4 years ago

The problem with NumPy arrays as pandas uses them is that pandas is really a columnar storage format (though not explicitly stated). The Tableau Hyper API expects rows to be written, so you need to do the work of transposing from a columnar format to a row-oriented one. Because of that you have to hop around the various NumPy arrays to assemble the values for each row, so you don't get the benefit of traversing one contiguous array of values in memory (which is what makes NumPy fast).

There might be some micro-optimizations in traversing NumPy values directly, but I'm not sure they exist, and they would certainly come at the cost of more advanced development. Also keep in mind that some types in pandas aren't truly single NumPy arrays (see ExtensionArrays), so iterating via df.itertuples is a much safer bet for keeping complexity down, and it will be more general purpose when trying to get spatial data into pantab (most likely by allowing geopandas frames to be provided).
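A rough illustration of the trade-off, using a toy frame (illustrative only):

    import pandas as pd

    df = pd.DataFrame({"ints": [1, 2, 3],
                       "floats": [1.0, 2.0, 3.0],
                       "text": ["a", "b", "c"]})

    # Each column lives in its own array (different dtypes, possibly ExtensionArrays),
    # so building a row means pulling one value out of each of those arrays.
    for row in df.itertuples(index=False):
        print(row.ints, row.floats, row.text)  # one Hyper row per tuple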

Similar comment on the Arrow format; Arrow is a columnar format and Hyper is not (at least not from a serialization perspective). I can't speak to whether there would be performance differences in going pandas -> arrow -> hyper, but again it's more complexity.

Out of curiosity, what benchmarks are you using to compare COPY to pantab? That would help frame any reply.

WillAyd commented 4 years ago

By the way, I don't want my response above to be taken as a sign that I wouldn't accept contributions that use any of those methods if they improve performance. If it's a point of interest for you and you want to try, by all means! I just wanted to provide my reasoning for the existing design, in the hope that it is useful.

cedricyau commented 4 years ago

I've attached a Jupyter Notebook that uses sample data. First, I generate a 10MM row dataframe with 5 numeric columns. Then I add a sample string column and a sample datetime column.

This is exported to a CSV with pandas.DataFrame.to_csv(), which takes about 1m50s.

The DataFrame is written directly with pantab, which takes about 35 seconds.

Finally, the CSV file is COPY'd using the Tableau Hyper API, which takes 9.56 seconds.

I did notice that with numerics only the two are much closer: 11 seconds for pantab and 9 seconds for COPY.

pantab research benchmark.zip
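For reference, a minimal sketch of the setup described above; the column contents, file names, and the commented-out COPY statement are illustrative rather than taken from the notebook:

    import numpy as np
    import pandas as pd
    import pantab

    # 10MM rows with five numeric columns, plus a string and a datetime column
    n = 10_000_000
    df = pd.DataFrame(np.random.rand(n, 5), columns=list("abcde"))
    df["label"] = "sample string"
    df["ts"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(n), unit="s")

    df.to_csv("sample.csv", index=False)                     # ~1m50s in the notebook
    pantab.frame_to_hyper(df, "sample.hyper", table="test")  # ~35s in the notebook

    # The COPY comparison goes through the Hyper API directly, e.g. something like:
    # connection.execute_command("COPY \"test\" FROM 'sample.csv' WITH (format csv, header)")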

WillAyd commented 4 years ago

Cool, thanks for the file. Writing with pantab took about 30 seconds for me, but dropping the datetime column alone got that down to 12 seconds, which would be in line with the COPY.

So I believe the performance bottleneck is the datetime handling. Looking in the extension module, there are calls to PyObject_GetAttrString for the day / month / microsecond components that can probably be replaced with the appropriate macros:

https://docs.python.org/3/c-api/datetime.html?highlight=pydate#c.PyDateTime_DATE_GET_MICROSECOND

Interested in submitting a PR for that?

cedricyau commented 4 years ago

Unless I'm missing something, I believe that's already been implemented.

https://github.com/innobi/pantab/blob/5fc6a6471e521c94dfc82d10239051997ef175f1/pantab/_writermodule.c#L161

It does look like the ultimate result is to get to an int64 representation:

        int64_t val = time + (int64_t)date * MICROSECONDS_PER_DAY;

        result = hyper_inserter_buffer_add_int64(insertBuffer, val);

There may be a way to get there directly with something like df['date_column'].astype('l'), which would pass the values along as an int64 series, followed by any adjustment for whether Tableau expects nanoseconds or microseconds, plus the offset adjustment for the epoch.

Unfortunately, I'm not seeing anything in the Tableau API docs that discusses the underlying int64 value. Will research more later. https://help.tableau.com/current/api/hyper_api/en-us/reference/cxx/classhyperapi_1_1Date.html

cedricyau commented 4 years ago

Today I learned about the Julian date.

I modified the types to define the Hyper column as an int64 and then tried a few values. The int value is in microseconds, and there is a consistent difference of 210866803200000000, which translates to exactly 2440588.0 days.

With some googling, that is exactly the number of days between the (Gregorian) Unix epoch of January 1, 1970 and the Julian date epoch of noon UTC, January 1, 4713 BC.

http://seann.herdejurgen.com/resume/samag.com/html/v12/i07/a2_s1.htm

So this now looks like a matter of casting the date column in pandas to an int64 and adding that offset to get the Julian-epoch value.
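A sketch of that idea in pandas, assuming (as described above) that the Hyper raw value is microseconds since its midnight-based Julian epoch:

    import pandas as pd

    # microseconds between the Hyper/Julian epoch and the Unix epoch (2440588 days)
    HYPER_EPOCH_OFFSET_US = 210_866_803_200_000_000

    ts = pd.Series(pd.to_datetime(["1970-01-01", "2020-06-01 12:34:56"]))

    # datetime64[ns] -> int64 nanoseconds since the Unix epoch -> microseconds -> raw Hyper value
    hyper_raw = ts.astype("int64") // 1_000 + HYPER_EPOCH_OFFSET_US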

WillAyd commented 4 years ago

Ah yep, nice. Sorry, I originally misread that on my phone.

So you think the bottleneck is the calls to hyper_encode_date and hyper_encode_time? If so, it would be good to measure that and give the feedback to Tableau.

I'm not sure exactly how those functions are implemented, so I would be hesitant to roll another implementation in pantab, as differences would be slight and hard to debug.

cedricyau commented 4 years ago

I think it's the combination. There's a decent amount of calculation needed to get from the unified int64 to a year, then month, day, etc.

Then you combine everything back.

Perhaps one option is to implement a fast_date parameter that does the Unix epoch to Julian date conversion in frame_to_hyper().

What do you think?

WillAyd commented 4 years ago

If you step back through the PR for the extension module you'll see I previously did something closer to what you suggested:

https://github.com/innobi/pantab/pull/30/commits/0e2f161d1662259a8e0083913ede8c47f6d0129d

But this caused some test failures, which I think might have been the result of some off-by-one or rounding issues.

Feel free to benchmark further. For what it's worth, though, I would be really hesitant to take this one for the reasons stated above.

cedricyau commented 4 years ago

Yes, the code would result in an off-by-one error. It appears that Tableau implemented their epoch starting at midnight instead of noon, so the offset is 2440588.0 instead of 2440587.5 (which is rounded down by the int() cast in 0e2f161).

In [3]: pd.to_datetime('1970-01-01').to_julian_date()
Out[3]: 2440587.5
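For illustration, the rounding behaviour behind that off-by-one:

    import pandas as pd

    jd = pd.to_datetime("1970-01-01").to_julian_date()  # 2440587.5 (noon-based)
    print(int(jd))                                       # 2440587, one day short of the 2440588 Tableau uses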

cedricyau commented 4 years ago

Thanks to 0e2f161, I got a head start on the implementation and on tying the results back. You'll see in the attached notebook that re-querying indeed produces the correct result.

pantab research benchmark-optimized.zip

Before

Initial copy to hyper without date:
CPU times: user 6.13 s, sys: 726 ms, total: 6.86 s
Wall time: 12.7 s

Final copy to hyper with date:
CPU times: user 28.7 s, sys: 772 ms, total: 29.5 s
Wall time: 35.1 s

This gives a delta of 22.4 seconds.

After

Initial copy to hyper without date:
CPU times: user 6.25 s, sys: 673 ms, total: 6.92 s
Wall time: 13.2 s

Final copy to hyper with date:
CPU times: user 8.19 s, sys: 1.06 s, total: 9.26 s
Wall time: 16.6 s

This gives a delta of 3.4 seconds, which is a 6.58x speed-up.


Also included in the revised Jupyter notebook is a section with some Cython testing to see how quickly the DataFrame can be traversed when presented as NumPy arrays. It looks like the 10MM rows with 70MM values can be traversed in about 111 ms, so there could probably be another 2x gain, judging by how fast the code runs with the writeData call that inserts the data into Hyper commented out.
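In pure Python the idea looks roughly like the sketch below (the notebook does the actual traversal in Cython; the frame here is just a small stand-in):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

    # Pull each column out as a NumPy array once, then walk rows by position
    # instead of materializing a Python tuple per row.
    cols = [df[c].to_numpy() for c in df.columns]
    for i in range(len(df)):
        row = [col[i] for col in cols]  # the Hyper insert call would go here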

Not sure that is worth the implementation risks as you indicated before.

WillAyd commented 4 years ago

Great benchmarks again. If you can isolate the time spent in the Hyper calls somehow, we should provide that to Tableau. There's nothing magic we'd be doing outside of rather plain addition / multiplication, so it's surprising it would make that much of a difference.

WillAyd commented 4 years ago

Note that they offer C++ bindings from their site, and that is what most of their developers use. You might be able to copy the samples provided with that and tweak the datetime examples to see how those run.

cedricyau commented 4 years ago

In [5]: dt = datetime.now()

In [9]: %timeit dt.fromtimestamp(dt.timestamp())
1.59 µs ± 6.03 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: %timeit dt.timestamp()
851 ns ± 0.441 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [11]: ts = dt.timestamp()

In [12]: %timeit datetime.fromtimestamp(ts)
778 ns ± 0.327 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Although the above uses floating-point timestamps, the time needed to convert 10MM datetimes to float and back is about 16 seconds (10,000,000 × ~1.6 µs), which correlates with the speed-up from removing the conversion step and passing the int64 directly.

I don’t think the speed issue is specific to the Hyper implementation.

Maybe we can ask Tableau to disclose the specifics of what the int64 represents, to confirm?

WillAyd commented 4 years ago

Thanks! But unfortunately comparing those calls in Python space isn't totally relevant, as they introduce overhead in their own constructors. We would need some way to isolate what is going on down at the C level, based on the previous info you provided.

cedricyau commented 4 years ago

Found a few definitions on the C++ side of things. It is indeed Julian Day, but they don't say anything about noon vs midnight.

include/hyperapi/Date.hpp:123

    /**
     * The raw date value. This is the Julian Day, which is the number of days since 1 January 4713 BC.
     */
    hyper_date_t m_representation = 0;

include/hyperapi/Timestamp.hpp:111

    /**
     * Gets the raw timestamp value encoded as microseconds since 1 January 4713 BC.
     *
     * \return The raw timestamp value.
     */
    hyper_timestamp_t getRaw() const noexcept;

    /**
     * The raw timestamp value encoded as microseconds since 1 January 4713 BC.
     */
    hyper_timestamp_t m_representation = 0;

    /**
     * The date component of the raw timestamp value.
     */
    Date m_date;

    /**
     * The time component of the raw timestamp value.
     */
    Time m_time;

cedricyau commented 4 years ago

Tableau has an official response regarding dates:

https://community.tableau.com/message/1013552?et=watches.email.thread#1013552

cedricyau commented 3 years ago

Tableau changed their URL structure. Link is now:

https://community.tableau.com/s/question/0D54T00000C5Qd1SAF/tableau-hyper-api-datetime-int64-format

cedricyau commented 3 years ago

Finally getting around to trying to do a PR. #125

WillAyd commented 8 months ago

With pantab 4.0, I think the biggest thing we can do is remove the chrono translations / dependencies. However, the Tableau Hyper API today does not allow for construction of date / time objects from time points, instead requiring a date/time struct. The Tableau team agreed with the feedback that this would be nice, so at this point we are just waiting for that to be implemented.

WillAyd commented 7 months ago

Closing for now, as nothing actionable can be done right now. Will keep tabs on the Hyper API to see how we can avoid the chrono dependency in the future.