append mode to sbdf export

lwlwlwlw commented 1 year ago

Feature request: Would it be possible to add an append mode to the SBDF export (appending data to an existing SBDF file)? Thank you.

bbassett-tibco commented 1 year ago

Hi @lwlwlwlw! Can you be more specific (perhaps with an example of what API you are expecting) about what kind of appending you are looking for? Are you looking for appending rows, appending columns, or some other concept of appending?

I'll also comment that the process of appending is complicated by the inherent structure of SBDF files. They are laid out as a sequence of table slices (each consisting of a number of rows) that contain a sequence of column slices (each consisting of all the values in one column in the rows covered by the containing table slice). To append a row involves rewriting (and growing) all the column slices in the last table slice; appending a column will have to rewrite all table slices (and would probably require rebalancing rows between slices, since there is a target number of values (rows x columns) in each table slice for performance reasons).
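To make the cost of each kind of append concrete, here is a minimal in-memory sketch of the layout described above. It is a toy model, not the actual SBDF binary format or the sbdf-c API: the names TableSlice, append_row, and append_column are hypothetical, and rebalancing rows between slices is deliberately left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class TableSlice:
    """Toy stand-in for one table slice (hypothetical, not the real format)."""
    # column name -> column slice: that column's values for this slice's rows
    columns: Dict[str, List[Any]]

    @property
    def row_count(self) -> int:
        # All column slices in a table slice cover the same rows.
        return len(next(iter(self.columns.values()), []))


def append_row(slices: List[TableSlice], row: Dict[str, Any]) -> int:
    """Append one row; return the number of column slices that had to grow."""
    last = slices[-1]
    for name, values in last.columns.items():
        values.append(row[name])  # every column slice in the last table slice changes
    return len(last.columns)


def append_column(slices: List[TableSlice], name: str, values: Sequence[Any]) -> int:
    """Append one column; return the number of table slices that had to change."""
    offset = 0
    for table_slice in slices:
        n = table_slice.row_count
        # Each table slice gains a new column slice covering its own rows.
        table_slice.columns[name] = list(values[offset:offset + n])
        offset += n
    return len(slices)
```

In this model a row append only touches the last table slice, while a column append touches every table slice, which matches the asymmetry described above.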

In general, it's probably easier to import the data from the file, make any desired modifications to the data, and then export it again.
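As a rough sketch of that import, modify, export round trip for the row-append case: the helper name append_rows_by_roundtrip is hypothetical, and it assumes new_rows is a pandas DataFrame with the same columns as the file. Only import_data and export_data come from the spotfire package.

```python
def append_rows_by_roundtrip(sbdf_path, new_rows):
    """Rewrite sbdf_path with new_rows added at the end; return the new row count.

    Hypothetical helper. The whole file is loaded into memory, so this only
    helps when the existing data fits in RAM.
    """
    # Imported here so the sketch can be loaded without the packages installed.
    import pandas as pd
    import spotfire.sbdf as sbdf

    existing = sbdf.import_data(sbdf_path)       # read the full table
    combined = pd.concat([existing, new_rows], ignore_index=True)
    sbdf.export_data(combined, sbdf_path)        # write the full table back
    return len(combined)
```

The cost is a full read and a full write on every append, so this suits small or infrequent increments rather than large incremental loads.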

lwlwlwlw commented 1 year ago

@bbassett-tibco Thank you for your reply.

One of our customers wants to append data (rows) to an existing SBDF file because the data is incremental.

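Preferably something like this (an "append=True" option to indicate appending):

    import pandas as pd
    import spotfire.sbdf as sb

    df = pd.DataFrame(...)
    # Proposed option (not in the current API): append df's rows to the existing file
    sb.export_data(df, "d:/tmp/file.sbdf", append=True)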

bschwartzjetrock commented 1 month ago

Hello!

I'd like to second lwlwlwlw's request for an append mode to SBDF files and provide additional context for why this feature would be extremely valuable.

Many of my clients require processing and exporting of large amounts of data (often exceeding available RAM) from various file formats and SQL databases into SBDF files. Our typical workflow involves Python processes where we perform data cleaning and formatting before converting to SBDF. This approach ensures that the Spotfire project loads pre-processed, clean data, significantly improving load times and project performance.

However, I am facing challenges with the existing export_data function, which seems to be designed primarily for in-memory pandas DataFrames. This becomes problematic when dealing with datasets that exceed available RAM.

Currently, my workaround is to split larger datasets into multiple SBDF files and let Spotfire "concatenate" them as the project loads, but this increases loading time. This is particularly inefficient given the explanation provided earlier about the complexity of appending: "To append a row involves rewriting (and growing) all the column slices in the last table slice; appending a column will have to rewrite all table slices (and would probably require rebalancing rows between slices, since there is a target number of values (rows x columns) in each table slice for performance reasons)."
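For reference, the split-into-parts workaround can be sketched roughly as follows. The helper names chunked and export_in_parts, the part-file naming scheme, and the write_part callback are all assumptions made for illustration; the writer is passed in so the chunking logic stays independent of how each part is exported.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `size` rows without materializing the whole input."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def export_in_parts(
    rows: Iterable[Row],
    size: int,
    write_part: Callable[[List[Row], str], None],
    stem: str = "part",
) -> List[str]:
    """Write each chunk to its own numbered file; return the paths in order."""
    paths = []
    for index, chunk in enumerate(chunked(rows, size)):
        path = f"{stem}_{index:04d}.sbdf"  # hypothetical naming scheme
        write_part(chunk, path)
        paths.append(path)
    return paths
```

With pandas and the spotfire package installed, write_part could be something like a function that calls sbdf.export_data(pd.DataFrame(chunk), path). Only one chunk is held in memory at a time, but the result is still several files that Spotfire has to combine at load time.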

Given these challenges, I am wondering if you could provide guidance or consider implementing features that allow for more memory-efficient handling of large datasets during the export process. Specifically:

1) An append mode in the spotfire-python API that efficiently handles the rewriting and rebalancing of slices.
2) A method for streaming or chunked processing that allows writing to SBDF without loading entire datasets into memory, which would work well with our existing data cleaning pipelines.
3) A version of the export function that takes a file with appending capabilities as input (like Parquet) and performs a file-to-file transfer.

I understand that the SBDF file structure makes this complex, but any insights or potential solutions would be greatly appreciated. If full append functionality isn't feasible, are there alternative approaches or best practices you'd recommend for handling these large, pre-processed datasets more efficiently?

Thank you for your consideration of this feature request and any guidance you can provide.

bbassett-tibco commented 1 month ago

OK, given @bschwartzjetrock's well-written problem description, I'm beginning to think that a potential solution to this request would look like:

If the goal is to avoid loading the full SBDF file into memory, we'll need to work between an input and an output SBDF filename (the 'sbdf-c' library doesn't have support for changing the contents of an SBDF file, only for reading from or writing to one).
export_data is probably not the right function to add this to. There is an impedance mismatch in this case: 1) export_data's argument list is not set up for two SBDF filenames; 2) export_data allows for data of different shapes, while appending requires a specific shape for the data. We should add a new function (or pair), tentatively to be called append_rows (and append_columns, if we implement it).
We'll definitely have the new function(s) raise an error if the data to append does not match the shape and Spotfire types of the existing SBDF file (the column version would have a way to pass in Spotfire typing information).
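The shape and type check in the last point might look something like the sketch below. Nothing here exists in spotfire-python today: check_appendable is a hypothetical name, and describing a table as a list of (column name, Spotfire type name) pairs is an assumption made purely for illustration.

```python
from typing import Sequence, Tuple

# (column name, Spotfire type name), e.g. ("price", "Real")
Column = Tuple[str, str]


def check_appendable(existing: Sequence[Column], incoming: Sequence[Column]) -> None:
    """Raise ValueError unless `incoming` has the same columns, in the same
    order and with the same types, as `existing`."""
    if len(existing) != len(incoming):
        raise ValueError(
            f"column count differs: file has {len(existing)}, data has {len(incoming)}"
        )
    for position, (old, new) in enumerate(zip(existing, incoming)):
        old_name, old_type = old
        new_name, new_type = new
        if old_name != new_name:
            raise ValueError(
                f"column {position}: file has {old_name!r}, data has {new_name!r}"
            )
        if old_type != new_type:
            raise ValueError(
                f"column {old_name!r}: file type {old_type}, data type {new_type}"
            )
```

Running a check like this before any bytes are written would let a row append fail early instead of leaving a partially written output file.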

We can definitely investigate the concept further in a future release.

spotfiresoftware / spotfire-python

append mode to sbdf export #45