spotfiresoftware / spotfire-python

Package for Building Python Extensions to Spotfire®

append mode to sbdf export #45

Open lwlwlwlw opened 1 year ago

lwlwlwlw commented 1 year ago

Feature request: Would it be possible to add an append mode to sbdf export (append data to an existing sbdf file)? Thank you.

bbassett-tibco commented 1 year ago

Hi @lwlwlwlw! Can you be more specific (perhaps with an example of what API you are expecting) about what kind of appending you are looking for? Are you looking for appending rows, appending columns, or some other concept of appending?

I'll also comment that the process of appending is complicated by the inherent structure of SBDF files. They are laid out as a sequence of table slices (each covering a number of rows) that contain a sequence of column slices (each holding all the values of one column for the rows covered by the containing table slice). Appending a row involves rewriting (and growing) all the column slices in the last table slice; appending a column would require rewriting all table slices (and would probably require rebalancing rows between slices, since there is a target number of values (rows × columns) in each table slice for performance reasons).

In general, it's probably easier to import the data from the file, make whatever modifications are desired, and then export it again.
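For a row append, that workaround might look something like the sketch below (assuming spotfire.sbdf's import_data/export_data round-trip a pandas DataFrame; the file path and column names are purely illustrative):

```python
import pandas as pd
import spotfire.sbdf as sbdf

# Read the existing file back into a pandas DataFrame.
existing = sbdf.import_data("d:/tmp/file.sbdf")

# new_rows holds the incremental data and must match the existing schema.
new_rows = pd.DataFrame({"col_a": [1, 2], "col_b": ["x", "y"]})

# Concatenate in memory and rewrite the whole file.
combined = pd.concat([existing, new_rows], ignore_index=True)
sbdf.export_data(combined, "d:/tmp/file.sbdf")
```

The trade-off is that the full dataset passes through memory on every append, which follows from the slice structure described above.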

lwlwlwlw commented 1 year ago

@bbassett-tibco Thank you for your reply.

One of our customers wants to append data (rows) to an existing sbdf file because the data arrives incrementally.

Preferably something like this (an "append=True" option to indicate appending):

```python
import pandas as pd
import spotfire.sbdf as sb

df = pd.DataFrame(...)
sb.export_data(df, "d:/tmp/file.sbdf", append=True)
```

bschwartzjetrock commented 1 month ago

Hello!

I'd like to second lwlwlwlw's request for an append mode for SBDF export and provide additional context for why this feature would be extremely valuable.

Many of my clients require processing and exporting of large amounts of data (often exceeding available RAM) from various file formats and SQL databases into SBDF files. Our typical workflow involves Python processes where we perform data cleaning and formatting before converting to SBDF. This approach ensures that the Spotfire project loads pre-processed, clean data, significantly improving load times and project performance.

However, I am facing challenges with the existing export_data function, which seems to be designed primarily for in-memory pandas DataFrames. This becomes problematic when dealing with datasets that exceed available RAM.

Currently, my workaround is to split larger datasets into multiple SBDF files and Spotfire "concatenates" them as the project loads, but this increases loading time. This is particularly inefficient given the explanation provided earlier about the complexity of appending: "To append a row involves rewriting (and growing) all the column slices in the last table slice; appending a column will have to rewrite all table slices (and would probably require rebalancing rows between slices since there is target number of values (rows x columns) in each table slice for performance reasons)."
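For reference, that split-into-parts workaround looks roughly like the sketch below (the CSV source, chunk size, and file names are illustrative; only pandas' chunked read_csv and spotfire.sbdf.export_data are assumed):

```python
import pandas as pd
import spotfire.sbdf as sbdf

# Stream the large source in chunks and write one SBDF part per chunk;
# Spotfire then "concatenates" the parts as the project loads.
chunk_size = 1_000_000  # rows per part, illustrative
for i, chunk in enumerate(pd.read_csv("d:/tmp/big_source.csv", chunksize=chunk_size)):
    # ... data cleaning and formatting happens here ...
    sbdf.export_data(chunk, f"d:/tmp/clean_part_{i:04d}.sbdf")
```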

Given these challenges, I am wondering if you could provide guidance or consider implementing features that allow for more memory-efficient handling of large datasets during the export process. Specifically:

  1. An append mode in the spotfire-python API that efficiently handles the rewriting and rebalancing of slices.
  2. A method for streaming or chunked processing that allows writing to SBDF without loading entire datasets into memory, which would work well with our existing data cleaning pipelines.
  3. A version of the export function that takes a file with appending capabilities as input (like Parquet) and performs a file-to-file transfer (see the sketch after this list).
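For concreteness, request 3 could be approximated in user code today with something like the sketch below (Parquet as the appendable staging format; pyarrow is assumed to be installed, and the helper names and paths are purely illustrative, not an existing spotfire-python API):

```python
import pandas as pd
import pyarrow.parquet as pq
import spotfire.sbdf as sbdf

def append_increment(df: pd.DataFrame, part_path: str) -> None:
    # Appends are cheap: each increment becomes one Parquet part file.
    df.to_parquet(part_path)

def build_sbdf(parquet_dir: str, sbdf_path: str) -> None:
    # Read the whole Parquet dataset (a directory of part files) and export it.
    # Note: this still materializes the table in memory; a native file-to-file
    # path inside the library is what would avoid that.
    table = pq.read_table(parquet_dir)
    sbdf.export_data(table.to_pandas(), sbdf_path)
```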

I understand that the SBDF file structure makes this complex, but any insights or potential solutions would be greatly appreciated. If full append functionality isn't feasible, are there alternative approaches or best practices you'd recommend for handling these large, pre-processed datasets more efficiently?

Thank you for your consideration of this feature request and any guidance you can provide.

bbassett-tibco commented 1 month ago

OK, given @bschwartzjetrock's well-written problem description, I'm beginning to think about what a potential solution to this request might look like.

We can definitely investigate the concept further in a future release.