jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0

Best way to insert/append single/few records? #199

Closed (numericOverflow closed this issue 6 years ago)

numericOverflow commented 6 years ago

I know TableSaw is intended for mass insertion of data, but I've got a situation where I have most of the data, but want to add a few rows during processing. My intention is to work records much like a financial ledger, and am converting code from a python proof-of-concept that used Pandas dataframes.

My rough algorithm:

  • Load a batch of new data (hundreds to thousands of records)
  • Process-data loop:
      • insert forecast records (1 record at a time)
      • reanalyze w/new data to adjust size/amount/placement of next forecast record
      • continue processing data loop until all forecasting calculations complete and data is "balanced"

So my question is: what's the most efficient way to add small sets of rows to an existing Table?

  • Oversize the original table & update existing rows as desired?
  • Copy the table using emptyCopy(1), update values & append the new single-row table to the original table?
  • Something else?

The section of the userguide on the wordpress site https://jtablesaw.wordpress.com/user-guide/tables/ related to adding/removing rows is blank, and I haven't been able to find much in the way of examples showing what I'm looking to do. There are lots of good examples on columnar work, but not much row-wise that I've been able to find. It's like I need an appendRow() function that takes in a string/array/list/etc row and appends it to the table.

Seems like TableSaw is geared for an "insert-once, analyze-many" approach whereas my use case is an "insert-many, analyze-many" situation, so I want to be efficient in my approach. I like the flexibility & built-in analytics TableSaw has, so I wouldn't need to start from scratch with a custom approach.

Any strategy suggestions would be greatly appreciated!

lwhite1 commented 6 years ago

Hard to say without more information, but I would consider creating a second table with the same schema, adding your data to that table, and then appending the new table to the original.
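
The "second table with the same schema" idea can be sketched without depending on any particular Tablesaw version. The class below is a hypothetical toy model of a column-oriented ledger built on plain Java lists (the names ToyLedger, addRow, append and demo are invented for illustration and are not Tablesaw's API). It only shows the shape of the pattern: collect the new forecast records in a small staging structure, then append that batch to the main one with a single bulk copy per column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a column-oriented ledger; NOT Tablesaw's API.
class ToyLedger {
    // One list per column; row i is made of element i from each list.
    final List<String> account = new ArrayList<>();
    final List<String> memo = new ArrayList<>();
    final List<Double> amount = new ArrayList<>();

    // Add one record by appending a value to the end of every column.
    void addRow(String acct, String note, double amt) {
        account.add(acct);
        memo.add(note);
        amount.add(amt);
    }

    int rowCount() {
        return account.size();
    }

    // Append a whole staging batch with one bulk copy per column.
    void append(ToyLedger batch) {
        account.addAll(batch.account);
        memo.addAll(batch.memo);
        amount.addAll(batch.amount);
    }

    double total() {
        double sum = 0.0;
        for (double a : amount) {
            sum += a;
        }
        return sum;
    }

    // Build a main ledger, stage two forecast rows, then append them in one step.
    static ToyLedger demo() {
        ToyLedger main = new ToyLedger();
        main.addRow("cash", "opening balance", 100.0);
        main.addRow("rent", "november", -40.0);

        ToyLedger staging = new ToyLedger(); // same three columns as main
        staging.addRow("forecast", "december rent", -40.0);
        staging.addRow("forecast", "december income", 25.0);

        main.append(staging);
        return main;
    }

    public static void main(String[] args) {
        ToyLedger ledger = demo();
        System.out.println(ledger.rowCount() + " rows, total " + ledger.total());
    }
}
```

In a loop like the one described in the question, the staging structure could be refilled on each pass and appended once the pass is done, so the main table only ever grows at its end.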

numericOverflow commented 6 years ago

OK, so what I'm doing seems very cumbersome. I'm aware I'm kind of misusing TableSaw, but it seems like there's got to be a better way to set individual values (Cells?) in the table:

//Create 3 arrays of length=1, one for each Column to be created below
List<String> Col1Vals = Arrays.asList("Col1Row1");
List<String> Col2Vals = Arrays.asList("Col2Row1");
double[] Col3Vals = new double[1];
Col3Vals[0] = 5.0;

//Create 3 columns, from each single value array created above
CategoryColumn C1 = new CategoryColumn("COL1",Col1Vals);
CategoryColumn C2 = new CategoryColumn("COL2",Col2Vals);
DoubleColumn C3 = new DoubleColumn("COL3",Col3Vals);

//Now build the table from all 3 columns
Table t = Table.create("TBL1",C1,C2,C3);

//Output summary info about the table we just created to prove init data was loaded
traceln(t.summary());

I've been poking around the API, and I can't seem to find a good way to set or update an individual row/column value. That brings me to some questions:

  1. If I want to reuse this 1-row table and update values, how would I go about doing that?
  2. Is there some way to insert values into all 3 columns for the same row at once (ie like a traditional database insert)?
  3. If there's no way to insert a single record all-at-once, can I update each column's value individually after creating the initial table columns?
  4. Is there a way to create a table from a CSV in-memory string, instead of a CSV file? (figuring it's more efficient to create a CSV string variable and pass to the constructor instead of writing it to disk, only to read it back in )

Thanks!

benmccann commented 6 years ago

Your first 6 lines could be condensed to three:

CategoryColumn C1 = new CategoryColumn("COL1", new String[] { "Col1Row1" });
CategoryColumn C2 = new CategoryColumn("COL2", new String[] { "Col2Row1" });
DoubleColumn C3 = new DoubleColumn("COL3", new double[] { 5.0 });

Answers to your questions:

  1. Example: table.doubleColumn(0).set(2, 7.0);
  2. I don't think so currently
  3. Yes, see 1
  4. Yes. With 0.11.1 it's as easy as Table.read().csv(exampleString, "tableName")

benmccann commented 6 years ago

@numericOverflow I updated my answer for 4 to be simpler

numericOverflow commented 6 years ago

@benmccann - Thanks for adding that CSV string function, it's much cleaner! I'll download the latest release and play with it.

lwhite1 commented 6 years ago

I'm wondering if this can be closed. I think the behavior of Tablesaw is pretty good for these things:

Tablesaw is not good for inserting records in the middle of a table, or deleting them from the middle one-at-a-time. Deleting a batch of records is not too bad.

This is the nature of a column-oriented structure. Improving on that would be a bunch of work. We'd have to make a hybrid column/row-based structure like some of the more advanced big data stores.
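
To make the cost difference concrete, here is a minimal sketch of an array-backed column. It is a hypothetical toy (ToyColumn and its shifts counter are invented for illustration and say nothing about Tablesaw's actual internals), assuming only that a column stores its values contiguously. Appending writes into the next free slot, while inserting at an index has to move every later value one position to the right, and a table would repeat that for each of its columns.

```java
import java.util.Arrays;

// Hypothetical toy column backed by a contiguous array; NOT Tablesaw internals.
class ToyColumn {
    private double[] data = new double[4];
    private int size = 0;

    // Counts how many existing values were moved to make room for inserts.
    // Copies made while growing the backing array are not counted here.
    long shifts = 0;

    private void ensureCapacity() {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);
        }
    }

    // Append: write into the next free slot, no existing value moves.
    void append(double value) {
        ensureCapacity();
        data[size++] = value;
    }

    // Insert: every value at or after index moves one slot to the right.
    void insert(int index, double value) {
        if (index < 0 || index > size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        ensureCapacity();
        System.arraycopy(data, index, data, index + 1, size - index);
        shifts += size - index;
        data[index] = value;
        size++;
    }

    double get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        ToyColumn column = new ToyColumn();
        for (int i = 0; i < 1000; i++) {
            column.append(i);
        }
        System.out.println("shifts after appends: " + column.shifts);
        for (int k = 0; k < 10; k++) {
            column.insert(500, -1.0);
        }
        System.out.println("shifts after middle inserts: " + column.shifts);
    }
}
```

In this toy, a thousand appends move nothing, while each insert into the middle moves roughly half the column, which is why one-at-a-time inserts or deletes in the middle scale poorly and batching them is the cheaper option.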

numericOverflow commented 6 years ago

@lwhite1 - I'd agree, this ticket can be closed. Your examples show the best possible (mis)use of TableSaw to do what I wanted. I wouldn't expect a major re-write to improve at this time, given the original true intention of TableSaw.

Thanks for the help & adding support for direct CSV string import.