jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0

Increase table loading speed possible? #1162

Open tischi opened 2 years ago

tischi commented 2 years ago

Hi,

We are trying to load this table (https://raw.githubusercontent.com/mobie/spatial-transcriptomics-example-project/main/data/pos42/tables/transcriptome/default.tsv) into Tablesaw.

We downloaded the above file onto an SSD and are running this code:

final String tableSource = "/Users/tischer/Desktop/default.tsv";
System.out.println( "Table source: " + tableSource );
CsvReadOptions.Builder builder = CsvReadOptions.builder( new File( tableSource ) )
        .separator( '\t' )
        .missingValueIndicator( "na", "none", "nan" );
long start = System.currentTimeMillis();
Table.read().usingOptions( builder );
System.out.println( "Build Table from File [ms]: " + ( System.currentTimeMillis() - start ) );

This takes around 1600 ms.

Do you have any suggestions for how to potentially speed this up? We are also open to storing the table in another file format if that would help.

Thank you very much!

lwhite1 commented 2 years ago

Table loading speed can be improved by passing the ColumnTypes to the ReadOptionsBuilder. This will let it skip the column-type detection stage.
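
A minimal sketch of that pattern, assuming a nine-column file like the one above (the individual types below are placeholders; they must match the file's actual schema, in column order):

import java.io.File;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

// With columnTypes() supplied, the reader skips its type-detection pass
// over the file and parses straight into the given column types.
ColumnType[] types = {
        ColumnType.STRING,                                      // placeholder
        ColumnType.FLOAT, ColumnType.FLOAT, ColumnType.FLOAT,   // placeholder
        ColumnType.INTEGER, ColumnType.INTEGER,                 // placeholder
        ColumnType.INTEGER, ColumnType.INTEGER, ColumnType.INTEGER
};

CsvReadOptions.Builder builder = CsvReadOptions.builder( new File( "/Users/tischer/Desktop/default.tsv" ) )
        .separator( '\t' )
        .missingValueIndicator( "na", "none", "nan" )
        .columnTypes( types );
Table table = Table.read().usingOptions( builder );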

tischi commented 2 years ago

Thank you!

I changed the code like this:

final ColumnType[] types = new ColumnType[ 9 ];
for ( int i = 0; i < types.length; i++ )
    types[ i ] = ColumnType.STRING;
final CsvReadOptions.Builder builder = CsvReadOptions.builderFromString( tableString )
        .separator( '\t' )
        .missingValueIndicator( "na", "none", "nan" )
        .columnTypes( types );

...but it did not increase the parsing speed. Maybe I am doing something wrong?

lwhite1 commented 2 years ago

The String column type uses dictionary encoding. Using it for effectively unique values, such as the x, y, and z coordinates, creates a lot of wasted objects. Try creating the table using appropriate types for each of the columns and see if it helps. Probably only geneid should be a String.

If you really want everything as a single string type, use TextColumn (if it's in the version you're using), but ints and floats would be much better from a memory usage perspective.
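
For example, using the columns named above (geneid and the x, y, z coordinates; their positions and the remaining five types are hypothetical):

// STRING is dictionary-encoded: each distinct value becomes a dictionary
// entry, so near-unique values like coordinates gain nothing and pay the
// per-object overhead. Primitive-backed types avoid that entirely.
ColumnType[] types = {
        ColumnType.STRING,                                     // geneid
        ColumnType.FLOAT, ColumnType.FLOAT, ColumnType.FLOAT,  // x, y, z
        ColumnType.INTEGER, ColumnType.INTEGER,                // hypothetical
        ColumnType.INTEGER, ColumnType.INTEGER, ColumnType.INTEGER
};
// If everything really must be a string, ColumnType.TEXT (TextColumn, in
// versions that include it) stores values without the dictionary.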

tischi commented 2 years ago

I tried but it does not seem to speed it up.

I have to admit I am very confused about the benchmarking:

        start = System.currentTimeMillis();
        builder = CsvReadOptions.builderFromString( tableString ).separator( '\t' ).missingValueIndicator( "na", "none", "nan" ).columnTypes( types );
        final Table rows = Table.read().usingOptions( builder );
        System.out.println("Parse Table from String [ms]: " + ( System.currentTimeMillis() - start ));

        start = System.currentTimeMillis();
        builder = CsvReadOptions.builderFromString( tableString ).separator( '\t' ).missingValueIndicator( "na", "none", "nan" ).columnTypes( types );
        final Table rows2 = Table.read().usingOptions( builder );
        System.out.println("Parse Table from String [ms]: " + ( System.currentTimeMillis() - start ));

yields:

Parse Table from String [ms]: 1686
Parse Table from String [ms]: 246

I find this very confusing. Are you doing any sort of caching there? But how would the code know that it is parsing the same data twice? Or is it the Java compiler that sees it is the same data both times?

Do you have any experience with this?

lwhite1 commented 2 years ago

It's the standard behavior of code when you measure how long it takes to run: the JIT compiler generally makes subsequent runs much faster. This is why benchmarks typically run a block of code repeatedly to "warm" it before checking the time.

The data isn't being cached; the code itself executes faster.
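
A minimal sketch of that warm-up pattern (for rigorous numbers a harness such as JMH is the usual choice; this just isolates the effect):

CsvReadOptions.Builder builder = CsvReadOptions.builder( new File( "/Users/tischer/Desktop/default.tsv" ) )
        .separator( '\t' )
        .missingValueIndicator( "na", "none", "nan" );

// Warm-up: run the parsing code a few times so the JIT compiles it.
for ( int i = 0; i < 5; i++ )
    Table.read().usingOptions( builder );

// Only now measure: this timing reflects the optimized code paths.
long start = System.currentTimeMillis();
Table table = Table.read().usingOptions( builder );
System.out.println( "Warm parse [ms]: " + ( System.currentTimeMillis() - start ) );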

tischi commented 2 years ago

I ran another test, where I just read a table with only two rows!

Table source: /Users/tischer/Desktop/default_regions.tsv
Build Table from File [ms]: 1206
Table source: /Users/tischer/Desktop/default_regions.tsv
Build Table from File [ms]: 12

This is extreme ;-)

Is that the JIT building all the code for parsing tables during the first go?

If so, do you have any experience with multi-threading in that regard? To me this suggests that it could in fact be better to read many tables sequentially rather than in parallel, to give the JIT a chance to compile the code first?!

ccleva commented 2 years ago

Yes, it's the JVM loading the code from the libraries, and the JIT compiler optimizing it. As far as I know there is no easy way to speed this up (and it's already highly optimized). This is why code performance is a particularly tricky topic in Java...

I gave your file a try in parquet:

Build Table from parquet [ms]: 1554
Build Table from parquet [ms]: 85
Build Table from parquet [ms]: 68
Build Table from parquet [ms]: 69

While on a warm JVM reading the parquet file is consistently faster (because it's a binary format), on a cold JVM it is actually slower, probably because there is more code to load and/or optimize.
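
For reference, a sketch of that Parquet round trip, assuming the tablesaw-parquet add-on (the TablesawParquetReader/Writer classes below come from that library, not from core Tablesaw, and their use here is an assumption):

import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import tech.tablesaw.api.Table;

// One-time conversion: write the TSV-derived table out as Parquet.
new TablesawParquetWriter().write( table, TablesawParquetWriteOptions.builder( "default.tsv.parquet" ).build() );

// Later loads read the binary file: no text parsing, no type detection.
Table fromParquet = new TablesawParquetReader().read( TablesawParquetReadOptions.builder( "default.tsv.parquet" ).build() );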

You can see code loading taking its toll on the first run if you compare the parquet reader log to the externally timed operation:

DEBUG: Finished reading 100541 rows from default.tsv.parquet in 969 ms
Build Table from parquet [ms]: 1554

Context is very important for performance considerations. Hope this helps.