Open tischi opened 2 years ago
Table loading speed can be improved by passing the ColumnTypes to the ReadOptionsBuilder. This will let it skip the column-type detection stage.
On Tue, Oct 18, 2022 at 5:44 AM, Christian Tischer wrote:
Hi,
We are trying to load this table https://raw.githubusercontent.com/mobie/spatial-transcriptomics-example-project/main/data/pos42/tables/transcriptome/default.tsv into TableSaw.
We downloaded above file onto a SSD disk and are running this code:
final String tableSource = "/Users/tischer/Desktop/default.tsv";
System.out.println("Table source: " + tableSource);
builder = CsvReadOptions.builder( new File( tableSource ) ).separator( '\t' ).missingValueIndicator( "na", "none", "nan" );
start = System.currentTimeMillis();
Table.read().usingOptions( builder );
System.out.println("Build Table from File [ms]: " + ( System.currentTimeMillis() - start ));
This takes around 1600 ms.
Do you have any suggestions for how to potentially speed this up? We are also open to storing the table in another file format if that would help.
Thank you very much!
Thank you!
I changed the code like this:
final ColumnType[] types = new ColumnType[ 9 ];
for ( int i = 0; i < types.length; i++ )
types[ i ] = ColumnType.STRING;
builder = CsvReadOptions.builderFromString( tableString ).separator( '\t' ).missingValueIndicator( "na", "none", "nan" ).columnTypes( types );
...but it did not increase the parsing speed. Maybe I am doing something wrong?
The String column type uses dictionary encoding, so using it for essentially unique values (like the x, y, and z coordinates) creates a lot of wasted objects. Try creating the table with an appropriate type for each column and see if that helps. Probably only geneid should be a String.
If you really want everything as a single string type, use TextColumn (if it's in the version you're using), but ints and floats would be much better from a memory usage perspective.
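A minimal sketch of what per-column types could look like for a file like this one. The 9-slot layout and the individual type choices are assumptions about the TSV's structure, not its actual schema:

```java
import java.io.File;
import tech.tablesaw.api.ColumnType;
import tech.tablesaw.api.Table;
import tech.tablesaw.io.csv.CsvReadOptions;

public class MixedTypesRead {
    public static Table load(String path) {
        // Numeric columns as INTEGER/DOUBLE avoid the per-value dictionary
        // work that STRING columns pay; only genuinely categorical/text
        // columns (like geneid) should be STRING. This column order is a
        // guess for illustration.
        ColumnType[] types = {
            ColumnType.INTEGER, // row index (assumed)
            ColumnType.DOUBLE,  // x
            ColumnType.DOUBLE,  // y
            ColumnType.DOUBLE,  // z
            ColumnType.STRING,  // geneid
            ColumnType.DOUBLE,  // remaining numeric columns (assumed)
            ColumnType.DOUBLE,
            ColumnType.DOUBLE,
            ColumnType.DOUBLE
        };
        return Table.read().usingOptions(
            CsvReadOptions.builder( new File( path ) )
                .separator( '\t' )
                .missingValueIndicator( "na", "none", "nan" )
                .columnTypes( types )
                .build() );
    }
}
```

Passing the full type array also skips the type-detection pass entirely, since Tablesaw no longer has to sample the file to guess each column's type.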
I tried but it does not seem to speed it up.
I have to admit I am very confused about the benchmarking:
start = System.currentTimeMillis();
builder = CsvReadOptions.builderFromString( tableString ).separator( '\t' ).missingValueIndicator( "na", "none", "nan" ).columnTypes( types );
final Table rows = Table.read().usingOptions( builder );
System.out.println("Parse Table from String [ms]: " + ( System.currentTimeMillis() - start ));
start = System.currentTimeMillis();
builder = CsvReadOptions.builderFromString( tableString ).separator( '\t' ).missingValueIndicator( "na", "none", "nan" ).columnTypes( types );
final Table rows2 = Table.read().usingOptions( builder );
System.out.println("Parse Table from String [ms]: " + ( System.currentTimeMillis() - start ));
yields:
Parse Table from String [ms]: 1686
Parse Table from String [ms]: 246
I find this very confusing. Is there some caching going on? If so, how would the code know that it is parsing the same data twice? Or is it the Java compiler recognizing that this is the same data both times?
Do you have any experience with this?
It's the standard behavior of code when you measure how long it takes to run: the JIT compiler makes subsequent runs much faster. This is why benchmarks generally run a block of code repeatedly to "warm" it before taking timings.
The data isn't being cached; the code itself executes faster.
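A self-contained sketch of that warm-up pattern, with the table parsing stood in for by a hypothetical dummy workload so the example runs on its own:

```java
import java.util.function.Supplier;

public class WarmBench {
    /**
     * Runs the task warmupRuns times (results discarded) so the JIT has a
     * chance to compile the hot paths, then times one more run.
     * Returns the elapsed time of the measured run in milliseconds.
     */
    public static <T> long timeAfterWarmup(Supplier<T> task, int warmupRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.get(); // warm-up: not timed
        }
        long start = System.nanoTime();
        task.get();     // the measured run
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Stand-in workload: repeated number formatting and parsing.
        Supplier<Long> work = () -> {
            long sum = 0;
            for (int i = 0; i < 100_000; i++) {
                sum += Long.parseLong(Integer.toString(i));
            }
            return sum;
        };
        System.out.println("Warm time [ms]: " + timeAfterWarmup(work, 5));
    }
}
```

For serious measurements a harness like JMH does this (plus forking and statistical runs) properly, but the idea is the same: never trust the first timing of a cold JVM.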
I made another test, where I just read a table with only two rows!
Table source: /Users/tischer/Desktop/default_regions.tsv
Build Table from File [ms]: 1206
Table source: /Users/tischer/Desktop/default_regions.tsv
Build Table from File [ms]: 12
This is extreme ;-)
Is that the JIT building all the code for parsing tables during the first go?
If so, do you have any experience with multi-threading in that regard? To me this suggests that it could actually be better to read many tables sequentially rather than in parallel, to give the JIT a chance to compile the code first?!
Yes, it's the JVM loading the code from the libraries, and the JIT compiler optimizing it. As far as I know there is no easy way to speed this up (and it's already highly optimized). This is why code performance is a particularly tricky topic in Java...
I gave a try to your file in parquet.
Build Table from csv [ms]: 1164
Write Table to parquet [ms]: 904
Build Table from parquet [ms]: 181
It looks much faster in parquet, but the JVM is already warm when I read the parquet back.
Build Table from csv [ms]: 1226
Build Table from csv [ms]: 225
Build Table from csv [ms]: 208
Build Table from csv [ms]: 209
Build Table from parquet [ms]: 1554
Build Table from parquet [ms]: 85
Build Table from parquet [ms]: 68
Build Table from parquet [ms]: 69
While on a warm JVM reading the parquet file is consistently faster (because it's a binary format), on a cold JVM it is actually slower, probably because there is more code to load and/or optimize.
You can see code loading taking its toll on the first run if you compare the parquet reader log to the externally timed operation:
DEBUG: Finished reading 100541 rows from default.tsv.parquet in 969 ms
Build Table from parquet [ms]: 1554
Context is very important for performance considerations. Hope this helps.
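For reference, parquet support is not part of Tablesaw core; it comes from the third-party tablesaw-parquet project (net.tlabs-data:tablesaw_parquet). A round-trip sketch, with class and package names taken from that library's documentation — they are assumptions here and may differ between versions:

```java
// Assumes the third-party net.tlabs-data:tablesaw_parquet dependency is on
// the classpath; these classes are NOT part of Tablesaw core.
import tech.tablesaw.api.Table;
import net.tlabs.tablesaw.parquet.TablesawParquetReader;
import net.tlabs.tablesaw.parquet.TablesawParquetWriter;
import net.tlabs.tablesaw.parquet.TablesawParquetReadOptions;
import net.tlabs.tablesaw.parquet.TablesawParquetWriteOptions;

public class ParquetRoundTrip {
    /** Writes the table to parquet at the given path and reads it back. */
    public static Table roundTrip(Table table, String path) {
        new TablesawParquetWriter().write(
            table, TablesawParquetWriteOptions.builder( path ).build() );
        return new TablesawParquetReader().read(
            TablesawParquetReadOptions.builder( path ).build() );
    }
}
```

Since parquet stores column types in the file itself, reading it back also avoids the type-detection cost discussed above.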