jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0

Requested array size exceeds VM limit #157

Closed · maarten-keijzer closed this 7 years ago

maarten-keijzer commented 7 years ago

I'm trying to use Tablesaw 0.9.0 for a Kaggle competition, https://www.kaggle.com/c/santander-product-recommendation. When loading the training CSV file (2.1 GB on disk) with 12 GB of heap, I get:

    Table.read().csv(trainFile);

    Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
        at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:165)
        at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:203)
        at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:38)
        at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:30)
        at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:18)
        at Test.main(Test.java:12)

It seems that Tablesaw is trying to allocate a byte array the size of the entire dataset in order to read the file. The file is bigger than the int limit, so it cannot be loaded. The Tablesaw README states that it can successfully load a 35 GB file, so I wonder if this is a regression?
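
For future readers, here is a minimal sketch of the failing pattern the stack trace points at: buffering an entire stream into a single byte[] with Guava's ByteStreams.toByteArray cannot work for a file larger than Integer.MAX_VALUE bytes, no matter how much heap is available, because Java arrays are int-indexed. The file name is the training file mentioned above; the snippet is illustrative only, not Tablesaw's actual reader code.

    import com.google.common.io.ByteStreams;

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class BufferWholeFile {
        public static void main(String[] args) throws IOException {
            // Reading the whole file into one byte[] is the pattern the stack
            // trace shows. No Java array can hold more than 2^31 - 1 elements,
            // so a file past the int limit fails with "Requested array size
            // exceeds VM limit" regardless of -Xmx. In practice it can fail
            // even earlier, since ByteArrayOutputStream grows by doubling.
            try (InputStream in = new FileInputStream("train_ver2.csv")) {
                byte[] all = ByteStreams.toByteArray(in); // throws OutOfMemoryError here
                System.out.println(all.length);
            }
        }
    }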

lwhite1 commented 7 years ago

Hi. Thank you very much for trying Tablesaw and reporting this issue. Sorry you're having trouble.

I downloaded the dataset from your competition to see what I could learn, and it does indeed look like you're seeing a bug. By my count there are fewer than 14 million rows in that table.

In general, it is certainly possible to get that error message, as we use standard Java arrays, which are limited to a maximum of 2,147,483,647 elements. Here's some info on the broader issue, included mostly for future reference: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit
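
To make that ceiling concrete, a tiny sketch; the exact length at which HotSpot reports "Requested array size exceeds VM limit" (rather than a plain heap OutOfMemoryError) is VM-dependent, typically within a few elements of Integer.MAX_VALUE:

    public class ArrayCeiling {
        public static void main(String[] args) {
            // 2,147,483,647 (Integer.MAX_VALUE) is the largest length a Java
            // array can declare; HotSpot typically rejects lengths this close
            // to the cap with "Requested array size exceeds VM limit".
            byte[] nearMax = new byte[Integer.MAX_VALUE];
            System.out.println(nearMax.length); // not reached on most VMs
        }
    }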

Also, the size of the file you can load depends greatly on the nature of the data: for some column types, the "compression" ratio versus the CSV file is much better than for others. Again, that doesn't seem to be your issue; it's more of an FYI.

lwhite1 commented 7 years ago

Reproduced the issue in 0.9.1-SNAPSHOT.

maarten-keijzer commented 7 years ago

Maybe you can suggest a workaround? I'm currently trying to do this by hand, using the CsvReader to create a Tablesaw Table myself, but I think I have to specify each column type manually. Is there some code somewhere in the library that guesses column types from a data sample?

All in all, the library seems extremely useful, so I'd love to use it. I'm just getting started, though.
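
As the next comment notes, the library does ship type guessing (CsvReader.printColumnTypes). For illustration only, here is a plain-Java sketch of the general sample-and-parse idea, assuming a comma-separated file with a header row; the naive split(",") ignores quoting, and guessing from a single row is fragile on sparse data like this file's:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.time.LocalDate;
    import java.time.format.DateTimeParseException;

    public class NaiveTypeGuess {
        // Try parsers from most specific to least; fall back to a text type.
        static String guess(String v) {
            try { Integer.parseInt(v);  return "integer-like"; } catch (NumberFormatException ignored) { }
            try { Float.parseFloat(v);  return "float-like";   } catch (NumberFormatException ignored) { }
            try { LocalDate.parse(v);   return "date-like";    } catch (DateTimeParseException ignored) { }
            return "category/text";
        }

        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("train_ver2.csv"))) {
                String[] header = in.readLine().split(",");  // naive: breaks on quoted commas
                String[] sample = in.readLine().split(",");
                for (int i = 0; i < header.length && i < sample.length; i++) {
                    System.out.printf("%-28s %s%n", header[i].trim(), guess(sample[i].trim()));
                }
            }
        }
    }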

lwhite1 commented 7 years ago

Normally you could use CsvReader.printColumnTypes(). It prints the types as a String in the form of a Java array, which you can edit to correct any wrong guesses and then pass as a parameter to the CSV reading method. However, I think you'll hit the same bug, or another one, doing that.

I will try to have a fix on maven by tomorrow US time. Apologies for the inconvenience.

benmccann commented 7 years ago

Sorry about that. I just pushed a fix. Do you guys want to test it out before we release a new version with it?

lwhite1 commented 7 years ago

Yes. I added another fix, for a column-type-guessing issue, and am running a test now.

benmccann commented 7 years ago

I just pushed one final cleanup

lwhite1 commented 7 years ago

pulling and re-testing.

lwhite1 commented 7 years ago

OK, I think the current code will get the job done. The file is tricky to parse because the data is sparse, so the column type estimates aren't perfect. Here's how to address that:

First, use printColumnTypes to get the String representation of the estimated column types:

    String result = CsvReader.printColumnTypes("train_ver2.csv", true, ',');
    System.out.println(result);

Then edit the types to correct any mistakes. I'm not sure the version below is entirely correct, but after two corrections the file does load.

    ColumnType[] columnTypes = {
            LOCAL_DATE, // 0     fecha_dato
            FLOAT,      // 1     ncodpers
            CATEGORY,   // 2     ind_empleado
            CATEGORY,   // 3     pais_residencia
            CATEGORY,   // 4     sexo
            CATEGORY,   // 5     age
            LOCAL_DATE, // 6     fecha_alta
            FLOAT,      // 7     ind_nuevo
            CATEGORY,   // 8     antiguedad
            FLOAT,      // 9     indrel
            CATEGORY,   // 10    ult_fec_cli_1t
            CATEGORY,   // 11    indrel_1mes
            CATEGORY,   // 12    tiprel_1mes
            CATEGORY,   // 13    indresi
            CATEGORY,   // 14    indext
            CATEGORY,   // 15    conyuemp
            CATEGORY,   // 16    canal_entrada
            CATEGORY,   // 17    indfall
            FLOAT,      // 18    tipodom
            FLOAT,      // 19    cod_prov
            CATEGORY,   // 20    nomprov
            FLOAT,      // 21    ind_actividad_cliente
            FLOAT,      // 22    renta
            CATEGORY,   // 23    segmento
            SHORT_INT,  // 24    ind_ahor_fin_ult1
            SHORT_INT,  // 25    ind_aval_fin_ult1
            SHORT_INT,  // 26    ind_cco_fin_ult1
            SHORT_INT,  // 27    ind_cder_fin_ult1
            SHORT_INT,  // 28    ind_cno_fin_ult1
            SHORT_INT,  // 29    ind_ctju_fin_ult1
            SHORT_INT,  // 30    ind_ctma_fin_ult1
            SHORT_INT,  // 31    ind_ctop_fin_ult1
            SHORT_INT,  // 32    ind_ctpp_fin_ult1
            SHORT_INT,  // 33    ind_deco_fin_ult1
            SHORT_INT,  // 34    ind_deme_fin_ult1
            SHORT_INT,  // 35    ind_dela_fin_ult1
            SHORT_INT,  // 36    ind_ecue_fin_ult1
            SHORT_INT,  // 37    ind_fond_fin_ult1
            SHORT_INT,  // 38    ind_hip_fin_ult1
            SHORT_INT,  // 39    ind_plan_fin_ult1
            SHORT_INT,  // 40    ind_pres_fin_ult1
            SHORT_INT,  // 41    ind_reca_fin_ult1
            SHORT_INT,  // 42    ind_tjcr_fin_ult1
            SHORT_INT,  // 43    ind_valo_fin_ult1
            SHORT_INT,  // 44    ind_viv_fin_ult1
            FLOAT,      // 45    ind_nomina_ult1
            FLOAT,      // 46    ind_nom_pens_ult1
            SHORT_INT,  // 47    ind_recibo_ult1
    };

Then load the table using your column types:

    Table table1 =
        Table.read().csv(CsvReadOptions.builder("train_ver2.csv").columnTypes(columnTypes));
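
A quick sanity check after the load, for anyone following along; rowCount() and columnCount() are long-standing Table methods, and the expected shape (just under 14 million rows, 48 columns) comes from the counts above.

    // Verify the load: expect just under 14 million rows and 48 columns.
    System.out.println(table1.rowCount() + " rows x " + table1.columnCount() + " columns");
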
lwhite1 commented 7 years ago

Ben, will you have time to deploy to Maven Central today? Thanks.

benmccann commented 7 years ago

Sure. Releasing 0.9.1 now

lwhite1 commented 7 years ago

Awesome. Thanks.
