maarten-keijzer closed this issue 7 years ago
Hi. Thank you very much for trying Tablesaw and reporting this issue. Sorry you're having trouble.
I downloaded the dataset from your competition to see what I could learn, and it does indeed look like you're hitting a bug: by my count there are fewer than 14 million rows in that table.
In general, it is certainly possible to get that error message, since we use standard Java arrays, which are limited to a maximum of 2,147,483,647 elements. Here's some background on the broader issue, included mostly for future reference: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit
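To make the limit concrete, here's a generic Java sketch (nothing Tablesaw-specific): arrays are indexed by int, and HotSpot caps the maximum array length slightly below Integer.MAX_VALUE, so a request at or above that cap fails no matter how much heap you give the JVM.
public class ArrayLimitDemo {
    public static void main(String[] args) {
        // Most VMs cap array length slightly below Integer.MAX_VALUE, so this
        // typically throws "OutOfMemoryError: Requested array size exceeds VM limit"
        // (or "Java heap space" instead, if the heap itself is smaller than ~2GB).
        byte[] tooBig = new byte[Integer.MAX_VALUE];
        System.out.println(tooBig.length);
    }
}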
Also, the size of the file you can load depends greatly on the nature of the data: for some data types the "compression" ratio versus a CSV file is much better than for others. Again, this doesn't seem to be your issue; it's more of an FYI.
Recreated issue in 0.9.1-SNAPSHOT
Maybe you can suggest a workaround? I'm currently trying to do this by hand, using a CSV reader and building the Tablesaw Table myself, but I think I'd have to specify each column type manually. Is there code somewhere in the library that guesses column types from a data sample?
All in all, the library seems extremely useful, so I'd love to use it. I'm just getting started, though.
Normally you could use CsvReader.printColumnTypes(). This prints the types as a String in the form of a Java array, which you can edit to correct any wrong guesses and then pass as a parameter to the CSV reading method. However, I think you'll hit the same bug, or another one, doing that.
I will try to have a fix on Maven by tomorrow, US time. Apologies for the inconvenience.
Sorry about that. I just pushed a fix. Do you guys want to test out the fix before we release a new version with the fix?
Yes. I added another fix for a column-type-guessing issue and am running a test now.
I just pushed one final cleanup
pulling and re-testing.
Ok, I think the current code will get the job done. The file is tricky to parse because the data is sparse, so the column type estimates aren't perfect. Here's how to address that:
First, use printColumnTypes to get the string representation of the estimated column types.
String result = CsvReader.printColumnTypes("train_ver2.csv", true, ',');
System.out.println(result);
Then edit the types to correct any mistakes. I'm not sure the version below is entirely correct, but after two corrections the file does load.
ColumnType[] columnTypes = {
LOCAL_DATE, // 0 fecha_dato
FLOAT, // 1 ncodpers
CATEGORY, // 2 ind_empleado
CATEGORY, // 3 pais_residencia
CATEGORY, // 4 sexo
CATEGORY, // 5 age
LOCAL_DATE, // 6 fecha_alta
FLOAT, // 7 ind_nuevo
CATEGORY, // 8 antiguedad
FLOAT, // 9 indrel
CATEGORY, // 10 ult_fec_cli_1t
CATEGORY, // 11 indrel_1mes
CATEGORY, // 12 tiprel_1mes
CATEGORY, // 13 indresi
CATEGORY, // 14 indext
CATEGORY, // 15 conyuemp
CATEGORY, // 16 canal_entrada
CATEGORY, // 17 indfall
FLOAT, // 18 tipodom
FLOAT, // 19 cod_prov
CATEGORY, // 20 nomprov
FLOAT, // 21 ind_actividad_cliente
FLOAT, // 22 renta
CATEGORY, // 23 segmento
SHORT_INT, // 24 ind_ahor_fin_ult1
SHORT_INT, // 25 ind_aval_fin_ult1
SHORT_INT, // 26 ind_cco_fin_ult1
SHORT_INT, // 27 ind_cder_fin_ult1
SHORT_INT, // 28 ind_cno_fin_ult1
SHORT_INT, // 29 ind_ctju_fin_ult1
SHORT_INT, // 30 ind_ctma_fin_ult1
SHORT_INT, // 31 ind_ctop_fin_ult1
SHORT_INT, // 32 ind_ctpp_fin_ult1
SHORT_INT, // 33 ind_deco_fin_ult1
SHORT_INT, // 34 ind_deme_fin_ult1
SHORT_INT, // 35 ind_dela_fin_ult1
SHORT_INT, // 36 ind_ecue_fin_ult1
SHORT_INT, // 37 ind_fond_fin_ult1
SHORT_INT, // 38 ind_hip_fin_ult1
SHORT_INT, // 39 ind_plan_fin_ult1
SHORT_INT, // 40 ind_pres_fin_ult1
SHORT_INT, // 41 ind_reca_fin_ult1
SHORT_INT, // 42 ind_tjcr_fin_ult1
SHORT_INT, // 43 ind_valo_fin_ult1
SHORT_INT, // 44 ind_viv_fin_ult1
FLOAT, // 45 ind_nomina_ult1
FLOAT, // 46 ind_nom_pens_ult1
SHORT_INT, // 47 ind_recibo_ult1
};
Then load the table using your column types:
Table table1 = Table.read().csv(CsvReadOptions.builder("train_ver2.csv").columnTypes(columnTypes));
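Once it loads, something like the following is a quick sanity check that the corrected types took effect (method names from memory, so please check them against your version's javadoc):
// Assumes table1 from the snippet above.
System.out.println(table1.rowCount() + " rows, " + table1.columnCount() + " columns");
System.out.println(table1.structure()); // column names and the resolved column types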
Ben, will you have time to deploy to Maven Central today? Thanks.
Sure. Releasing 0.9.1 now
Awesome. Thanks.
Trying to use Tablesaw 0.9.0 for a Kaggle competition, https://www.kaggle.com/c/santander-product-recommendation. When trying to load the training CSV file (2.1 GB on disk) with 12GB of heap, I get:
Table.read().csv(trainFile);
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:165)
at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:203)
at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:38)
at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:30)
at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:18)
at Test.main(Test.java:12)
It seems that Tablesaw is trying to allocate a byte array the size of the entire dataset in order to read the file. Since the file is bigger than the int limit, it cannot be loaded. The Tablesaw README states that it can successfully load a 35GB file, so I wonder if this is a regression?
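For context, here is a rough sketch of the failing pattern versus a streaming alternative. The actual CsvReader internals may differ, and the method names below are made up for illustration; only the Guava call is what appears in the stack trace.
import com.google.common.io.ByteStreams;
import java.io.*;

public class ReadPatterns {

    // Pattern that breaks for files over ~2GB: the whole stream is copied into
    // a single byte[], whose length can never exceed the int array-size limit.
    static byte[] slurpWholeFile(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return ByteStreams.toByteArray(in); // OutOfMemoryError on huge files
        }
    }

    // Streaming alternative: read line by line so memory use is bounded by a
    // single row rather than by the whole file.
    static long countLines(File file) throws IOException {
        long lines = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }
}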