jankotek / mapdb

MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap-memory. It is a fast and easy to use embedded Java database engine.
https://mapdb.org
Apache License 2.0

Out of Memory while using HTreeMap to populate a List #927

Closed JavSelAB closed 5 years ago

JavSelAB commented 5 years ago

I am trying to use HTreeMap from MapDB to populate a List with a million-plus entries read from a CSV file, but every time I insert data into the List, the previous entry is overwritten in the HTreeMap.

The only way I have found to avoid overwriting the HTreeMap entries already in the final List is to create a new DB instance (and a new HTreeMap from it) for every row, but with that approach the JVM eventually runs out of Java heap space.

Is there a clean way to use an HTreeMap to read a million-plus records and add them to a List without duplicating the data? My code is as below...

public GapList<HTreeMap<String, Object>> fn_ReadCSV_GapListHTMap(File fileCSV) {

    BufferedReader bfrdrCSVReader = null;
    String strLine = "";
    String[] arrHeaders;

    // GapList is used for collecting the data read as maps from the CSV.
    GapList<HTreeMap<String, Object>> glhtmapReadCSV = new GapList<>();
    try {

        bfrdrCSVReader = new BufferedReader(new FileReader(fileCSV));

        // Read the header of the .csv file, which by default is its first line.
        String headerLine = bfrdrCSVReader.readLine();
        arrHeaders = headerLine.split(",");

        // Use MapDB to read voluminous data from the CSV, which runs to a million-plus rows.
        DB dbReadCSV = DBMaker.memoryDB().closeOnJvmShutdown().make();

        HTreeMap<String, Object> htmapLineData = (HTreeMap<String, Object>) dbReadCSV.hashMap("htmapLineData").keySerializer(Serializer.STRING).expireMaxSize(25).createOrOpen();

        // Read each line of the .csv file.
        while ((strLine = bfrdrCSVReader.readLine()) != null) {

            //intCSVLine++;
            String[] arrTokens = strLine.split(",", -1);

            // When I used a HashMap, I reset it here after adding the read data to
            // the list, but this type of behavior can't be done for an HTreeMap.
            //Map<String, Object> mapLineData = new HashMap<>();

            // As stated in the problem statement, one needs to create a new instance of DB
            // so that the HTreeMap initialized from it points to a new memory location
            // and, when finally added to the list, doesn't duplicate the data in the list.
            dbReadCSV = DBMaker.memoryDB().closeOnJvmShutdown().make();

            htmapLineData = (HTreeMap<String, Object>) dbReadCSV.hashMap("htmapLineData").keySerializer(Serializer.STRING).expireMaxSize(25).createOrOpen();

            for (int intLineNum = 0; intLineNum < arrHeaders.length; intLineNum++) {

                // Based on the header read, read each value for that header and add it to the map.
                htmapLineData.put(arrHeaders[intLineNum].trim(), arrTokens[intLineNum].trim());
            }
            // Once a map for the line read is created, add it to the final list of entries.
            glhtmapReadCSV.add(htmapLineData);

            // The code below creates an issue wherein the entire DB connection is closed
            // and the error states it to be "com.sun.jdi.InvocationException occurred invoking method.",
            // resulting in a corrupt list of data.

            // Closing the DB to enable refreshing of the HTreeMap.
            //dbReadCSV.close();
        }

        bfrdrCSVReader.close();
    }
    catch (Exception exceptionCSVReader) {

        StringWriter stack = new StringWriter();
        exceptionCSVReader.printStackTrace(new PrintWriter(stack));
        log.debug("DEBUG: The exception while reading the CSV file is: " + stack);
        assertTrue(false, "ERROR: CSV file can't be read; hence exiting with an exception!");
    }

    return glhtmapReadCSV;
}
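For comparison, the plain on-heap version of the same loop behaves correctly because each row gets its own fresh map instance; when a single mutable map is reused, every slot of the list points at the same object, which is why earlier rows appear to be overwritten. A minimal, self-contained sketch of that point (the `RowListDemo` class name and the sample data are just for illustration, using `java.util.HashMap` in place of an HTreeMap):

```java
import java.util.*;

public class RowListDemo {

    // Parse CSV lines into a list of per-row maps.
    // Key point: a NEW map is created for each row. Reusing one mutable
    // map instance would make every list element reference the same
    // object, so later puts would appear to "overwrite" earlier rows.
    static List<Map<String, String>> parse(String[] headers, List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines) {
            String[] tokens = line.split(",", -1);
            Map<String, String> row = new HashMap<>(); // fresh map per row
            for (int i = 0; i < headers.length; i++) {
                row.put(headers[i].trim(), tokens[i].trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] headers = {"id", "name"};
        List<Map<String, String>> rows =
            parse(headers, Arrays.asList("1,alice", "2,bob"));
        // Each row keeps its own values instead of being overwritten.
        System.out.println(rows.get(0).get("name")); // alice
        System.out.println(rows.get(1).get("name")); // bob
    }
}
```

The question, then, is whether there is an equivalent "fresh instance per row" idiom for HTreeMap that does not require opening a new DB each time.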
JavSelAB commented 5 years ago

@jankotek, can you please shed some light on this behavior? Is there a clean way to reset the HTreeMap without closing or re-initializing the DB?

JavSelAB commented 5 years ago

@jankotek, any update on this behavior?