sunandap / airhead-research

Automatically exported from code.google.com/p/airhead-research

Optimization SparseBinary-Loading #96

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

when loading large SparseBinaryVector files, the profiler shows a very large number of 
calls to RandomAccessFile.read(). The reason is that RandomAccessFile.readInt() results 
in 4 native read() calls and RandomAccessFile.readDouble() in 8.

In OnDiskSemanticSpace.loadSparseBinaryOffsets() both read methods are used to seek 
over the vectors.

In loadSparseBinaryVector() each dimension is loaded with its own read call.

I have attached a patch that optimizes this by using a ByteBuffer.
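
For readers without the attachment, here is a minimal sketch of the general idea (not 
the actual patch): pull a vector's bytes into memory with one bulk read and decode them 
from a ByteBuffer, instead of issuing one readInt()/readDouble() per value. The record 
layout assumed below (a count followed by index/value pairs) is only for illustration 
and is not the exact .sspace format.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class BulkVectorRead {

    // Reads one vector record with a single bulk read and decodes it in memory.
    // Assumed layout: an int count, then count (int index, double value) pairs.
    static void readVector(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        int nonZeros = raf.readInt();               // one small read for the count
        byte[] raw = new byte[nonZeros * (4 + 8)];  // 4 bytes per int, 8 per double
        raf.readFully(raw);                         // single bulk read from disk
        ByteBuffer buf = ByteBuffer.wrap(raw);      // decode in memory, no further I/O
        for (int i = 0; i < nonZeros; i++) {
            int index = buf.getInt();
            double value = buf.getDouble();
            // ... store index/value in the in-memory vector ...
        }
    }
}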

Additionally, I attached a patch for build.xml to avoid the following warning when 
building sspace on Windows:

"warning: unmappable character for encoding Cp1252"

Original issue reported on code.google.com by keepal...@gmail.com on 21 Jun 2011 at 11:02

Attachments:

GoogleCodeExporter commented 9 years ago
This looks great!  Any idea how much of a speed-up you see from switching to a 
ByteBuffer?

Also, I'm having trouble telling what the difference is in the build.xml.  There's 
no patch information.  That Windows build warning certainly is annoying, so I would 
definitely like to have it suppressed.

Original comment by David.Ju...@gmail.com on 22 Jun 2011 at 8:47

GoogleCodeExporter commented 9 years ago
Benchmarking loadSparseBinaryOffsets():

new CachingOnDiskSemanticSpace("700mb.sspace");

786747ms -> 6797ms
115 times faster

Benchmarking loadSparseBinaryVector() on the same file, excluding the initial loading:

for (Iterator<String> iterator = sspace.getWords().iterator(); iterator.hasNext(); ) {
    String word = iterator.next();
    sspace.getVector(word);
}

813002ms -> 72718ms
11 times faster
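
For context, these numbers were presumably collected with a simple wall-clock harness 
along the following lines; the actual measurement code was not posted, and the import 
paths are assumptions about the S-Space package layout.

import edu.ucla.sspace.common.SemanticSpace;               // assumed package path
import edu.ucla.sspace.common.CachingOnDiskSemanticSpace;  // assumed package path

public class LoadBenchmark {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        SemanticSpace sspace = new CachingOnDiskSemanticSpace("700mb.sspace");
        System.out.println("offset loading: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        for (String word : sspace.getWords())
            sspace.getVector(word);
        System.out.println("vector loading: " + (System.currentTimeMillis() - start) + "ms");
    }
}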

Original comment by keepal...@gmail.com on 22 Jun 2011 at 12:13

Attachments: