simongog / sdsl-lite

Succinct Data Structure Library 2.0
Other
2.21k stars 350 forks source link

Construction in RAM #20

Closed simongog closed 11 years ago

simongog commented 11 years ago

Most temporary results during the construction process are stored and reread from disk. This reduces the memory footprint of the construction. However for small inputs, reading and writing to disk produces an overhead compared to in-memory processing.

I would like to solve this by implementing a simple RAM-file system. It would be necessary to encapsulate the file IO operations to be able to use the RAM-file system. RAM-files should start with a prefix like "RAM://"...

mpetri commented 11 years ago

would it not be easier to just "read"/"write" everything to an c++ iostream that is "backed" by a buffer instead of a real file? then the temp_write_read_buffer and other results could be written to file or that "fake file buffer" without changing the code much? in the construction methods you could depending on the input size create a real temp file or a iostream backed by a buffer.

simongog commented 11 years ago

Yes, we have to exchange the iostreams. But this alone does not solve the problem. Since in the current process information is written out to files with specific names. sa, lcp . These files identified with the specific names are then used again to build other stuff. So, I don't see how you can do this only by using streams.

tb38 commented 11 years ago

Although the idea is nice, I doubt that this effort is worth. I think it doesn't matter, if construction takes 1 or 2 seconds - but if it takes 1 h or 1 day. But with RAM-file one would probably only have an improvement on the first, right?

mpetri commented 11 years ago

I think the main idea would be if somebody specifies ram://sa_12313 as the destination of where the sa is saved, it is kept in ram instead. If you have a powerful machine (lots of ram) it could take 20mins to write the sa to some network drive even if you have enough memory available.

simongog commented 11 years ago

Ok, the implementation is almost finished. It will work as follow: Files which name start with @ will be loaded/save to/from RAM. So you can still use store_to_file, load_from_file and so on. There are two new streams

which are used to realize the handling. They are inherited from istream and ostream and have the same interface as ifstream and ofstream. (So I could easily replace the old against the new streams). Internally they have two buffers: a stringbuf and a filebuf, the first is selected for RAM files the second for disk files.

The only other change was that std::remove, which deletes files on disk, was replaced by sdsl::remove which will delete files on disk or RAM.

@tb38 : The in-memory construction is handy in the following scenario. Think you want to construct 100,000 CSAs of size 1000 from one concatenated sequence. Like in the document retrieval framework of Sadakane. Then the several disk seeks off the current construction would dominate the runtime.

simongog commented 11 years ago

I have to change the signature of the constructors of the compressed LCP arrays. They no longer take int_vector_file_buffers but cache_configs. This makes it possible to decide where temporary files during construction should be placed. The same temporary file issue occurs in some wavelet trees.

tb38 commented 11 years ago

But construct_lcp methods use already cache_configs?

simongog commented 11 years ago

I have not tested it yet intensively. The old tests passed and the added example works fine for me.