Construction process improvements

simongog / sdsl-lite

Succinct Data Structure Library 2.0

Other

2.2k stars 349 forks source link

Construction process improvements #28

Open ghost opened 11 years ago

ghost commented 11 years ago

Working with the library, I feel that I'm constrained by the construction workflow.

My understanding of the construction workflow is:

Make an instance of the desired structure, with the default constructor (which should do nothing).
Call the construct function, which (at least for WTs) (1) loads the file into an int_vector. (2) Writes that int_vector to disc. (3) Creates a new instance of the desired structure by calling its constructor with an int_vector_file_buffer. (4) Uses swap to swap the new structure with the one passed to the construct function. (5) Deletes the temporary file containing the int_vector structure.

I have multiple concerns with this workflow:

Copying the file into an int_vector and then storing this structure on disc seams rather wasteful both in terms of I/O time, but also in terms of disc space (especially if one uses succinct structures because one works on very large data sets).
The int_vector_file_buffer is not very flexible. Seek functionality and the possibility to set 'limits' could be useful. By limits I mean that one could give a start and an end offset, after which the buffer would appear to only contain the values in between.
It is only possible to affect the structure at run-time, before construction, by the use of template variables.

Simple solutions to these concerns are possible, E.g. (1) one could simply not write the int_vector to disc (as it is now, it is completely loaded into memory anyway). (2) The extension to int_vector_file_buffer is the most extensive code work of these suggestions, but should be decoupled enough to only involve the int_vector_file_buffer implementation. (3) Though affecting the library more, the library interface could rely on a "construct" function in every structure instead of a constructor. This would save memory during construction, since we wouldn't need the temporary object, and make it possible to change the parameters of the structure at run time, before the actual construction. This change should be extremely simple, but it affects many structures.

So, what are your thoughts on the suggested changes? I think they would give more flexibility, and at the same time save some resources at construction time.

mpetri commented 11 years ago

I think simon put a lot of effort into removing all the "construct()" methods from all the data structures with this new version.

I agree that int_vector_file_buffer is not very flexible. What I would want is a more configurable way (similar to the config/cache config stuff is already used for csa/cst construction) applied to all structures. I also don't like that that there is no "configureable" way to construct the lcp structures.

To me a "configurable" way to construct any of the structures with a "reasonable" default configuration (say space efficient put as much as possible on disk) and an easy way to construct everything in memory. Making things configurable on the other hand makes using the library more complicated especially for new users.

simongog commented 11 years ago

Hi Jacob,

the goal of the construct method is providing an easy and flexible interface for the end user and is -- as you stated -- not optimal. Let me explain it.

Step (1) which loads the file into and int_vector is the normalization step, since here you can specify if you input sequence of bytes, 32-bit words,...
Step (2) stores this normalized data to disk OR in memory (see construct_im / ram_fs). I agree that this is an unnecessary step, if the original data is already in int_vector format and this could be included easily ;)
Step (3): Each class has exactly one constructor which takes data (in case of the WTs passed by an int_vector_file_buffer). The goal of this is to simplify code maintenance. I also agree that one could extend the functionality of the int_vector_file_buffer and your suggestions are very welcome. The reason why seeking and so on was not added is that I had not yet have an use case for it.
Step (4): Here I do not yet understand what you want to optimize. The swap operation is a constant time and constant size operation (only pointers are copied) and used in popular C++ ideoms (like the copy and swap ideom).

About your third concern in the list: Did you meant that it is only possible to affect the structures at compile-time?

ghost commented 11 years ago

I must admit that I haven't completely understood the int_vector. What does it offer more than ease of indexing other word sizes than a byte? And when you say "is already in int_vector format" what is meant? <-- Leave it at that, if It is a long explanation. I should look thoroughly at the code then :)

I agree that the swap method does not take much memory, when the structures use pointers to the heap. What if a structure initializes large static (based on template variables) arrays (not pointers to the heap)? Is this an unlikely scenario?

My use for the extra functionality in int_vector_file_buffer is when I want to construct each block of the wt_fixed_block. If I could restrict the int_vector_file_buffer temporarily so that it appears to only contain a sub-string of its real size, I could simply pass it to the block wts. Right now I work around by creating temporary files for each block. I'll possibly look into this enhancement then if there are no objections.

So the only real question left; It's only possible to affect the structures at compile-time by using template variables. Since the constructor is fixed by the interface, and the construct function uses the swap method, there is no way to affect the object (e.g. setting the block size) at run-time before the structure has been constructed (at which time it is too late to set the block size). Would't it be useful if there were some standard approach in the construction flow, that would allow parameters of the structure to be set at run-time but before the structure is constructed?

simongog commented 11 years ago

I'll come back to your first paragraph later. About swap: No structure in the library has large static arrays as members but only a constant number of pointers or members which are also library structures. So you would have to generate this scenario.

You can just use the in-memory construction (construct_im) for each block. This will create a in-memory file (have a look in ram_fs.hpp and construct.hpp) and does not interact with disk at all.

About the compile-time issue. The block-size parameter has to a template-parameter. Yes, I know at the first glance this is not so handy sometimes. For example for testing and benchmarking with different parameters, since you have to write a script which compiles the program with different parameter values and so on. But have a look at the benchmark code, it is not so hard to do. So why do we have to do this? Well, if you only think of your class than your solution makes perfectly sense. However we will have problems in the following scenario: Someone wants to use two WTs in a more complex class. He specifies the type of the WTs as template parameter. The question is now: How does the signature of the constructor of this class look like? Well, it depends on the signatur of the WT type's constructor. So we need one common interface for all WTs and the only solution I can think of is to pass the block value as template parameter.... Would be happy, if there is a more flexible solution for this...

ghost commented 11 years ago

I couldn't see how ram_fs would work, but after a long time i noticed that it was osfstream, isfstream and not ofstream, ifstream....

Yes, I can definitely benefit from using stringbuf, and I will, thank you! (the enhancements of int_vector_file_buffer will of course still save a lot of memory though).

About initialization: Yes the interface would have to be the same (the scenario you describe is exactly wt_fixed_block), and from mpetris answer I understand that in was a design choice to use constructors and swap as opposed to a construct function in each class. So with this solution, the only way I can think of is to let the construct take a config parameter, which it forwards to the construct of the structure. This parameter would either be a class specific to the structure, inherited from a top "config" type (for type checking), or it could be boxed which woudn't be as pretty, but maybe more efficient?? It should ofc. be an optional parameter, so that the 'cleanness' from the construct function isn't broken.