thomasmoelhave / tpie

Templated Portable I/O Environment
Other
112 stars 24 forks source link

Feature request: Serialization stream #39

Closed Mortal closed 11 years ago

Mortal commented 11 years ago

We should have a new serialization framework that supports some kind of file format. I am currently implementing new serialization and a serialization_stream in the serialization branch.

It should support pipelining, stream reversing, and sorting.

Mortal commented 11 years ago

A bit more research has led me to a Stack Overflow answer on overloads/specializations across namespaces.

It seems the right way to enable serialization for user types is to provide a template specialization in the user's namespace, and for the parts of TPIE that use serialization to find the specialization with ADL.

Mortal commented 11 years ago

What should we do when opening a stream for reverse reading and finding out it was not written in reverse (or vice versa)? Ignore the open flag and use the stream order; issue a warning; throw an exception?

antialize commented 11 years ago

Perhaps only use the reverse flag when opening for writing, and document that it is only for writing.

thomasmoelhave commented 11 years ago

@antialize are you proposing basically that the stream is "written" in reverse (as far as the user knows) and then whenever that stream is opened it will automatically be read backwards (which it'll know do do by the reverse flag)? That sounds like a decent idea. Alternatively, if "reverse" needs to be specified both for reading and writing I recommend throwing an exception if there's a mismatch.

antialize commented 11 years ago

Actualy we have decided changing the interface so that we have 4 classes instead of one: serialization_writer, serialization_reader, serialization_reverse_writer, serialization_reverse_reader.

This way more errors can be caught in the type system (since a stream cannot be opened for reading and writing at the same time) and the code can be made simpler and more efficient.

Mortal commented 11 years ago

I have reconsidered the serialization stream class hierarchy. In short:

class serialization_header {
public:
    serialization_header(file_accessor::raw_file_accessor &);
    header_size(); // Reserve this many bytes in file for the header.
    // For reading a header:
    read();
    get_size();
    get_clean_close();
    verify();
    // For writing a header:
    set_size(sz); // if not set, writes size=0 in header.
    write(bool cleanClose);
};

class serialization_writer_base {
protected:
    open(path, reverse);
    write_block(buffer, bufSize); // always writes next block
    close();
};

class serialization_writer : public serialization_writer_base {
public:
    open(path);
    serialize(...);
    close();
};

(as well as a reverse writer and forwards/reverse reader.)

Mortal commented 11 years ago

@antialize Please review serialization_stream.h@7836ffcb7 and let me know what you think.

Mortal commented 11 years ago

I have implemented reverse reading and writing in 1e727a49a.

Mortal commented 11 years ago

I have implemented sorting. I have tested correctness by comparing the output with LC_ALL=C /usr/bin/sort. I still need to handle memory correctly, but at least there is an example usage in the test/sort program.

thomasmoelhave commented 11 years ago

How good is the performance of the sorting? Is it very dependent on the size of the items being serialized? If for some reason all elementes were, say, ints - how does the sorting compare to a standard file_stream?

Mortal commented 11 years ago

That's a good question - I'll look into that on Tuesday. So far I have tested scanning and reversing against file_stream, and the CPU overhead is significant when reversing, but not so in ordinary scanning, and when I use actual disks instead of a RAM disk I have not observed a difference.

I'm a bit wary of comparing the two sorters before I can make sure I only use the memory I am allotted in the new sorter, but I'll write a test program for that nonetheless.

Mortal commented 11 years ago

I have run a single speed comparison test pitting tpie::sort against the new tpie::serialization_sorter on identical 50 GB inputs. The test program outputs 21 fractiles from a stream of numbers (which it has to sort), and I'm happy to see that the two programs output the same fractiles and so are probably equally correct.

Both runs have an input scan overhead, because the input numbers are written to a stream by a different program beforehand. The two stream types are different, but the numbers are generated from the same random seed.

Both runs have an output scan overhead. For tpie::sort this is due to the API; for serialization_stream this is due to me not handling the final merge step correctly yet.

Both runs took roughly the same time: 80 minutes (tpie::sort 80:03.3, serialization_sort 80:48.8).

I will fix the output scan overhead in the serialization sorter and then run a test against the new pipelining merge_sorter which does not have this output scan overhead.

antialize commented 11 years ago

Would you create a pipeline wrapper for serialization sort?

Mortal commented 11 years ago

I have implemented evacuation for the serialization sorter. It is tested by ut-serialization_sort which uses the same test case parameters (phase 1 mem/phase 2 mem/phase 3 mem/extra working space/input size) as the ut-merge_sort test (which I wrote for the ordinary pipelining merge sorter).

Mortal commented 11 years ago

I think it is time to merge serialization into master.

Jakob and I have rebased the serialization onto the current tip of master, and the result is the serialization_rebase branch.

It includes a change to the pipelining merge sorter to handle empty input better.

This is the diffstat:

 doc/CMakeLists.txt                               |    3 +
 doc/Doxyfile.in                                  |    1 +
 doc/code/code.cpp                                |    7 +
 doc/code/serialization.inl                       |   76 ++
 doc/serialization.dox.in                         |  101 +++
 test/CMakeLists.txt                              |    8 +
 test/lines.cpp                                   |   46 ++
 test/sort.cpp                                    |   80 ++
 test/speed_regression/CMakeLists.txt             |   16 +
 test/speed_regression/fractile.h                 |  111 +++
 test/speed_regression/fractile_serialization.cpp |   65 ++
 test/speed_regression/fractile_tpie.cpp          |   53 ++
 test/speed_regression/numbergen.cpp              |  125 +++
 test/speed_regression/serialization.cpp          |  342 +++++++++
 test/unit/CMakeLists.txt                         |    6 +-
 test/unit/merge_sort.h                           |  195 +++++
 test/unit/test_merge_sort.cpp                    |  173 +----
 test/unit/test_pipelining_serialization.cpp      |  204 +++++
 test/unit/test_serialization.cpp                 |  281 ++++++-
 test/unit/test_serialization_sort.cpp            |   45 ++
 tpie/CMakeLists.txt                              |    4 +
 tpie/is_simple_iterator.h                        |   54 ++
 tpie/pipelining.h                                |    2 +
 tpie/pipelining/CMakeLists.txt                   |    1 +
 tpie/pipelining/merge_sorter.h                   |    8 +-
 tpie/pipelining/serialization.h                  |  221 ++++++
 tpie/pipelining/serialization_sort.h             |  536 +++++++++++++
 tpie/serialization2.h                            |  268 +++++++
 tpie/serialization_sort.h                        |  890 ++++++++++++++++++++++
 tpie/serialization_stream.cpp                    |  395 ++++++++++
 tpie/serialization_stream.h                      |  402 ++++++++++
 31 files changed, 4557 insertions(+), 162 deletions(-)
thomasmoelhave commented 11 years ago

That works for me, this is probably a good time to do this merge.

freekvw commented 11 years ago

Shouldn't we wait until we've finished the release?

antialize commented 11 years ago

We point to a specific commit, and if an issue should arise, we can alwayes branch from there.

freekvw commented 11 years ago

Was just thinking that's not so pretty, but yeah, whatever :-)

Mortal commented 11 years ago

Merged in 51f4da6.