heydavid525 / hotbox

BSD 3-Clause "New" or "Revised" License

Read multiple times on each atom file #6

Open whymoon opened 8 years ago

whymoon commented 8 years ago

Atom file 0 is read 11 times (log below), and the same happens for the other atom files. I think the entire atom file should be read in one pass to reduce the I/O cost.

data_idx: 0. file_begin: 0. fileend: 6381536. Length: 6381536. next: 0.
I1117 10:50:40.933832 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:40.933848 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:40.962638 5553 data_iterator.cpp:95] File Read: 6381536
I1117 10:50:41.118170 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 43307
I1117 10:50:41.396575 5553 data_iterator.cpp:44] Chunk Info: [0 - 43307)
I1117 10:50:41.396630 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:50:43.710700 5553 data_iterator.cpp:33] data_idx: 1. file_begin: 6381536. fileend: 11576615. Length: 5195079. next: 43307.
I1117 10:50:43.710754 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:43.710765 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:43.736438 5553 data_iterator.cpp:95] File Read: 5195079
I1117 10:50:43.865164 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 36049
I1117 10:50:44.108989 5553 data_iterator.cpp:44] Chunk Info: [43307 - 79356)
I1117 10:50:44.109045 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:50:46.029036 5553 data_iterator.cpp:33] data_idx: 2. file_begin: 11576615. fileend: 17018504. Length: 5441889. next: 79356.
I1117 10:50:46.029084 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:46.029096 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:46.059459 5553 data_iterator.cpp:95] File Read: 5441889
I1117 10:50:46.190789 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 38409
I1117 10:50:46.454730 5553 data_iterator.cpp:44] Chunk Info: [79356 - 117765)
I1117 10:50:46.454782 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:50:48.496186 5553 data_iterator.cpp:33] data_idx: 3. file_begin: 17018504. fileend: 22504733. Length: 5486229. next: 117765.
I1117 10:50:48.496251 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:48.496263 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:48.527498 5553 data_iterator.cpp:95] File Read: 5486229
I1117 10:50:48.660548 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 38182
I1117 10:50:48.928453 5553 data_iterator.cpp:44] Chunk Info: [117765 - 155947)
I1117 10:50:48.928508 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:50:51.010025 5553 data_iterator.cpp:33] data_idx: 4. file_begin: 22504733. fileend: 28897819. Length: 6393086. next: 155947.
I1117 10:50:51.010069 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:51.010082 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:51.048132 5553 data_iterator.cpp:95] File Read: 6393086
I1117 10:50:51.205715 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 44810
I1117 10:50:51.519893 5553 data_iterator.cpp:44] Chunk Info: [155947 - 200757)
I1117 10:50:51.519948 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:50:53.923291 5553 data_iterator.cpp:33] data_idx: 5. file_begin: 28897819. fileend: 34780078. Length: 5882259. next: 200757.
I1117 10:50:53.923352 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:53.923363 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:53.960167 5553 data_iterator.cpp:95] File Read: 5882259
I1117 10:50:54.105048 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 41363
I1117 10:50:54.398797 5553 data_iterator.cpp:44] Chunk Info: [200757 - 242120)
I1117 10:50:54.398851 5553 data_iterator.cpp:45] -------------------------------------
num data 2396130data_slice_len: 226398735
I1117 10:50:56.881144 5553 data_iterator.cpp:33] data_idx: 6. file_begin: 34780078. fileend: 41353084. Length: 6573006. next: 242120.
I1117 10:50:56.881240 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:50:56.881268 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:50:56.951113 5553 data_iterator.cpp:95] File Read: 6573006
I1117 10:50:57.181300 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 46759
I1117 10:50:57.587462 5553 data_iterator.cpp:44] Chunk Info: [242120 - 288879)
I1117 10:50:57.587520 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:51:00.069231 5553 data_iterator.cpp:33] data_idx: 7. file_begin: 41353084. fileend: 47591873. Length: 6238789. next: 288879.
I1117 10:51:00.069273 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:51:00.069284 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:51:00.113456 5553 data_iterator.cpp:95] File Read: 6238789
I1117 10:51:00.284709 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 44501
I1117 10:51:00.636641 5553 data_iterator.cpp:44] Chunk Info: [288879 - 333380)
I1117 10:51:00.636705 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:51:03.054677 5553 data_iterator.cpp:33] data_idx: 8. file_begin: 47591873. fileend: 52650239. Length: 5058366. next: 333380.
I1117 10:51:03.054734 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:51:03.054745 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:51:03.100880 5553 data_iterator.cpp:95] File Read: 5058366
I1117 10:51:03.243819 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 36049
I1117 10:51:03.530462 5553 data_iterator.cpp:44] Chunk Info: [333380 - 369429)
I1117 10:51:03.530524 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:51:05.465898 5553 data_iterator.cpp:33] data_idx: 9. file_begin: 52650239. fileend: 58996233. Length: 6345994. next: 369429.
I1117 10:51:05.465951 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:51:05.465962 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:51:05.522012 5553 data_iterator.cpp:95] File Read: 6345994
I1117 10:51:05.699026 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 46091
I1117 10:51:06.057494 5553 data_iterator.cpp:44] Chunk Info: [369429 - 415520)
I1117 10:51:06.057560 5553 data_iterator.cpp:45] -------------------------------------
I1117 10:51:08.556501 5553 data_iterator.cpp:33] data_idx: 10. file_begin: 58996233. fileend: 65843202. Length: 6846969. next: 415520.
I1117 10:51:08.556553 5553 data_iterator.cpp:55] Which Atom: [0 - 0].
I1117 10:51:08.556565 5553 data_iterator.cpp:59] Reading atom file 0
I1117 10:51:08.608809 5553 data_iterator.cpp:95] File Read: 6846969
I1117 10:51:08.792409 5553 data_iterator.cpp:104] atom_proto.datum_protos_size(): 47100
I1117 10:51:09.170083 5553 data_iterator.cpp:44] Chunk Info: [415520 - 462620)
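The eleven ranges in the log are back to back: each file_begin equals the previous fileend, from 0 up to 65843202. A minimal sketch of how such requests could be merged into a single read is below. `ByteRange` and `Coalesce` are hypothetical names for illustration and are not taken from the hotbox code; the sketch only shows the range arithmetic, not how the merged buffer would be handed to the decompressor.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical type: a half-open byte range [begin, end).
struct ByteRange {
  std::int64_t begin;
  std::int64_t end;
};

// Merge ranges that touch or overlap, so each merged range can be
// fetched with one read instead of one read per chunk.
std::vector<ByteRange> Coalesce(std::vector<ByteRange> ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.begin < b.begin;
            });
  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (!merged.empty() && r.begin <= merged.back().end) {
      // Contiguous or overlapping with the previous range: extend it.
      merged.back().end = std::max(merged.back().end, r.end);
    } else {
      merged.push_back(r);
    }
  }
  return merged;
}
```

Fed the ranges for data_idx 0 to 10 from the log, this yields one range covering the first 65843202 bytes.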

holyglenn commented 8 years ago

The current implementation seeks to the chunk's position, and seeking is fairly cheap. It is also optimized with Zero_Copy_Stream: the main benefit is that the data is passed directly to the compressor library for decompression without first creating a string, hence no copy.
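For readers unfamiliar with the pattern, here is a rough sketch of "seek, then read only the requested range" using standard C++ streams. `ReadRange` is a hypothetical helper, not the code in data_iterator.cpp, and it reads into a plain buffer rather than through a zero-copy stream, so it illustrates the seek step only.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read bytes [begin, end) of the file at `path`.
// Returns fewer bytes (possibly none) if the file is shorter or cannot
// be opened.
std::vector<char> ReadRange(const std::string& path,
                            std::streamoff begin, std::streamoff end) {
  std::vector<char> out;
  if (end <= begin) return out;
  std::ifstream in(path, std::ios::binary);
  if (!in) return out;
  in.seekg(begin);  // jump to the chunk start without reading earlier bytes
  out.resize(static_cast<std::size_t>(end - begin));
  in.read(out.data(), static_cast<std::streamsize>(out.size()));
  out.resize(static_cast<std::size_t>(in.gcount()));  // keep what was read
  return out;
}
```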

Reading the file once and creating substrings would necessarily copy the data at least once, and possibly twice: once to build the string() itself, plus another allocation and copy for each substr().
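The substr() cost can be seen in a small sketch. `CopyChunk` and `ViewChunk` are hypothetical helpers: the first does what the paragraph above describes, the second uses C++17 `std::string_view`, which the thread does not mention and which is shown only for contrast. Whether the decompression path in this project could consume such a view is an open question here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: owning copy of bytes [begin, end) of `buffer`.
// std::string::substr builds a new string and copies the bytes into it.
std::string CopyChunk(const std::string& buffer,
                      std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= buffer.size());
  return buffer.substr(begin, end - begin);
}

// Hypothetical helper: non-owning view of the same bytes. Nothing is
// copied, but the view is only valid while `buffer` stays alive and
// unmodified.
std::string_view ViewChunk(const std::string& buffer,
                           std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= buffer.size());
  return std::string_view(buffer).substr(begin, end - begin);
}
```

The view points into the original buffer, so sharing it across tasks adds a lifetime constraint that the copy does not have.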

What's more, the application logic would become more complicated and bug-prone.

My idea is to leave it as is and see how the experiments turn out.

zhangyy91 commented 8 years ago

To optimize I/O, I would have to group the data_idx ranges that fall within one atom file (maybe two, if a data range crosses two atom files), read the atom file into memory as a shared buffer, and distribute it to the different Transform Tasks, where each task decompresses its part of the buffer and does the transform. That is very complicated, and I am still debugging the multi-threaded transformer.

And if a data range spans two atom files, we get two buffers and have to put them together, which involves a memory copy.

One suggestion is to make atom files independent of each other. Since we will use HDFS, we should also keep the atom file size less than or equal to the HDFS block size to avoid wasting space.
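To make the two-atom case concrete, the sketch below splits a global byte range into per-atom pieces. It assumes every atom file has the same fixed size `atom_size`, which the thread does not state; `AtomPiece` and `SplitByAtom` are hypothetical names, not hotbox APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type: `length` bytes starting at `offset` inside atom `atom_id`.
struct AtomPiece {
  std::int64_t atom_id;
  std::int64_t offset;
  std::int64_t length;
};

// Split the global range [begin, end) into pieces that each lie inside
// one atom file. Assumes all atom files are exactly `atom_size` bytes.
std::vector<AtomPiece> SplitByAtom(std::int64_t begin, std::int64_t end,
                                   std::int64_t atom_size) {
  assert(atom_size > 0 && begin >= 0 && begin <= end);
  std::vector<AtomPiece> pieces;
  while (begin < end) {
    const std::int64_t atom_id = begin / atom_size;
    const std::int64_t offset = begin % atom_size;
    // Stop at the end of the range or the end of this atom file.
    const std::int64_t length = std::min(end - begin, atom_size - offset);
    pieces.push_back({atom_id, offset, length});
    begin += length;
  }
  return pieces;
}
```

A range that yields two pieces is the case described above: the two buffers come from different files and must be joined before they can be treated as one chunk.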