Add ability for COPY to read directly from gzip input files

billmaimone commented 7 years ago

Instead of requiring zip'd files to be decompressed before loading to COPY streamline the data loading process by reading directly from a gzip'd file. This would be more convenient, not take up intermediate storage for the expanded files. Possibly there's opportunity/benefit to doing this in parallel. Other notes:

Taking the Boost route would also make it trivial to add bzip2 support: http://www.boost.org/doc/libs/1_58_0/libs/iostreams/doc/classes/bzip2.html Looks like neither pigz nor pbzip2 are currently available as libraries, so it would be a little bit of work to support parallel decompress. Might not be worth it, would have to find a good balance between decompress speed and mem usage.
It'd be helpful to log aborted files from the set in such a way that they could be handled individually.

tmostak commented 7 years ago

Contributions from the community are welcome on this task!

fleapapa commented 7 years ago

pigz seems not supporting parallel decompression:

Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose.

Neither does pigz2, unless...

Files that are compressed with pbzip2 will also gain considerable speedup when decompressed using pbzip2.

Files that were compressed using bzip2 will not see speedup since bzip2 packages the data into a single chunk that cannot be split between processors.

fleapapa commented 7 years ago

If parallelized decompression is not required, a patch below may be an option. In my test, it seems working with .gz and .bz2 files. I tried to reuse as much existing code (of Importer::importDelimited) as possible.

$ git status|grep modi|grep Import|awk '{print $2}'|xargs git diff
diff --git a/Import/CMakeLists.txt b/Import/CMakeLists.txt
old mode 100644
new mode 100755
index b56aa23..e250c10
--- a/Import/CMakeLists.txt
+++ b/Import/CMakeLists.txt
@@ -1,5 +1,5 @@
 add_library(CsvImport Importer.cpp Importer.h)
-target_link_libraries(CsvImport mapd_thrift Shared Catalog Chunk DataMgr StringDictionary poly2tri ${GDAL_LIBRARIES} ${Glog_LIBRARIES} ${CMAKE_DL_LIBS} ${Arrow_LIBRARIES})
+target_link_libraries(CsvImport mapd_thrift Shared Catalog Chunk DataMgr StringDictionary poly2tri ${GDAL_LIBRARIES} ${Glog_LIBRARIES} ${CMAKE_DL_LIBS} ${Arrow_LIBRARIES} boost_iostreams)

 install(DIRECTORY ${CMAKE_SOURCE_DIR}/ThirdParty/gdal-data DESTINATION "ThirdParty")
 add_custom_target(gdal-data ALL COMMAND ${CMAKE_COMMAND} -E copy_directory "${CMAKE_SOURCE_DIR}/ThirdParty/gdal-data" "${CMAKE_BINARY_DIR}/ThirdParty/gdal-data")
diff --git a/Import/Importer.cpp b/Import/Importer.cpp
old mode 100644
new mode 100755
index 44bddee..23bb737
--- a/Import/Importer.cpp
+++ b/Import/Importer.cpp
@@ -52,6 +52,13 @@

 #include <arrow/api.h>

+//ppan+ allow COPY from gzip decompressed files
+#include <boost/iostreams/filtering_streambuf.hpp>
+#include <boost/iostreams/copy.hpp>
+#include <boost/iostreams/filter/gzip.hpp>
+#include <boost/iostreams/filter/bzip2.hpp>
+#include "tbb/tbb.h"
+
 using std::ostream;

 namespace Importer_NS {
@@ -1800,28 +1807,89 @@ void Importer::load(const std::vector<std::unique_ptr<TypedImportBuffer>>& impor

 ImportStatus Importer::import() {
   // TODO(andrew): add file type detection
-  return importDelimited();
+
+  //ppan+ allow copy from .gz and .bz2
+  using namespace std;
+  using namespace boost::iostreams;
+  if (boost::filesystem::extension(file_path) == ".gz" || boost::filesystem::extension(file_path) == ".bz2")
+  {
+    // create a decompression filter
+    ifstream file(file_path, ios_base::in | ios_base::binary);
+    filtering_streambuf<input> in;
+
+    if (boost::filesystem::extension(file_path) == ".bz2")
+      in.push(bzip2_decompressor());
+    else
+      in.push(gzip_decompressor());
+
+    in.push(file);
+
+    // create a named pipe to write decompressed output
+    std::stringstream ss;
+    ss << std::this_thread::get_id();
+    string pipe_path = string("/tmp/mapdp_") + ss.str();
+
+    unlink(pipe_path.data());
+    if (mkfifo(pipe_path.data(), 0660) < 0)
+      throw std::runtime_error("mkfifo failure for '" + pipe_path + "': " + strerror(errno));
+
+    ImportStatus ret;
+    tbb::parallel_for(0, 2, [&](const int ith)
+    {
+      try
+      {
+        if (0 == ith)
+        {
+          // reader
+          if (0 == (p_file = fopen(pipe_path.data(), "rb")))
+            throw std::runtime_error("fopen failure for '" + pipe_path + "': " + strerror(errno));
+
+          ret = importDelimited(pipe_path, true);
+        }
+        else
+        {
+          // writer end
+          ofstream out(pipe_path, ofstream::binary);
+          boost::iostreams::copy(in, out);
+        }
+
+      }
+      catch (...)
+      {}
+    });
+
+    if (p_file) fclose(p_file);
+    unlink(pipe_path.data());
+    return ret;
+  }
+
+  return importDelimited(file_path);
 }

 #define IMPORT_FILE_BUFFER_SIZE 100000000  // 100M file size
 #define MIN_FILE_BUFFER_SIZE 50000         // 50K min buffer

-ImportStatus Importer::importDelimited() {
+ImportStatus Importer::importDelimited(const std::string& file_path, const bool decompressed) {
   bool load_truncated = false;
   set_import_status(import_id, import_status);
-  p_file = fopen(file_path.c_str(), "rb");
+  if (!p_file) p_file = fopen(file_path.c_str(), "rb");
   if (!p_file) {
     throw std::runtime_error("fopen failure for '" + file_path + "': " + strerror(errno));
   }
-  (void)fseek(p_file, 0, SEEK_END);
-  file_size = ftell(p_file);
+
+  if (!decompressed)
+  {
+    (void)fseek(p_file, 0, SEEK_END);
+    file_size = ftell(p_file);
+  }
+
   if (copy_params.threads == 0)
     max_threads = sysconf(_SC_NPROCESSORS_CONF);
   else
     max_threads = copy_params.threads;
   // deal with small files
   size_t alloc_size = IMPORT_FILE_BUFFER_SIZE;
-  if (file_size < alloc_size) {
+  if (!decompressed && file_size < alloc_size) {
     alloc_size = file_size;
   }
   buffer[0] = (char*)checked_malloc(alloc_size);
@@ -1856,14 +1924,21 @@ ImportStatus Importer::importDelimited() {
       end_pos = size;
     else
       end_pos = find_end(buffer[which_buf], size, copy_params);
+
+    // unput residual
+    int nresidual = size - end_pos;
+    std::unique_ptr<char> unbuf(nresidual > 0? new char[nresidual]: nullptr);
+    if (unbuf) memcpy(unbuf.get(), buffer[which_buf] + end_pos, nresidual);
+
     if (size <= alloc_size) {
       max_threads = std::min(max_threads, (int)std::ceil((double)(end_pos - begin_pos) / MIN_FILE_BUFFER_SIZE));
     }
     if (max_threads == 1) {
       import_status += import_thread(0, this, buffer[which_buf], begin_pos, end_pos, end_pos);
       current_pos += end_pos;
-      (void)fseek(p_file, current_pos, SEEK_SET);
-      size = fread((void*)buffer[which_buf], 1, IMPORT_FILE_BUFFER_SIZE, p_file);
+//      if (!decompressed) (void)fseek(p_file, current_pos, SEEK_SET);
+      memcpy(buffer[which_buf], unbuf.get(), nresidual);
+      size = nresidual + fread(buffer[which_buf] + nresidual, 1, IMPORT_FILE_BUFFER_SIZE - nresidual, p_file);
       if (size < IMPORT_FILE_BUFFER_SIZE && feof(p_file))
         eof_reached = true;
     } else {
@@ -1876,8 +1951,9 @@ ImportStatus Importer::importDelimited() {
       }
       current_pos += end_pos;
       which_buf = (which_buf + 1) % 2;
-      (void)fseek(p_file, current_pos, SEEK_SET);
-      size = fread((void*)buffer[which_buf], 1, IMPORT_FILE_BUFFER_SIZE, p_file);
+//      if (!decompressed) (void)fseek(p_file, current_pos, SEEK_SET);
+      memcpy(buffer[which_buf], unbuf.get(), nresidual);
+      size = nresidual + fread(buffer[which_buf] + nresidual, 1, IMPORT_FILE_BUFFER_SIZE - nresidual, p_file);
       if (size < IMPORT_FILE_BUFFER_SIZE && feof(p_file))
         eof_reached = true;
       for (auto& p : threads)
diff --git a/Import/Importer.h b/Import/Importer.h
old mode 100644
new mode 100755
index 59e3df0..2d22d36
--- a/Import/Importer.h
+++ b/Import/Importer.h
@@ -677,7 +677,7 @@ class Importer {
   Importer(Loader* providedLoader, const std::string& f, const CopyParams& p);
   ~Importer();
   ImportStatus import();
-  ImportStatus importDelimited();
+  ImportStatus importDelimited(const std::string& file_path, const bool decompressed = false);
   ImportStatus importShapefile();
   ImportStatus importGDAL(std::map<std::string, std::string> colname_to_src);
   const CopyParams& get_copy_params() const { return copy_params; }

fleapapa commented 7 years ago

[Update] The path below is much simpler (w/o booststreamio and intel tbb) but should allow parallelized decompression if the source file is prepared with parallelism in mind.

COPY time of 10M rows is not better from parallelism because bottleneck on my VM is the slow HDD.

$ time echo "copy trips from '/mnt/tmp/4.bz2' with (header='false', max_reject=100);" | /mnt/mapd-core/bin/mapdql mapd     -u $MAPD_USERNAME     -p $MAPD_PASSWORD
User mapd connected to database mapd
Result
Loaded: 10000000 recs, Rejected: 0 recs in 74.348000 secs
User mapd disconnected from database mapd

real    1m14.554s
user    0m0.004s
sys 0m0.000s

Patch:

$ git diff Import/Importer.cpp
diff --git a/Import/Importer.cpp b/Import/Importer.cpp
old mode 100644
new mode 100755
index 44bddee..fb47aef
--- a/Import/Importer.cpp
+++ b/Import/Importer.cpp
@@ -1800,28 +1800,47 @@ void Importer::load(const std::vector<std::unique_ptr<TypedImportBuffer>>& impor

 ImportStatus Importer::import() {
   // TODO(andrew): add file type detection
-  return importDelimited();
+
+  //ppan+ allow copy from .gz and .bz2
+  if (boost::filesystem::extension(file_path) == ".gz" || boost::filesystem::extension(file_path) == ".bz2")
+  {
+    using namespace std;
+    string cmd = boost::filesystem::extension(file_path) == ".bz2"? "pbzip2": "pigz";
+    cmd += " -d -c ";
+    cmd += file_path;
+    if(0 == (p_file = popen(cmd.data(), "r")))
+      throw std::runtime_error(cmd + " failure: " + strerror(errno));
+
+    return importDelimited(file_path, true);
+  }
+
+  return importDelimited(file_path);
 }

 #define IMPORT_FILE_BUFFER_SIZE 100000000  // 100M file size
 #define MIN_FILE_BUFFER_SIZE 50000         // 50K min buffer

-ImportStatus Importer::importDelimited() {
+ImportStatus Importer::importDelimited(const std::string& file_path, const bool decompressed) {
   bool load_truncated = false;
   set_import_status(import_id, import_status);
-  p_file = fopen(file_path.c_str(), "rb");
+  if (!p_file) p_file = fopen(file_path.c_str(), "rb");
   if (!p_file) {
     throw std::runtime_error("fopen failure for '" + file_path + "': " + strerror(errno));
   }
-  (void)fseek(p_file, 0, SEEK_END);
-  file_size = ftell(p_file);
+
+  if (!decompressed)
+  {
+    (void)fseek(p_file, 0, SEEK_END);
+    file_size = ftell(p_file);
+  }
+
   if (copy_params.threads == 0)
     max_threads = sysconf(_SC_NPROCESSORS_CONF);
   else
     max_threads = copy_params.threads;
   // deal with small files
   size_t alloc_size = IMPORT_FILE_BUFFER_SIZE;
-  if (file_size < alloc_size) {
+  if (!decompressed && file_size < alloc_size) {
     alloc_size = file_size;
   }
   buffer[0] = (char*)checked_malloc(alloc_size);
@@ -1856,14 +1875,21 @@ ImportStatus Importer::importDelimited() {
       end_pos = size;
     else
       end_pos = find_end(buffer[which_buf], size, copy_params);
+
+    // unput residual
+    int nresidual = size - end_pos;
+    std::unique_ptr<char> unbuf(nresidual > 0? new char[nresidual]: nullptr);
+    if (unbuf) memcpy(unbuf.get(), buffer[which_buf] + end_pos, nresidual);
+
     if (size <= alloc_size) {
       max_threads = std::min(max_threads, (int)std::ceil((double)(end_pos - begin_pos) / MIN_FILE_BUFFER_SIZE));
     }
     if (max_threads == 1) {
       import_status += import_thread(0, this, buffer[which_buf], begin_pos, end_pos, end_pos);
       current_pos += end_pos;
-      (void)fseek(p_file, current_pos, SEEK_SET);
-      size = fread((void*)buffer[which_buf], 1, IMPORT_FILE_BUFFER_SIZE, p_file);
+//      if (!decompressed) (void)fseek(p_file, current_pos, SEEK_SET);
+      memcpy(buffer[which_buf], unbuf.get(), nresidual);
+      size = nresidual + fread(buffer[which_buf] + nresidual, 1, IMPORT_FILE_BUFFER_SIZE - nresidual, p_file);
       if (size < IMPORT_FILE_BUFFER_SIZE && feof(p_file))
         eof_reached = true;
     } else {
@@ -1876,8 +1902,9 @@ ImportStatus Importer::importDelimited() {
       }
       current_pos += end_pos;
       which_buf = (which_buf + 1) % 2;
-      (void)fseek(p_file, current_pos, SEEK_SET);
-      size = fread((void*)buffer[which_buf], 1, IMPORT_FILE_BUFFER_SIZE, p_file);
+//      if (!decompressed) (void)fseek(p_file, current_pos, SEEK_SET);
+      memcpy(buffer[which_buf], unbuf.get(), nresidual);
+      size = nresidual + fread(buffer[which_buf] + nresidual, 1, IMPORT_FILE_BUFFER_SIZE - nresidual, p_file);
       if (size < IMPORT_FILE_BUFFER_SIZE && feof(p_file))
         eof_reached = true;
       for (auto& p : threads)
@@ -1916,7 +1943,10 @@ ImportStatus Importer::importDelimited() {
   buffer[0] = nullptr;
   free(buffer[1]);
   buffer[1] = nullptr;
-  fclose(p_file);
+  if (decompressed)
+    pclose(p_file);
+  else
+    fclose(p_file);
   p_file = nullptr;

   import_status.load_truncated = load_truncated;

dwayneberry commented 7 years ago

@fleapapa thanks for your effort here.

The patch seemed incomplete, and did not compile. I added code for the headers and created this patch

0001-Use-pigz-or-pbzip2-as-options-on-the-copy.patch.gz

It seems that if you do not have pigz of pbzip2 installed it just quietly fails and reports no message. Could you add an exception process to detect if the piped applications are not available?

fleapapa commented 7 years ago

@dwayneberry I'll add the exception process. Originally i thought pigz or pbzip2 should be added to the script/mapd-deps-xxx.sh to assure its existence. However, the script seems for development environment only. The two utilities may not exist on deployed hosts. So i'm thinking if i need to also include the code in the previous patch which doesn't depend on pigz/pbzip2, or will there be a deployment script that can assure existence of pigz/pbzip2?

billmaimone commented 7 years ago

It's a bit painful to read from diffs. How about a branch?

asuhan commented 7 years ago

Aren't these tools also available as libraries? ~~At least for parallel bzip2 I've found http://lbzip2.org/~~ It's still a standalone executable, unfortunately. I think the integration would be a lot better if we didn't have to spawn processes.

dwayneberry commented 7 years ago

If we could use libs rather than applications that would be nicer as the dependecies can be resolve at build time and shipped with the code

But if we have to use applications we can add the requirements to yum and apt etc but at the end of day in case it slips through the cracks we still need a meaningful error message. Are these applications available on OS X, best if we put out a meaningful error message

asuhan commented 7 years ago

Also, if there are no library versions of the parallel compression tools, let's use https://github.com/facebook/folly/blob/master/folly/Subprocess.h. We depend on Folly anyway.

dwayneberry commented 7 years ago

These tools in general only run parallel on compression not decompression if my memory serves me correctly so we don't need the fancier parallel implementations when really we are only decompressing

fleapapa commented 7 years ago

I was not aware of lbzip2 library. Will need to look into it to see how to integrate it to existing code. Overall, using utility commands is not preferred, is it?

asuhan commented 7 years ago

It's definitely cleaner to link to a library as opposed to spawning a process, but if it's not possible we have folly::Subprocess to assist. Anyway, lbzip2 doesn't look like a library either.

dwayneberry commented 7 years ago

If we can avoid running command line processes it would be a win for simplicity in the long run

fleapapa commented 7 years ago

If lbzip2 is not a library, which of lbzip2 and pbzip2 to choose?

Forking a process to run the utility seems a goal that can be easily achieved in short term. Having folly::subprocess to run the utility is the next phase.

Does it make sense?

Frankly I have no idea yet how much efforts to integrate them in library forms.

fleapapa commented 7 years ago

@billmaimone It's a bit painful to read from diffs. How about a branch?

A branch is better, but i'm in the middle of an attempt to optimize Importer::importDelimited(). The function seems not fully parallelized because all the threads created therein need to join before moving on to read next block (of IMPORT_FILE_BUFFER_SIZE bytes).

After the optimization i will need to clean up my branch, including removing the experimental alternative file format that move all "page headers" to front of files.

asuhan commented 7 years ago

Looks like @dwayneberry is correct and neither can parallelize decompression. The man page of pbzip2 says: Files that were compressed using bzip2 will not see speedup since bzip2 packages the data into a single chunk that cannot be split between processors.. We're better off with coarse parallelization (one worker per file) using the regular versions.

The additional effort of using them as libraries is well worth it, it's not significantly harder than spawning processes. But if we have to spawn processes at least we should do it with some abstractions like the ones in Folly.

dwayneberry commented 7 years ago

Just ran a quick test of a 25G tweetmap file 60m records (compressed gz file is 11G).

uncompresses 647 secs compressed 660 secs

Basically the single thread decompress is far from the bottleneck here

fleapapa commented 7 years ago

@dwayneberry, I guess your test was done on HDD. On my HDD i got similar thruput (10m rows @ 60 secs). But I guess, on SSD and NVMe the effect of parallelized decompression should be more obvious, because the storage will be less a bottleneck.

After optimizing threading model of Importer::importDelimited(), unfortunately i got only 10 seconds faster on COPY the same .bz2 file (4.bz2) of 10m rows. Even though i prepared the file using pbzip2 so the file should be decompressed in parallel by pbzip2 itself, but the optimization is most likely offset by the bigger bottleneck -- slow IOPS of HDD.

$ time echo "copy trips from '/mnt/tmp/4.bz2' with (header='false', max_reject=100);" | /mnt/mapd-core/bin/mapdql mapd     -u $MAPD_USERNAME     -p $MAPD_PASSWORD
User mapd connected to database mapd
Result
Loaded: 10000000 recs, Rejected: 0 recs in 63.392000 secs
User mapd disconnected from database mapd

real    1m3.835s (vs. 1m14s previously) 
user    0m0.004s
sys 0m0.000s

Also, performance pbzip2 seems not as good as pigz, because COPY of the same set of 10m rows in .gz format is significant faster, as show below.

$ time echo "copy trips from '/mnt/tmp/2.gz' with (header='false', max_reject=100);" | /mnt/mapd-core/bin/mapdql mapd     -u $MAPD_USERNAME     -p $MAPD_PASSWORD
User mapd connected to database mapd
Result
Loaded: 10000000 recs, Rejected: 0 recs in 44.714000 secs
User mapd disconnected from database mapd

real    0m45.100s
user    0m0.000s
sys 0m0.004s

Of course, performance of uncompressed data is the best of all, as shown below. However, it's not far ahead of .gz, since slow IOPS is still the main factor.

$ time echo "copy trips from '/mnt/tmp/9t.csv'  with (header='false', max_reject=100);" | /mnt/mapd-core/bin/mapdql mapd     -u $MAPD_USERNAME     -p $MAPD_PASSWORD
User mapd connected to database mapd
Result
Loaded: 10000000 recs, Rejected: 0 recs in 41.609000 secs
User mapd disconnected from database mapd

real    0m42.070s
user    0m0.000s
sys 0m0.008s

I will need some time to clean up my local branch first then check in the code, but i'm afraid that before i figure out how to integrate folly subprocess there will be only popen + pbzip2/pigz + exception (fallback to boostiostream).

fleapapa commented 7 years ago

Just found that a .bz2 file compressed by pbzip2 can't be decompressed by boostiostream.

$ time echo "copy trips from '/mnt/tmp/4.bz2' with (header='false', max_reject=100);" | /mnt/mapd-core/bin/mapdql mapd     -u $MAPD_USERNAME     -p $MAPD_PASSWORD
User mapd connected to database mapd
Exception: bzip2 error: iostream error
User mapd disconnected from database mapd

Also, for the same 10M rows in traditional .bz2 format, boostiostream is no worse than pbzip2.

$ time echo "copy trips from '/mnt/tmp/2.bz2' with (header='false', max_reject=100);" | /mnt/mapd-core/bin/mapdql mapd     -u $MAPD_USERNAME     -p $MAPD_PASSWORD
User mapd connected to database mapd
Result
Loaded: 10000000 recs, Rejected: 0 recs in 71.315000 secs
User mapd disconnected from database mapd

real    1m11.594s
user    0m0.008s
sys 0m0.000s

asuhan commented 7 years ago

Yes, the man page is pretty clear in that regard:

Files  that are compressed with pbzip2 are broken up into pieces and each individual piece is compressed.  This is how pbzip2 runs faster on multiple CPUs since the pieces can be compressed simultaneously.

Since no one is going to use pbzip2 to compress, there's little practical point in using it on decompression side. We should stick to the single-threaded versions.

fleapapa commented 7 years ago

If i understand/remember it correctly, pbzip2/pigz can decompress files regardless if the files were created by pbzip2/pigz or not. So i think it's good to first try pbzip2/pigz just in case the files were created with parallelism; then fall back to single-threaded Boostiostream decompression. If users have pbzip2/pigz-encoded files, they must install pbzip2/pigz.

fleapapa commented 7 years ago

I created a PR for this issue. Besides pbzip2/pigz and fallback to boost::iostreams, i changed the function Importer::importDelimited significantly, trying to parallelizing starting point of 2nd half of data path a little further.

The new code tries to keep _maxthreads threads at all time. If i understand old code correctly, in every cycle it read a 10MB block, cut the block to _maxthreads slices, dispatch the slices to the threads and join them before starting another cycle, so not always _maxthreads threads runs concurrently.

If something goes wrong with the new code, which i haven't seen so far, or if the 15% gain is not worth the non-trivial change, simply restoring old Importer::importDelimited code should still work.

tmostak commented 7 years ago

Hi @fleapapa on a related note can we also handled zipped files? And will this handle multiple files zipped/tarred?

fleapapa commented 7 years ago

@tmostak, as i just tested, pigz can uncompress .zip files. I'll add '.zip' file extension to the code to make it work. For tarred/zipped file, i'm not sure if extending the command pipe in popen() will work. I'll investigate it, as well as with Boost::iostream, and update here later.

fleapapa commented 7 years ago

I added support of .tar, .tgz and .zip. Not sure if i can create another PR before you guys approve the previous PR on path a67de9a

Note: Boost:iostream can't work with tar/tgz/zip. It will no more crash mapd core but fail with an error message.

fleapapa commented 7 years ago

Hi @asuhan, i searched mapd-core code base but found no example of using folly::Subprocess except in folly dir itself where i couldn't find relevant example. For my use case I needs to fetch the stdout file descriptor right after construction of a Subprocess object that executes a command like "pigz -d foo", and create a FILE* from the descriptor so that i can reuse existing code path directly.

After object construction the subprocess instance is in RUNNING state, but somehow fetching stdin descriptor fails an assertion that looks unusual or absurd. I'll keep digging, but if you or anyone happen to have relevant example code to share it will great.

Using folly communicate() will need quite efforts to reroute the piped data and code.

asuhan commented 7 years ago

If folly::Subprocess is not adequate then don't use it. Nevertheless, if you need to spawn processes, you need something similar to it you can build yourself. The constructor can create the named pipe and spawn the process and the destructor can wait for the process and unlink the pipe.

I still think that we should use libraries here, not separate processes. There's not enough value in parallel decompression to justify the separate process acrobatics.

fleapapa commented 7 years ago

@asuhan, regarding "separate process acrobatics", i think popen() may have obvious impact on embedded systems w/o MMU but it seems unlikely users will run mapd on those systems.

Also, unlikely users will run COPY every few seconds or minutes, so any fork() overhead from popen() should be amortized quickly.

After double checking issue #104, I will run a test to verify my perception that one COPY (= one popen) per minute or per few seconds won't impact the system significantly on a normal Linux server. If my perception is wrong, i will adjust the code accordingly.

asuhan commented 7 years ago

I'm not worried at all about some nebulous overhead -- of course it's not that big of a deal to spawn a process in the current year. It's purely a software engineering / integration concern; it's better to not have to manage separate processes if we can. If we cannot avoid it, we should do it with some form of RAII.

fleapapa commented 7 years ago

Here is a status summary per each file format:

.gz/.bz2: using pigz/pbzip2 command first (can change to gzip/bzip2). Fallback to Boost:iostream library.
.zip: using pigz command (will change to unzip). No library fallback.
.tgz/.tar: using tar. No library fallback.

Please share any library for the latter two.

BTW, please elaborate "RAII". (= C++ RAII???)

asuhan commented 7 years ago

Yes, RAII means C++ RAII. Have you looked into libarchive? It supports a lot of formats and the license is friendly.

fleapapa commented 7 years ago

No, but I'm interested with its "Universal Decompressor" now. Hope it can make a minimalist happy.

fleapapa commented 7 years ago

@asuhan. I updated the PR with a solution based on libarchive (and its entire source code), but later i realized that i could just apt install libarchive-dev to use its binary library. Which is the preferred way?

dwayneberry commented 7 years ago

@fleapapa for something like this we would probably try to look at using binary library and static linking

Would it be possible to submit a PR based off the master branch rather than your various intermediate branches as the deltas are all over the place

fleapapa commented 7 years ago

@dwayneberry. Does it mean i need to switch my local checkout to the master branch of https://github.com/mapd/mapd-core rather than my forked master branch (https://github.com/fleapapa/mapd-core)?

I am not authorized to push directly to mapd's master branch.

dwayneberry commented 7 years ago

@fleapapa if you look in your PR you will see there 4 commits with the final commit overriding the earlier commits, and the first commit already applied to the code. I don't think you need to change your local checkout you just need to clean up your commits with some rebasing and squashing to make it easier for others to review what the final changes actually are.

fleapapa commented 7 years ago

@dwayneberry. I've been confused why the first commit is still dangling there, because i have rebased my local branch with mapd's master as last time you suggested me to sync up with mapd's changes. Let me integrate binary libarchive first, then see how to get rid of the dangling commit and squash those commits into one...

fleapapa commented 7 years ago

Today i downloaded NY taxi trip dataset to my new macbook. It's in a .7z file. I had to extract all the files because 7z is not supported in PR .

7z seems popular for archiving big datasets. libarchive supports 7z. Should i add 7z support and recreate the PR?

fleapapa commented 7 years ago

Another alternative is to enable ALL decoders of libarchive. That will increase code size by ~800KB but it will automatically detect file format and fall back to uncompressed format on unknown or uncompressed format. It also saves future revision overhead.

dwayneberry commented 7 years ago

closed with https://github.com/mapd/mapd-core/commit/572252db72a2a9bc6f9ea780e8267f18ce135a78

heavyai / heavydb

Add ability for COPY to read directly from gzip input files #105