gmarcais / Jellyfish

A fast multi-threaded k-mer counter

Using gzip streams with the Jellyfish parser #47

Open rob-p opened 8 years ago

rob-p commented 8 years ago

Hi Guillaume,

We have a number of Sailfish and Salmon users asking us to directly support gzipped FASTA/FASTQ files. Right now, the only way to achieve this online (i.e., have Sailfish or Salmon read directly from the compressed file) is to use process substitution or named pipes. I like this solution, as I find it rather elegant (it moves the decompression burden --- and the burden of supporting many different file formats --- off of the tool itself). However, this is a very popular request, and the process substitution syntax sometimes wreaks havoc when it needs to be escaped (in qsub scripts, GNU parallel commands, etc.). Since we already rely on Boost, I was wondering if it is possible to use the Boost IOStreams filters with the existing Jellyfish parser? For example, would it be possible to check the format of a file, and if it is gzipped, wrap it in the gzip decompression filter and then pass it off to the Jellyfish parser somehow?
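For concreteness, a minimal sketch of the process-substitution workaround (the salmon invocation in the comment is illustrative only, not taken from this thread):

```shell
# Process substitution hands the tool an ordinary readable path
# (e.g. /dev/fd/63) that streams the decompressed data -- no temp file.
printf '>read1\nACGT\n' | gzip -c > reads.fa.gz   # toy input for the demo
cat <(gunzip -c reads.fa.gz)                      # stand-in for e.g.:
# salmon quant -i idx -l A -r <(gunzip -c reads.fq.gz) -o out
```

Note that `<( )` is a bash/zsh feature, not POSIX sh, which is exactly why it gets mangled under qsub scripts that default to `/bin/sh` or when the command line is re-quoted by another tool.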

Thanks, Rob

gmarcais commented 8 years ago

Alright, let's see what I can do to satisfy the vox populi. I made a quick change to the develop branch of JF so one can use the solution below. But maybe I should also have stream_manager support reading directly from std::istream.

As it stands, JF already supports creating named pipes and running sub-processes attached to those pipes. This is done through "generators". The built-in generator_manager class reads the commands to run from a text file, one command per line, each executed through the shell.
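As an illustration of the built-in mechanism, a generators file might look like the following (a sketch; the file name and samples are made up, and the `jellyfish count -g`/`-G` flags should be checked against `jellyfish count --help` for your version):

```shell
# generators.txt -- one shell command per line, each writing FASTA/FASTQ to its stdout
gunzip -c sample1.fastq.gz
bzip2 -dc sample2.fastq.bz2
cat sample3.fastq
```

Such a file is passed to jellyfish count via `-g generators.txt`, with `-G` controlling how many generator commands run concurrently.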

With the latest code from develop, one can easily create a new type of generator: inherit from generator_manager_base and override the virtual method get_cmd(). Every time it is called, this method looks at the next file, determines its type, and returns a command that writes the file as FASTA or FASTQ on its stdout. An example would be (untested):

#include <jellyfish/generator_manager.hpp>
#include <jellyfish/stream_manager.hpp>
#include <jellyfish/whole_sequence_parser.hpp>

#include <string>
#include <vector>

typedef std::vector<const char*>                      file_vector;
typedef file_vector::const_iterator                   path_iterator;
typedef jellyfish::stream_manager<path_iterator>      stream_type;
typedef jellyfish::whole_sequence_parser<stream_type> parser_type;

// A generator that turns each input path into a shell command writing
// FASTA/FASTQ to stdout. File-type auto-detection belongs in get_cmd().
class generator_auto : public jellyfish::generator_manager_base {
  const char** argv_;  // input file paths
  const size_t argc_;  // number of paths
  size_t       index_; // next path to hand out
public:
  generator_auto(const char* argv[], size_t argc, int nb_pipes = 1, const char* shell = "/bin/sh")
    : generator_manager_base(nb_pipes, shell)
    , argv_(argv)
    , argc_(argc)
    , index_(0)
  { }

  std::string get_cmd() {
    std::string command;
    if(index_ < argc_) { // Auto-detection of file type should go inside this if statement
      command  = "cat '";
      command += argv_[index_++];
      command += '\''; // note: single quoting breaks if the path contains a quote
    }
    return command; // an empty string signals that there are no more files
  }
};

int main(int argc, char *argv[]) {
  // Treat every command-line argument as an input file for the generator.
  generator_auto generator((const char**)(argv + 1), argc - 1);

  stream_type streams(generator.pipes().end(), generator.pipes().end(), // No regular files
                      generator.pipes().begin(), generator.pipes().end()); // Only pipes
  // 2nd argument (100) is the number of sequences read per group.
  parser_type parser(64, 100, 1, streams);

  return 0;
}
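To fill in the auto-detection step that the comment inside get_cmd() alludes to, one simple approach is to dispatch on the file extension. A sketch (cmd_for_file and its command choices are assumptions, not Jellyfish API; single-quoting assumes paths contain no quotes):

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: choose a decompression command based on extension.
std::string cmd_for_file(const std::string& path) {
  auto ends_with = [&](const char* suf) {
    const size_t n = std::strlen(suf);
    return path.size() >= n && path.compare(path.size() - n, n, suf) == 0;
  };
  const std::string quoted = "'" + path + "'";
  if(ends_with(".gz"))  return "gzip -dc "  + quoted;
  if(ends_with(".bz2")) return "bzip2 -dc " + quoted;
  return "cat " + quoted;  // plain FASTA/FASTQ: pass through unchanged
}
```

The returned string drops straight into the get_cmd() body above in place of the hard-coded `cat` command. Sniffing magic bytes (0x1f 0x8b for gzip) instead of extensions would be more robust, at the cost of opening each file once.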
rob-p commented 8 years ago

Thanks for the quick reply, and the code! I also think that having stream_manager support reading directly from std::istream would be a nice (additional) solution. It would enable using the parser to read from arbitrarily formatted files without the need to spawn any sub-processes (if, for some reason, that's desirable). Would the solution you have above be easy to integrate into a "paired end" parser? I suppose all you have to do is keep the pipes coming from the generator in sync, but the stream_type would probably have to be different.

gmarcais commented 8 years ago

I agree that this method, although very easy for me to implement (basically, I made one method virtual), is a little heavyweight. I'll implement reading from std::istream.

Regarding the "paired end" parser: suppose you have an external program called pe_merge that takes two FASTA files and outputs the paired reads one after the other. Then get_cmd() could return, for each pair of files: pe_merge 'file1' 'file2'. If, in addition, you make sure that the number of sequences read in each group by the whole_sequence_parser class is even (the 2nd argument to the constructor), then you are guaranteed to receive the pairs together (an even-indexed read is the first read, the following odd-indexed read is its mate).
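Sketched as code (pe_merge is the hypothetical external tool described above; the class below only builds the command strings and is not wired into generator_manager_base):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each call to get_cmd() consumes the next pair of paths and returns the
// shell command to run; an empty string signals that no pairs remain.
class pe_generator {
  std::vector<std::string> files_; // paths, taken pairwise: f1,f2, f1,f2, ...
  size_t index_ = 0;               // next unconsumed path
public:
  explicit pe_generator(std::vector<std::string> files)
    : files_(std::move(files)) { }

  std::string get_cmd() {
    std::string command;
    if(index_ + 1 < files_.size()) {
      command = "pe_merge '" + files_[index_] + "' '" + files_[index_ + 1] + "'";
      index_ += 2;
    }
    return command;
  }
};
```

With an even group size in whole_sequence_parser, as noted above, each group then contains complete pairs.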

rob-p commented 8 years ago

Hrmm --- that's a very interesting approach to having the above code support paired-end reads. Actually, such an approach might help unify code to support another feature request from users: interleaved FASTQ format! Before the user requests came in, I really did not know this format was in wide use at all!