rob-p opened 8 years ago
Alright, let's see what I can do to satisfy the vox populi. I made a quick change to the develop branch of JF so one can use the solution below. But maybe I should also have stream_manager support reading directly from std::istream.
As it stands, JF already supports creating named pipes and running sub-processes attached to these named pipes. This is done through "generators". The built-in generator_manager class reads the commands to run from a text file, one command per line, each executed through the shell.
With the latest code from develop, one can easily create a new type of generator: inherit from generator_manager_base and override the virtual method get_cmd(). Every time it is called, this method would look at the next file, determine its type, and return a command that outputs this file as FASTA or FASTQ on its stdout. An example would be (untested):
#include <jellyfish/generator_manager.hpp>
#include <jellyfish/stream_manager.hpp>
#include <jellyfish/whole_sequence_parser.hpp>

typedef std::vector<const char*> file_vector;
typedef file_vector::const_iterator path_iterator;
typedef jellyfish::stream_manager<path_iterator> stream_type;
typedef jellyfish::whole_sequence_parser<stream_type> parser_type;

class generator_auto : public jellyfish::generator_manager_base {
  const char** argv_;
  const size_t argc_;
  size_t       index_;

public:
  generator_auto(const char* argv[], size_t argc, int nb_pipes = 1, const char* shell = "/bin/sh")
    : generator_manager_base(nb_pipes, shell)
    , argv_(argv)
    , argc_(argc)
    , index_(0)
  { }

  std::string get_cmd() {
    std::string command;
    if(index_ < argc_) { // Auto-detection of file type should go inside this if statement
      command  = "cat '";
      command += argv_[index_++];
      command += '\'';
    }
    return command;
  }
};

int main(int argc, char *argv[]) {
  generator_auto generator((const char**)(argv + 1), argc - 1);
  stream_type streams(generator.pipes().end(), generator.pipes().end(),    // No regular files
                      generator.pipes().begin(), generator.pipes().end()); // Only pipes
  parser_type parser(64, 100, 1, streams);
  return 0;
}
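The auto-detection placeholder inside get_cmd() above might be filled in by choosing a decompression command from the file extension. The helper below is a hypothetical sketch (not part of Jellyfish): extension matching is a guess, and a more robust check would inspect the file's magic bytes instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: pick a shell command based on the file's
// extension. In the generator above, this would replace the plain
// "cat" when building the command string in get_cmd().
static bool ends_with(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

std::string cmd_for_file(const std::string& path) {
  std::string cmd;
  if(ends_with(path, ".gz"))       cmd = "zcat '";  // gzip-compressed
  else if(ends_with(path, ".bz2")) cmd = "bzcat '"; // bzip2-compressed
  else                             cmd = "cat '";   // plain FASTA/FASTQ
  cmd += path;
  cmd += '\'';
  return cmd;
}

// Example: cmd_for_file("reads.fa.gz") yields "zcat 'reads.fa.gz'",
// while cmd_for_file("reads.fq") yields "cat 'reads.fq'".
```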
Thanks for the quick reply, and the code! I also think that having stream_manager support reading directly from std::istream would be a nice (additional) solution. It would enable using the parser to read from arbitrarily formatted files without the need to spawn any sub-processes (if there's a reason that's desirable). Would the solution you have above be easy to integrate into a "paired end" parser? I suppose all you have to do is keep the pipes coming from the generator in sync, but the stream_type would probably have to be different.
I agree that this method, although very easy for me to implement (basically, I made one method virtual), is a little heavyweight. I'll implement reading from std::istream.
Regarding the "paired end" parser: suppose you have an external program called pe_merge that takes two FASTA files and outputs the paired reads one after the other; then get_cmd() could return, for each pair of files: pe_merge 'file1' 'file2'. If, in addition, you make sure that the number of sequences read in each group by the whole_sequence_parser class is even (2nd argument to the constructor), then you are guaranteed to receive the pairs together (the even read is the first read, the odd read is its mate).
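The pairing scheme above can be sketched as a get_cmd() that consumes the file list two at a time. This is a hedged sketch only: pe_merge is the hypothetical external program named above, and the class is a standalone stand-in for the generator, untested against Jellyfish itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Standalone sketch of a paired-end command source. Each call to
// get_cmd() hands the next pair of files to the hypothetical
// "pe_merge" program, which interleaves the two inputs on stdout.
class pe_command_source {
  std::vector<std::string> files_;
  std::size_t              index_;

public:
  explicit pe_command_source(std::vector<std::string> files)
    : files_(std::move(files)), index_(0) { }

  // Returns an empty string when the pairs are exhausted, matching
  // the convention of get_cmd() in the example above.
  std::string get_cmd() {
    std::string command;
    if(index_ + 1 < files_.size()) {
      command = "pe_merge '" + files_[index_] + "' '" + files_[index_ + 1] + "'";
      index_ += 2;
    }
    return command;
  }
};
```

With files {a_1.fq, a_2.fq, b_1.fq, b_2.fq}, successive calls yield pe_merge 'a_1.fq' 'a_2.fq', then pe_merge 'b_1.fq' 'b_2.fq', then an empty string.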
Hrmm --- that's a very interesting approach to having the above code support paired-end reads. Actually, such an approach might help unify code to support another feature request from users . . . interleaved FASTQ format! Until users requested it, I really didn't know this format was in such wide use!
Hi Guillaume,
We have a number of Sailfish and Salmon users asking us to directly support gzipped fasta/q files. Right now, the only way to achieve this online (i.e. have sailfish or salmon read directly from the compressed file) is to use process substitution or named pipes. I like this solution, as I find it rather elegant (and it moves the decompression burden --- and the burden of supporting many different file formats --- off of the tool itself). However, this seems to be a very popular request, and the process substitution syntax does sometimes wreak havoc when it needs to be escaped (in qsub scripts or gnu-parallel commands etc.). Since we already rely on Boost, I was wondering if it is possible to use the Boost IOStreams filters with the existing Jellyfish parser? For example, would it be possible to check the format of a file and, if it is gzipped, wrap it in the gzip decompression filter, and then pass it off to the Jellyfish parser somehow?
Thanks, Rob
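The format check mentioned above can be done without guessing from extensions: gzip streams begin with the two magic bytes 0x1f 0x8b. A minimal sketch of that check follows; only the detection is shown, and wiring the result into a Boost IOStreams filter chain (e.g. boost::iostreams::gzip_decompressor) is left out.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the buffer starts with the gzip magic bytes
// (0x1f 0x8b, per the gzip file format). Peeking at the first two
// bytes of a file is enough to decide whether to route it through a
// decompression filter before handing it to the parser.
bool looks_gzipped(const unsigned char* buf, std::size_t n) {
  return n >= 2 && buf[0] == 0x1f && buf[1] == 0x8b;
}

// Example: the header of any .gz file tests true, while a FASTA
// stream starting with '>' tests false.
```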