jsh58 / NGmerge

Merging paired-end reads and removing adapters
MIT License
45 stars 15 forks source link

support for Bash process substitution #22

Closed dariomel closed 8 months ago

dariomel commented 3 years ago

it would be nice if this program allowed Bash process substitution to be used for the input files. For example, one might want to run a command like the following:

NGmerge -a -1 <(zcat R1.fastq.gz | head -n4000) -2 <(R2.fastq.gz | head -n4000) -o temp.fastq -i -v 

Currently, the above command causes the program to fail with the following error:

Processing files: /dev/fd/63,/dev/fd/62
Error! Input file does not follow fastq format

Below is an example of modifications to the code that work on my system (Ubuntu 16.04). The modified code starts after the comment "push back chars". The solution is to use gzdopen instead of gzopen. See also the attached diff file diff.txt.

bool openRead(char* inFile, File* in) {

  // open file or stdin
  bool stdinBool = (strcmp(inFile, "-") ? false : true);
  FILE* dummy = (stdinBool ? stdin : fopen(inFile, "r"));
  if (dummy == NULL)
    exit(error(inFile, ERROPEN));

  // check for gzip compression: magic number 0x1F, 0x8B
  bool gzip = true;
  int save = 0;  // first char to pushback (for stdin)
  int i, j;
  for (i = 0; i < 2; i++) {
    j = fgetc(dummy);
    if (j == EOF)
      exit(error(inFile, ERROPEN));
    if ( (i && (unsigned char) j != 0x8B)
        || (! i && (unsigned char) j != 0x1F) ) {
      gzip = false;
      break;
    }
    if (! i)
      save = j;
  }

  // push back chars
  if (ungetc(j, dummy) == EOF)
    exit(error("", ERRUNGET));
  if (i && ungetc(save, dummy) == EOF)
    exit(error("", ERRUNGET));

  // open file
  if (! stdinBool)
    rewind(dummy);
  in->f = dummy;
  if (gzip) {
    in->gzf = gzdopen(fileno(in->f), "r");
    if (in->gzf == NULL)
      exit(error(inFile, ERROPEN));
  }

  return gzip;
}
jsh58 commented 3 years ago

Thanks for the comment and suggestion. It is true that the program will fail with process substitution due to the inability to push back characters to the stream. However, the choice was made to support piping, and your proposed solution breaks that functionality. So if your modified code works for you, good. A better solution would require a bit more thought.

dariomel commented 3 years ago

Indeed, this is just a feature request and the code I provided was meant to be an example of using gzdopen to avoid closing and reopening the input files. I did not test the modified program with all possible combinations of inputs. For example, my hacks involving fclose in function runProgram cause Segmentation fault when running the program with gzipped input files. However, after omitting those changes, the following commands produced identical output files on my system:

NGmerge -a -1 R1.fastq.gz -2 R2.fastq.gz -o temp1.fastq -i -v -y

zcat R1.fastq.gz | NGmerge -a -1 - -2 R2.fastq.gz -o temp2.fastq -i -v -y

s='H;0~4{g;s/\n//;y/\n/\t/;p;z;h}'; paste -d'\n' <(sed -n -e$s <(zcat R1.fastq.gz)) <(sed -n -e$s <(zcat R2.fastq.gz)) | tr '\t' '\n' | NGmerge -a -1 - -o temp3.fastq -i -v -y

where the last command uses sed, paste, and tr to interleave the input FASTQ files. So it seems that support for "piping" is not broken. I agree that a better solution would require a bit more thought, and definitely more testing.

jsh58 commented 3 years ago

I was imprecise when I wrote that support for piping was broken with your modified code. It is the failure to throw an error with a streamed gzip-compressed file:

$ cat zzz3.fq.gz | ./NGmerge_edit -1- -o /dev/null -av
Warning: only one input file specified -- assuming interleaved
Processing files: -,(interleaved)
  Fragments (pairs of reads) analyzed: 0
  Successfully stitched: 0

Here are the results with the current version of NGmerge:

$ cat zzz3.fq.gz | ./NGmerge -1- -o /dev/null -av
Warning: only one input file specified -- assuming interleaved
Error! Cannot pipe in gzip compressed file (use zcat instead)

$ zcat zzz3.fq.gz | ./NGmerge -1- -o /dev/null -av
Warning: only one input file specified -- assuming interleaved
Processing files: -,(interleaved)
  Fragments (pairs of reads) analyzed: 2
  Adapters removed: 0