thegenemyers / DALIGNER

Find all significant local alignments between reads

Crash in free(), but lost test-case #32

Closed pb-cdunn closed 8 years ago

pb-cdunn commented 8 years ago

Unfortunately, the old test-case got deleted. (I should have copied it.) But I can post what I learned about it:

This crashes repeatably, in only ~3 minutes:
  daligner -v -k18 -w8 -h480 -t16 -H15000 -e0.7 -l4800 -M32 raw_reads.202 raw_reads.112

I'll try to get a stack-trace ...

align.c:159
 155 void Free_Work_Data(Work_Data *ework)
 156 { _Work_Data *work = (_Work_Data *) ework;
 157   if (work->vector != NULL)
 158     free(work->vector);
 159   if (work->cells != NULL)
 160     free(work->cells);
 161   if (work->trace != NULL)
 162     free(work->trace);
 163   if (work->points != NULL)
 164     free(work->points);
 165   free(work);
 166 }

filter.c:1834
1500 void Match_Filter(char *aname, HITS_DB *ablock, char *bname, HITS_DB *bblock,
1501                   int self, int comp, Align_Spec *aspec)
...
1833     for (i = 0; i < NTHREADS; i++)
1834       Free_Work_Data(parmr[i].work);

daligner.c:695
519 int main(int argc, char *argv[])
...
695             Match_Filter(aroot,&cblock,broot,&bblock,0,1,asettings);
696             Close_DB(&bblock);
697             free(broot);

Memory is too large for valgrind (24GB). I've added some memory diagnostics, but nothing looks bad. The problem occurs near the end of Match_Filter, between New_Work_Data() and Free_Work_Data(). When I have time, I will try to link with google-tcmalloc, which has a nice memory-checker.
Hi, Chris:

For some sequence comparisons, the function Local_Alignment() in align.c does not reserve enough memory.  (It is hard to predict the right amount of memory for "arbitrary sequences".)  If you change line 1769 from `wsize = VectorEl*10000;` to `wsize = VectorEl*20000;`, the task will finish without a segfault. (See where wsize is used for allocating memory, etc.) I hope this helps.

Cheers,
Jason
Yes, that "fix" works for this example, but it doesn't explain the crash. Memory is being freed twice somewhere. It's probably caused by another buffer over-run somehow. We need to know where, and we need to detect the problem at runtime to avoid these annoying, uninformative crashes.

That's all I have right now.

pb-cdunn commented 8 years ago

This is what Jason had posted in google-chat:

Change 10000 to 20000 in the following code inside align.c:

1766     if (hgh-low >= 7500)
1767       wsize = VectorEl*(hgh-low+1);
1768     else
1769       wsize = VectorEl*20000;

I think some "single molecule weirdness" causes some big gaps. wsize (I guess the memory used for the band) needs to be bigger. However, this fix will use more memory.
thegenemyers commented 8 years ago

Christopher,

I think we may have fixed this one a few days ago (one of my postdocs reported a bug where the alignment routine had a realloc check that was off by one). The character of the error matches what you are reporting. Moreover, I found a pretty bad bug involving masks that I introduced in the Oct. 31 commit, and I've fixed a few other small things that were not quite right.

So could you please take the very most recent version (committed 10 minutes ago) and merge that into your working version? Also, I don't know who works with the code at DNAnexus, but you may want to propagate the news to them.

Also, did you guys notice the new DBdump and LAdump routines that should make it easy for you to get info out of my framework (including things like trace points)?

Cheers,
   Gene


pb-cdunn commented 8 years ago

Great! We will rebase onto your latest code this week. Since I lost the test-case, I won't be able to verify the fix, but no news is good news.

Thanks for telling us about DBdump and LAdump. We'll look into that. Might be helpful.

Maybe you would be interested in some of the changes in our fork. I can think of a couple at least:

A) I added -u# to DBsplit, to set the unit for the -s# option. The reason is that I like to set a tiny block-size in order to force a very large number of blocks, to test distributed behavior. Also, I have a tiny, super-fast test-case with error-free reads from a synthesized template, which serves as an excellent regression-test (since the result is 100% verifiable) and I like to force that tiny input to be split into 2 blocks to increase the coverage of the integration-test.

B) Someone submitted a change to fasta2DB which accepts long lines. That's really helpful because I like to eyeball Sequel data by viewing the start of each line via less without line-wrapping. Also, with error-free in silico sampling, it's helpful to select a long string of letters and search for it elsewhere, to analyze our algorithms. Unfortunately, this submitted change was mixed with code to accept compressed fasta (which we'll drop someday because that could just as easily be handled via standard UNIX pipes), so you won't want to merge this code upstream. But maybe you could come up with a better way. The main idea is that it's ok to rely on the fasta header to indicate when the buffer needs to be increased.

thegenemyers commented 8 years ago

Chris,

Could you please try out the following versions of DBsplit.c and fasta2DB.c? The changes are as follows:

fasta2DB: Should now work with arbitrarily long sequence lines. The header lines still need to be of length less than MAX_NAME (set to 10,000 in DB.h)

DBsplit: Rather than add another option, I simply made the -s option a real number parameter. So anyone who uses it with an int, e.g. -s200, will get the behavior they expect, but if some sneaky person :-) wanted, say, extra-small blocks, they could say -s.01, which would give a block size of 1,000,000 * .01 = 10,000. OK with you?

Let me know if the codes are all good in your hands, and then I will commit if I hear back that all is good.

-- Gene


/********************************************************************************
 *
 *  DBsplit.c
 *
 ********************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include "DB.h"

#ifdef HIDE_FILES
#define PATHSEP "/."
#else
#define PATHSEP "/"
#endif

static char *Usage = "[-a] [-x] [-s<float(200.)>] path:db|dam";

int main(int argc, char *argv[])
{ HITS_DB db, dbs;
  int64   dbpos;
  FILE   *dbfile, *ixfile;
  int     status;

  int   ALL;
  int   CUTOFF;
  int64 SIZE;

  { int   i, j, k;
    int   flags[128];
    char *eptr;
    float size;

ARG_INIT("DBsplit")

CUTOFF = 0;
size   = 200;

j = 1;
for (i = 1; i < argc; i++)
  if (argv[i][0] == '-')
    switch (argv[i][1])
    { default:
        ARG_FLAGS("a")
        break;
      case 'x':
        ARG_NON_NEGATIVE(CUTOFF,"Min read length cutoff")
        break;
      case 's':
        ARG_REAL(size)
        if (size <= 0.)
          { fprintf(stderr,"%s: Block size must be a positive number\n",Prog_Name);
            exit (1);
          }
        break;
    }
  else
    argv[j++] = argv[i];
argc = j;

SIZE = size*1000000ll;
ALL  = flags['a'];

if (argc != 2)
  { fprintf(stderr,"Usage: %s %s\n",Prog_Name,Usage);
    exit (1);
  }

}

// Open db

status = Open_DB(argv[1],&db);
if (status < 0)
  exit (1);
if (db.part > 0)
  { fprintf(stderr,"%s: Cannot be called on a block: %s\n",Prog_Name,argv[1]);
    exit (1);
  }

{ char *pwd, *root;
  char  buffer[2*MAX_NAME+100];
  int   nfiles;
  int   i;

pwd    = PathTo(argv[1]);
if (status)
  { root   = Root(argv[1],".dam");
    dbfile = Fopen(Catenate(pwd,"/",root,".dam"),"r+");
  }
else
  { root   = Root(argv[1],".db");
    dbfile = Fopen(Catenate(pwd,"/",root,".db"),"r+");
  }
ixfile = Fopen(Catenate(pwd,PATHSEP,root,".idx"),"r+");
if (dbfile == NULL || ixfile == NULL)
  exit (1);
free(pwd);
free(root);

if (fscanf(dbfile,DB_NFILE,&nfiles) != 1)
  SYSTEM_ERROR
for (i = 0; i < nfiles; i++)
  if (fgets(buffer,2*MAX_NAME+100,dbfile) == NULL)
    SYSTEM_ERROR

if (fread(&dbs,sizeof(HITS_DB),1,ixfile) != 1)
  SYSTEM_ERROR

if (dbs.cutoff >= 0)
  { printf("You are about to overwrite the current partition settings.  This\n");
    printf("will invalidate any tracks, overlaps, and other derivative files.\n");
    printf("Are you sure you want to proceed? [Y/N] ");
    fflush(stdout);
    if (fgets(buffer,100,stdin) == NULL)
      SYSTEM_ERROR
    if (index(buffer,'n') != NULL || index(buffer,'N') != NULL)
      { printf("Aborted\n");
        fflush(stdout);
        fclose(dbfile);
        exit (1);
      }
  }

dbpos = ftello(dbfile);
fseeko(dbfile,dbpos,SEEK_SET);
fprintf(dbfile,DB_NBLOCK,0);
fprintf(dbfile,DB_PARAMS,SIZE,CUTOFF,ALL);

}

{ HITS_READ *reads  = db.reads;
  int        nreads = db.ureads;
  int64      totlen;
  int        nblock, ireads, treads, rlen, fno;
  int        i;

nblock = 0;
totlen = 0;
ireads = 0;
treads = 0;
fprintf(dbfile,DB_BDATA,0,0);
if (ALL)
  for (i = 0; i < nreads; i++)
    { rlen = reads[i].rlen;
      if (rlen >= CUTOFF)
        { ireads += 1;
          treads += 1;
          totlen += rlen;
          if (totlen >= SIZE)
            { fprintf(dbfile,DB_BDATA,i+1,treads);
              totlen = 0;
              ireads = 0;
              nblock += 1;
            }
        }
    }
else
  for (i = 0; i < nreads; i++)
    { rlen = reads[i].rlen;
      if (rlen >= CUTOFF && (reads[i].flags & DB_BEST) != 0)
        { ireads += 1;
          treads += 1;
          totlen += rlen;
          if (totlen >= SIZE)
            { fprintf(dbfile,DB_BDATA,i+1,treads);
              totlen = 0;
              ireads = 0;
              nblock += 1;
            }
        }
    }

if (ireads > 0)
  { fprintf(dbfile,DB_BDATA,nreads,treads);
    nblock += 1;
  }
fno = fileno(dbfile);
if (ftruncate(fno,ftello(dbfile)) < 0)
  SYSTEM_ERROR

fseeko(dbfile,dbpos,SEEK_SET);
fprintf(dbfile,DB_NBLOCK,nblock);

dbs.cutoff = CUTOFF;
dbs.all    = ALL;
dbs.treads = treads;
rewind(ixfile);
fwrite(&dbs,sizeof(HITS_DB),1,ixfile);

}

fclose(ixfile);
fclose(dbfile);
Close_DB(&db);

exit (0);
}

/********************************************************************************
 *
 *  fasta2DB.c
 *
 ********************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <sys/stat.h>
#include <unistd.h>

#include "DB.h"

#ifdef HIDE_FILES
#define PATHSEP "/."
#else
#define PATHSEP "/"
#endif

static char *Usage = "[-v] [-p] path:string ( -f | input:fasta ... )";

static char number[128] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, };

typedef struct
  { int    argc;
    char **argv;
    FILE  *input;
    int    count;
    char  *name;
  } File_Iterator;

File_Iterator *init_file_iterator(int argc, char **argv, FILE *input, int first)
{ File_Iterator *it;

  it = Malloc(sizeof(File_Iterator),"Allocating file iterator");
  it->argc  = argc;
  it->argv  = argv;
  it->input = input;
  if (input == NULL)
    it->count = first;
  else
    { it->count = 1;
      rewind(input);
    }
  return (it);
}

int next_file(File_Iterator *it)
{ static char nbuffer[MAX_NAME+8];

  if (it->input == NULL)
    { if (it->count >= it->argc)
        return (0);
      it->name = it->argv[it->count++];
    }
  else
    { char *eol;

  if (fgets(nbuffer,MAX_NAME+8,it->input) == NULL)
    { if (feof(it->input))
        return (0);
      SYSTEM_ERROR;
    }
  if ((eol = index(nbuffer,'\n')) == NULL)
    { fprintf(stderr,"%s: Line %d in file list is longer than %d chars!\n",
                     Prog_Name,it->count,MAX_NAME+7);
      it->name = NULL;
      return (1);
    }
  *eol = '\0';
  it->count += 1;
  it->name  = nbuffer;
}

return (1); }

int main(int argc, char *argv[])
{ FILE  *istub, *ostub;
  char  *dbname;
  char  *root, *pwd;

  FILE  *bases, *indx;
  int64  boff, ioff;

  int    ifiles, ofiles;
  char **flist;

  HITS_DB db;
  int     ureads;
  int64   offset;

  FILE *IFILE;
  char *PNAME;
  int   VERBOSE;

// Usage: [-v] path:string ( -f | input:fasta ... )

{ int i, j, k; int flags[128];

ARG_INIT("fasta2DB")

IFILE = NULL;
PNAME = NULL;

j = 1;
for (i = 1; i < argc; i++)
  if (argv[i][0] == '-')
    switch (argv[i][1])
    { default:
        ARG_FLAGS("v")
        break;
      case 'f':
        IFILE = fopen(argv[i]+2,"r");
        if (IFILE == NULL)
          { fprintf(stderr,"%s: Cannot open file of inputs '%s'\n",Prog_Name,argv[i]+2);
            exit (1);
          }
        break;
      case 'p':
        PNAME = argv[i]+2;
        break;
    }
  else
    argv[j++] = argv[i];
argc = j;

VERBOSE = flags['v'];

if ((IFILE == NULL && argc <= 2) || (IFILE != NULL && argc != 2))
  { fprintf(stderr,"Usage: %s %s\n",Prog_Name,Usage);
    exit (1);
  }

}

// Try to open DB file; if present then adding to DB, otherwise creating new DB.
//   Set up variables as follows:
//     dbname = full name of db = <pwd>/<root>.db
//     istub  = open db file (if adding) or NULL (if creating)
//     ostub  = new image of db file (will overwrite old image at end)
//     bases  = .bps file positioned for appending
//     indx   = .idx file positioned for appending
//     ureads = # of reads currently in db
//     offset = offset in .bps at which to place next sequence
//     ioff   = offset in .idx file to truncate to if command fails
//     boff   = offset in .bps file to truncate to if command fails
//     ifiles = # of .fasta files to add
//     ofiles = # of .fasta files already in db
//     flist  = [0..ifiles+ofiles] list of file names (root only) added to db so far

{ int i;

root   = Root(argv[1],".db");
pwd    = PathTo(argv[1]);
dbname = Strdup(Catenate(pwd,"/",root,".db"),"Allocating db name");
if (dbname == NULL)
  exit (1);

if (IFILE == NULL)
  ifiles = argc-2;
else
  { File_Iterator *ng;

    ifiles = 0;
    ng = init_file_iterator(argc,argv,IFILE,2);
    while (next_file(ng))
      ifiles += 1;
    free(ng);
  }

istub = fopen(dbname,"r");
if (istub == NULL)
  { ofiles = 0;

    bases = Fopen(Catenate(pwd,PATHSEP,root,".bps"),"w+");
    indx  = Fopen(Catenate(pwd,PATHSEP,root,".idx"),"w+");
    if (bases == NULL || indx == NULL)
      exit (1);

    fwrite(&db,sizeof(HITS_DB),1,indx);

    ureads  = 0;
    offset  = 0;
    boff    = 0;
    ioff    = 0;
  }
else
  { if (fscanf(istub,DB_NFILE,&ofiles) != 1)
      SYSTEM_ERROR

    bases = Fopen(Catenate(pwd,PATHSEP,root,".bps"),"r+");
    indx  = Fopen(Catenate(pwd,PATHSEP,root,".idx"),"r+");
    if (bases == NULL || indx == NULL)
      exit (1);

    if (fread(&db,sizeof(HITS_DB),1,indx) != 1)
      SYSTEM_ERROR
    fseeko(bases,0,SEEK_END);
    fseeko(indx, 0,SEEK_END);

    ureads = db.ureads;
    offset = ftello(bases);
    boff   = offset;
    ioff   = ftello(indx);
  }

flist  = (char **) Malloc(sizeof(char *)*(ofiles+ifiles),"Allocating file list");
ostub  = Fopen(Catenate(pwd,"/",root,".dbx"),"w+");
if (ostub == NULL || flist == NULL)
  exit (1);

fprintf(ostub,DB_NFILE,ofiles+ifiles);
for (i = 0; i < ofiles; i++)
  { int  last;
    char prolog[MAX_NAME], fname[MAX_NAME];

    if (fscanf(istub,DB_FDATA,&last,fname,prolog) != 3)
      SYSTEM_ERROR
    if ((flist[i] = Strdup(fname,"Adding to file list")) == NULL)
      goto error;
    fprintf(ostub,DB_FDATA,last,fname,prolog);
  }

}

{ int    maxlen;
  int64  totlen, count[4];
  int    pmax, rmax;
  HITS_READ *prec;
  char  *read;
  int    c;
  File_Iterator *ng;

//  Buffer for reads all in the same well

pmax = 100;
prec = (HITS_READ *) Malloc(sizeof(HITS_READ)*pmax,"Allocating record buffer");
if (prec == NULL)
  goto error;

//  Buffer for accumulating .fasta sequence over multiple lines

rmax  = MAX_NAME + 60000;
read  = (char *) Malloc(rmax+1,"Allocating line buffer");
if (read == NULL)
  goto error;

totlen = 0;              //  total # of bases in new .fasta files
maxlen = 0;              //  longest read in new .fasta files
for (c = 0; c < 4; c++)  //  count of acgt in new .fasta files
  count[c] = 0;

//  For each new .fasta file do:

ng = init_file_iterator(argc,argv,IFILE,2);
while (next_file(ng))
  { FILE *input;
    char *path, *core, *prolog;
    int   nline, eof, rlen, pcnt;
    int   pwell;

    if (ng->name == NULL) goto error;

    //  Open it: <path>/<core>.fasta, check that core is not too long,
    //           and checking that it is not already in flist.

    path  = PathTo(ng->name);
    core  = Root(ng->name,".fasta");
    if ((input = Fopen(Catenate(path,"/",core,".fasta"),"r")) == NULL)
      goto error;
    free(path);
    if (strlen(core) >= MAX_NAME)
      { fprintf(stderr,"%s: File name over %d chars: '%.200s'\n",
                       Prog_Name,MAX_NAME,core);
        goto error;
      }

    { int j;

      for (j = 0; j < ofiles; j++)
        if (strcmp(core,flist[j]) == 0)
          { fprintf(stderr,"%s: File %s.fasta is already in database %s.db\n",
                           Prog_Name,core,Root(argv[1],".db"));
            goto error;
          }
    }

    //  Get the header of the first line.  If the file is empty skip.

    pcnt  = 0;
    rlen  = 0;
    nline = 1;
    eof   = (fgets(read,MAX_NAME,input) == NULL);
    if (eof || strlen(read) < 1)
      { fprintf(stderr,"Skipping '%s', file is empty!\n",core);
        fclose(input);
        free(core);
        continue;
      }

    //   Add the file name to flist

    if (VERBOSE)
      { fprintf(stderr,"Adding '%s' ...\n",core);
        fflush(stderr);
      }
    flist[ofiles++] = core;

    // Check that the first line has PACBIO format, and record prolog in 'prolog'.

    if (read[strlen(read)-1] != '\n')
      { fprintf(stderr,"File %s.fasta, Line 1: Fasta line is too long (> %d chars)\n",
                       core,MAX_NAME-2);
        goto error;
      }
    if (!eof && read[0] != '>')
      { fprintf(stderr,"File %s.fasta, Line 1: First header in fasta file is missing\n",core);
        goto error;
      }

    { char *find;
      int   well, beg, end, qv;

      find = index(read+1,'/');
      if (find != NULL && sscanf(find+1,"%d/%d_%d RQ=0.%d\n",&well,&beg,&end,&qv) >= 3)
        { *find = '\0';
          if (PNAME != NULL)
            prolog = Strdup(PNAME,"Extracting prolog");
          else
            prolog = Strdup(read+1,"Extracting prolog");
          *find = '/';
          if (prolog == NULL) goto error;
        }
      else
        { fprintf(stderr,"File %s.fasta, Line %d: Pacbio header line format error\n",
                         core,nline);
          goto error;
        }
    }

    //  Read in all the sequences until end-of-file

    { int i, x;

      pwell = -1;
      while (!eof)
        { int   beg, end, clen;
          int   well, qv;
          char *find;

          find = index(read+(rlen+1),'/');
          if (find == NULL)
            { fprintf(stderr,"File %s.fasta, Line %d: Pacbio header line format error\n",
                             core,nline);
              goto error;
            }
          if (PNAME == NULL)
            { *find = '\0';
              if (strcmp(read+(rlen+1),prolog) != 0)
                { fprintf(stderr,
                          "File %s.fasta, Line %d: Pacbio header line name inconsistent\n",
                          core,nline);
                  goto error;
                }
              *find = '/';
            }
          x = sscanf(find+1,"%d/%d_%d RQ=0.%d\n",&well,&beg,&end,&qv);
          if (x < 3)
            { fprintf(stderr,"File %s.fasta, Line %d: Pacbio header line format error\n",
                             core,nline);
              goto error;
            }
          else if (x == 3)
            qv = 0;

          rlen  = 0;
          while (1)
            { eof = (fgets(read+rlen,MAX_NAME,input) == NULL);
              nline += 1;
              x = strlen(read+rlen)-1;
              if (read[rlen+x] != '\n')
                { if (read[rlen] == '>')
                    { fprintf(stderr,"File %s.fasta, Line %d:",core,nline);
                      fprintf(stderr," Fasta header line is too long (> %d chars)\n",
                                     MAX_NAME-2);
                      goto error;
                    }
                  else
                    x += 1;
                }
              if (eof || read[rlen] == '>')
                break;
              rlen += x;
              if (rlen + MAX_NAME > rmax)
                { rmax = ((int) (1.2 * rmax)) + 1000 + MAX_NAME;
                  read = (char *) realloc(read,rmax+1);
                  if (read == NULL)
                    { fprintf(stderr,"File %s.fasta, Line %d:",core,nline);
                      fprintf(stderr," Out of memory (Allocating line buffer)\n");
                      goto error;
                    }
                }
            }
          read[rlen] = '\0';

          for (i = 0; i < rlen; i++)
            { x = number[(int) read[i]];
              count[x] += 1;
              read[i]   = (char) x;
            }
          ureads += 1;
          totlen += rlen;
          if (rlen > maxlen)
            maxlen = rlen;

          prec[pcnt].origin = well;
          prec[pcnt].fpulse = beg;
          prec[pcnt].rlen   = rlen;
          prec[pcnt].boff   = offset;
          prec[pcnt].coff   = -1;
          prec[pcnt].flags  = qv;

          Compress_Read(rlen,read);
          clen = COMPRESSED_LEN(rlen);
          fwrite(read,1,clen,bases);
          offset += clen;

          if (pwell == well)
            { prec[pcnt].flags |= DB_CSS;
              pcnt += 1;
              if (pcnt >= pmax)
                { pmax = ((int) (pcnt*1.2)) + 100;
                  prec = (HITS_READ *) realloc(prec,sizeof(HITS_READ)*pmax);
                  if (prec == NULL)
                    { fprintf(stderr,"File %s.fasta, Line %d: Out of memory",core,nline);
                      fprintf(stderr," (Allocating read records)\n");
                      goto error;
                    }
                }
            }
          else if (pcnt == 0)
            pcnt += 1;
          else
            { x = 0;
              for (i = 1; i < pcnt; i++)
                if (prec[i].rlen > prec[x].rlen)
                  x = i;
              prec[x].flags |= DB_BEST;
              fwrite(prec,sizeof(HITS_READ),pcnt,indx);
              prec[0] = prec[pcnt];
              pcnt = 1;
            }
          pwell = well;
        }

      //  Complete processing of .fasta file: flush last well group, write file line
      //      in db image, free prolog, and close file

      x = 0;
      for (i = 1; i < pcnt; i++)
        if (prec[i].rlen > prec[x].rlen)
          x = i;
      prec[x].flags |= DB_BEST;
      fwrite(prec,sizeof(HITS_READ),pcnt,indx);

      fprintf(ostub,DB_FDATA,ureads,core,prolog);
    }

    free(prolog);
    fclose(input);
  }

//  Finished loading all sequences: update relevant fields in db record

db.ureads = ureads;
if (istub == NULL)
  { for (c = 0; c < 4; c++)
      db.freq[c] = (float) ((1.*count[c])/totlen);
    db.totlen = totlen;
    db.maxlen = maxlen;
    db.cutoff = -1;
  }
else
  { for (c = 0; c < 4; c++)
      db.freq[c] = (float) ((db.freq[c]*db.totlen + (1.*count[c]))/(db.totlen + totlen));
    db.totlen += totlen;
    if (maxlen > db.maxlen)
      db.maxlen = maxlen;
  }

}

// If db has been previously partitioned then calculate additional partition points and // write to new db file image

if (db.cutoff >= 0)
  { int64 totlen, dbpos, size;
    int   nblock, ireads, tfirst, rlen;
    int   ufirst, cutoff, allflag;
    HITS_READ record;
    int   i;

  if (VERBOSE)
    { fprintf(stderr,"Updating block partition ...\n");
      fflush(stderr);
    }

  //  Read the block portion of the existing db image getting the indices of the first
  //    read in the last block of the exisiting db as well as the partition parameters.
  //    Copy the old image block information to the new block information (except for
  //    the indices of the last partial block)

  if (fscanf(istub,DB_NBLOCK,&nblock) != 1)
    SYSTEM_ERROR
  dbpos = ftello(ostub);
  fprintf(ostub,DB_NBLOCK,0);
  if (fscanf(istub,DB_PARAMS,&size,&cutoff,&allflag) != 3)
    SYSTEM_ERROR
  fprintf(ostub,DB_PARAMS,size,cutoff,allflag); 
  if (allflag)
    allflag = 0;
  else
    allflag = DB_BEST;

  nblock -= 1;
  for (i = 0; i <= nblock; i++)
    { if (fscanf(istub,DB_BDATA,&ufirst,&tfirst) != 2)
        SYSTEM_ERROR
      fprintf(ostub,DB_BDATA,ufirst,tfirst);
    }

  //  Seek the first record of the last block of the existing db in .idx, and then
  //    compute and record partition indices for the rest of the db from this point
  //    forward.

  fseeko(indx,sizeof(HITS_DB)+sizeof(HITS_READ)*ufirst,SEEK_SET);
  totlen = 0;
  ireads = 0;
  for (i = ufirst; i < ureads; i++)
    { if (fread(&record,sizeof(HITS_READ),1,indx) != 1)
        SYSTEM_ERROR
      rlen = record.rlen;
      if (rlen >= cutoff && (record.flags & DB_BEST) >= allflag)
        { ireads += 1;
          tfirst += 1;
          totlen += rlen;
          if (totlen >= size)
            { fprintf(ostub," %9d %9d\n",i+1,tfirst);
              totlen = 0;
              ireads = 0;
              nblock += 1;
            }
        }
    }

  if (ireads > 0)
    { fprintf(ostub,DB_BDATA,ureads,tfirst);
      nblock += 1;
    }

  db.treads = tfirst;

  fseeko(ostub,dbpos,SEEK_SET);
  fprintf(ostub,DB_NBLOCK,nblock);    //  Rewind and record the new number of blocks
}

else
  db.treads = ureads;

rewind(indx);                        //  Write the finalized db record into .idx
fwrite(&db,sizeof(HITS_DB),1,indx);

rewind(ostub);                       //  Rewrite the number of files actually added
fprintf(ostub,DB_NFILE,ofiles);

if (istub != NULL)
  fclose(istub);
fclose(ostub);
fclose(indx);
fclose(bases);

rename(Catenate(pwd,"/",root,".dbx"),dbname); // New image replaces old image

exit (0);

// Error exit: Either truncate or remove the .idx and .bps files as appropriate.
//   Remove the new image file <pwd>/<root>.dbx

error:
  if (ioff != 0)
    { fseeko(indx,0,SEEK_SET);
      if (ftruncate(fileno(indx),ioff) < 0)
        SYSTEM_ERROR
    }
  if (boff != 0)
    { fseeko(bases,0,SEEK_SET);
      if (ftruncate(fileno(bases),boff) < 0)
        SYSTEM_ERROR
    }
  fclose(indx);
  fclose(bases);
  if (ioff == 0)
    unlink(Catenate(pwd,PATHSEP,root,".idx"));
  if (boff == 0)
    unlink(Catenate(pwd,PATHSEP,root,".bps"));

  if (istub != NULL)
    fclose(istub);
  fclose(ostub);
  unlink(Catenate(pwd,"/",root,".dbx"));

  exit (1);
}

thegenemyers commented 8 years ago

PS. The fasta2DB changes do not use the header to determine the length of the sequence line; the code still dynamically keeps expanding the buffer to fit whatever comes. I want fasta2DB to work even when the fasta comes from somewhere other than a Pacbio machine. -- Gene


pb-cdunn commented 8 years ago

Could you please try out the following versions of DBsplit.c and fasta2DB.c? The changes are as follows: fasta2DB: Should now work with arbitrarily long sequence lines.

Wunderbar!

The header lines still need to be of length less than MAX_NAME (set to 10,000 in DB.h)

Gut!

The fasta2DB change does not use the header to determine the length of the sequence line; it still dynamically keeps expanding the buffer to fit whatever comes.

Sehr gut!

DBsplit: Rather than add another option, I simply made the -s option a real number parameter. So anyone that uses it with an int, e.g. -s200, will get the behavior that they expect, but if some sneaky person :-), wanted say extra small blocks, then they could say -s.01 and that would give them a block size of 1,000,000 * .01 = 10,000. OK with you?

Nifty. That's fine for my use-case.

pb-cdunn commented 8 years ago

Instead of posting code into comments, you can use a GitHub gist. And if you want to delete the code from your current comment, as owner of this repo you can Edit any comment.

thegenemyers commented 8 years ago

Let me know if the code works. I did a bit of testing, but I would feel more comfortable if you beat on these versions a bit and then got back to me before I commit them. Thanks, Gene


pb-cdunn commented 8 years ago

Tested. Both work perfectly.

Btw, the convention on GitHub is to have a file called LICENSE which applies to the whole repo, so you don't need the copyright in every file.

thegenemyers commented 8 years ago

Thanks,

I'll commit both in a new version shortly. I appreciate the LICENSE file idea; it sure will make things easier.

-- Gene


thegenemyers commented 8 years ago

Chris,

The mods have been committed.

-- Gene


pb-cdunn commented 8 years ago

I also ran valgrind. No significant errors. In case you're interested ...

valgrind --track-origins=yes --leak-check=full fasta2DB -v raw_reads -f0-rawreads/input.fofn

==9905== Syscall param write(buf) points to uninitialised byte(s)
==9905==    at 0x4F223B0: __write_nocancel (syscall-template.S:81)
==9905==    by 0x4EAFA82: _IO_file_write@@GLIBC_2.2.5 (fileops.c:1261)
==9905==    by 0x4EB0F5B: new_do_write (fileops.c:538)
==9905==    by 0x4EB0F5B: _IO_do_write@@GLIBC_2.2.5 (fileops.c:511)
==9905==    by 0x4EB1C4E: _IO_switch_to_get_mode (genops.c:184)
==9905==    by 0x4EAF43D: _IO_file_seekoff@@GLIBC_2.2.5 (fileops.c:969)
==9905==    by 0x4EA8B8C: rewind (rewind.c:36)
==9905==    by 0x40312C: main (fasta2DB.c:639)
==9905==  Address 0x403807c is not stack'd, malloc'd or (recently) free'd
==9905==  Uninitialised value was created by a heap allocation
==9905==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==9905==    by 0x40357D: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DAZZ_DB/fasta2DB)
==9905==    by 0x401CCF: main (fasta2DB.c:308)
==9905==
==9905==
==9905== HEAP SUMMARY:
==9905==     in use at exit: 74,777 bytes in 11 blocks
==9905==   total heap usage: 19 allocs, 8 frees, 77,744 bytes allocated
==9905==
==9905== LEAK SUMMARY:
==9905==    definitely lost: 0 bytes in 0 blocks
==9905==    indirectly lost: 0 bytes in 0 blocks
==9905==      possibly lost: 0 bytes in 0 blocks
==9905==    still reachable: 74,777 bytes in 11 blocks
==9905==         suppressed: 0 bytes in 0 blocks

Possibly the .idx array was not initialized beyond its last used element.

valgrind --track-origins=yes --leak-check=full --show-leak-kinds=all DBsplit -x5 -s.065536 -a raw_reads

==10075== HEAP SUMMARY:
==10075==     in use at exit: 125 bytes in 2 blocks
==10075==   total heap usage: 12 allocs, 10 frees, 4,514 bytes allocated
==10075==
==10075== 8 bytes in 1 blocks are still reachable in loss record 1 of 2
==10075==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10075==    by 0x4EBF2B9: strdup (strdup.c:42)
==10075==    by 0x402232: Strdup (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DAZZ_DB/DBsplit)
==10075==    by 0x4012A2: main (DBsplit.c:89)
==10075==
==10075== 117 bytes in 1 blocks are still reachable in loss record 2 of 2
==10075==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10075==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10075==    by 0x4026DA: Catenate (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DAZZ_DB/DBsplit)
==10075==    by 0x40353A: Open_DB (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DAZZ_DB/DBsplit)
==10075==    by 0x401785: main (DBsplit.c:127)
==10075==
==10075== LEAK SUMMARY:
==10075==    definitely lost: 0 bytes in 0 blocks
==10075==    indirectly lost: 0 bytes in 0 blocks
==10075==      possibly lost: 0 bytes in 0 blocks
==10075==    still reachable: 125 bytes in 2 blocks
==10075==         suppressed: 0 bytes in 0 blocks

valgrind --leak-check=full --show-leak-kinds=all HPCdaligner -v -dal4 -t16 -h1 -e.70 -l1 -s1000 -H1 raw_reads 1-2

==10170== HEAP SUMMARY:
==10170==     in use at exit: 934 bytes in 5 blocks
==10170==   total heap usage: 9 allocs, 4 frees, 2,082 bytes allocated
==10170==
==10170== 12 bytes in 1 blocks are still reachable in loss record 1 of 5
==10170==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10170==    by 0x4EBF2B9: strdup (strdup.c:42)
==10170==    by 0x41A092: Strdup (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/HPCdaligner)
==10170==    by 0x4011C8: main (HPCdaligner.c:102)
==10170==
==10170== 80 bytes in 1 blocks are still reachable in loss record 2 of 5
==10170==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10170==    by 0x419F8D: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/HPCdaligner)
==10170==    by 0x40129B: main (HPCdaligner.c:118)
==10170==
==10170== 119 bytes in 1 blocks are still reachable in loss record 3 of 5
==10170==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10170==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10170==    by 0x41A53A: Catenate (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/HPCdaligner)
==10170==    by 0x4023E2: main (HPCdaligner.c:234)
==10170==
==10170== 155 bytes in 1 blocks are still reachable in loss record 4 of 5
==10170==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10170==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10170==    by 0x41A63A: Numbered_Suffix (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/HPCdaligner)
==10170==    by 0x40278E: main (HPCdaligner.c:299)
==10170==
==10170== 568 bytes in 1 blocks are still reachable in loss record 5 of 5
==10170==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10170==    by 0x4EA4ECC: __fopen_internal (iofopen.c:73)
==10170==    by 0x41A14E: Fopen (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/HPCdaligner)
==10170==    by 0x40242D: main (HPCdaligner.c:236)
==10170==
==10170== LEAK SUMMARY:
==10170==    definitely lost: 0 bytes in 0 blocks
==10170==    indirectly lost: 0 bytes in 0 blocks
==10170==      possibly lost: 0 bytes in 0 blocks
==10170==    still reachable: 934 bytes in 5 blocks
==10170==         suppressed: 0 bytes in 0 blocks

valgrind --leak-check=full --show-leak-kinds=all daligner -v -h1 -t16 -H1 -e0.7 -l1 -s1000 raw_reads.1 raw_reads.1

==10550== HEAP SUMMARY:
==10550==     in use at exit: 2,297,275 bytes in 12 blocks
==10550==   total heap usage: 65 allocs, 53 frees, 14,537,012 bytes allocated
==10550==
==10550== Thread 1:
==10550== 9 bytes in 1 blocks are still reachable in loss record 1 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x53E32B9: strdup (strdup.c:42)
==10550==    by 0x4225B2: Strdup (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x402ACC: main (daligner.c:577)
==10550==
==10550== 12 bytes in 1 blocks are still reachable in loss record 2 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x53E32B9: strdup (strdup.c:42)
==10550==    by 0x422823: Root (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x403A98: main (daligner.c:686)
==10550==
==10550== 40 bytes in 1 blocks are still reachable in loss record 3 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4224AD: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x402BFE: main (daligner.c:598)
==10550==
==10550== 48 bytes in 1 blocks are still reachable in loss record 4 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4224AD: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x40BF36: New_Align_Spec (align.c:242)
==10550==    by 0x403AD0: main (daligner.c:688)
==10550==
==10550== 80 bytes in 1 blocks are still reachable in loss record 5 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4224AD: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x402BDE: main (daligner.c:597)
==10550==
==10550== 117 bytes in 1 blocks are still reachable in loss record 6 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10550==    by 0x422A5A: Catenate (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x4238BA: Open_DB (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x4021DD: read_DB (daligner.c:381)
==10550==    by 0x403A58: main (daligner.c:682)
==10550==
==10550== 156 bytes in 1 blocks are still reachable in loss record 7 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10550==    by 0x422B5A: Numbered_Suffix (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x40B537: Match_Filter (filter.c:2188)
==10550==    by 0x403F2F: main (daligner.c:736)
==10550==
==10550== 1,400 bytes in 1 blocks are possibly lost in loss record 8 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x423D8E: Open_DB (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x4021DD: read_DB (daligner.c:381)
==10550==    by 0x403A58: main (daligner.c:682)
==10550==
==10550== 66,037 bytes in 1 blocks are possibly lost in loss record 9 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x427459: Read_All_Sequences (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x4024BB: read_DB (daligner.c:439)
==10550==    by 0x403A58: main (daligner.c:682)
==10550==
==10550== 131,072 bytes in 1 blocks are still reachable in loss record 10 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4224AD: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x40C0EE: New_Align_Spec (align.c:267)
==10550==    by 0x403AD0: main (daligner.c:688)
==10550==
==10550== 1,049,152 bytes in 1 blocks are still reachable in loss record 11 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4224AD: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x406153: Sort_Kmers (filter.c:762)
==10550==    by 0x403D38: main (daligner.c:718)
==10550==
==10550== 1,049,152 bytes in 1 blocks are still reachable in loss record 12 of 12
==10550==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10550==    by 0x4224AD: Malloc (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/daligner)
==10550==    by 0x406153: Sort_Kmers (filter.c:762)
==10550==    by 0x403F86: main (daligner.c:741)
==10550==
==10550== LEAK SUMMARY:
==10550==    definitely lost: 0 bytes in 0 blocks
==10550==    indirectly lost: 0 bytes in 0 blocks
==10550==      possibly lost: 67,437 bytes in 2 blocks
==10550==    still reachable: 2,229,838 bytes in 10 blocks
==10550==         suppressed: 0 bytes in 0 blocks

valgrind --leak-check=full --show-leak-kinds=all LAsort -v raw_reads.1.raw_reads.1.C0 raw_reads.1.raw_reads.1.N0 raw_reads.1.raw_reads.1.C1 raw_reads.1.raw_reads.1.N1 raw_reads.1.raw_reads.1.C2 raw_reads.1.raw_reads.1.N2 raw_reads.1.raw_reads.1.C3 raw_reads.1.raw_reads.1.N3

==10692== Warning: set address range perms: large range [0x3a048040, 0x1286fa844) (undefined)
==10692== Warning: set address range perms: large range [0x3a048028, 0x759f4a58) (noaccess)
==10692==
==10692== HEAP SUMMARY:
==10692==     in use at exit: 146 bytes in 2 blocks
==10692==   total heap usage: 50 allocs, 48 frees, 1,000,032,146 bytes allocated
==10692==
==10692== 7 bytes in 1 blocks are still reachable in loss record 1 of 2
==10692==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10692==    by 0x4EBF2B9: strdup (strdup.c:42)
==10692==    by 0x418842: Strdup (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/LAsort)
==10692==    by 0x401425: main (LAsort.c:134)
==10692==
==10692== 139 bytes in 1 blocks are still reachable in loss record 2 of 2
==10692==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10692==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10692==    by 0x418CEA: Catenate (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/LAsort)
==10692==    by 0x401759: main (LAsort.c:175)
==10692==
==10692== LEAK SUMMARY:
==10692==    definitely lost: 0 bytes in 0 blocks
==10692==    indirectly lost: 0 bytes in 0 blocks
==10692==      possibly lost: 0 bytes in 0 blocks
==10692==    still reachable: 146 bytes in 2 blocks
==10692==         suppressed: 0 bytes in 0 blocks

valgrind --leak-check=full --show-leak-kinds=all LAmerge -v raw_reads.1 raw_reads.1.raw_reads.1.C0.S raw_reads.1.raw_reads.1.N0.S raw_reads.1.raw_reads.1.C1.S raw_reads.1.raw_reads.1.N1.S raw_reads.1.raw_reads.1.C2.S raw_reads.1.raw_reads.1.N2.S raw_reads.1.raw_reads.1.C3.S raw_reads.1.raw_reads.1.N3.S

==10724== Warning: set address range perms: large range [0x3a048040, 0x1286fa844) (undefined)
==10724== Warning: set address range perms: large range [0x3a048028, 0x1286fa85c) (noaccess)
==10724==
==10724== HEAP SUMMARY:
==10724==     in use at exit: 149 bytes in 2 blocks
==10724==   total heap usage: 33 allocs, 31 frees, 4,000,006,303 bytes allocated
==10724==
==10724== 8 bytes in 1 blocks are still reachable in loss record 1 of 2
==10724==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10724==    by 0x4EBF2B9: strdup (strdup.c:42)
==10724==    by 0x4192C2: Strdup (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/LAmerge)
==10724==    by 0x4018CF: main (LAmerge.c:229)
==10724==
==10724== 141 bytes in 1 blocks are still reachable in loss record 2 of 2
==10724==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
==10724==    by 0x4C2CA8B: realloc (vg_replace_malloc.c:692)
==10724==    by 0x41976A: Catenate (in /home/UNIXHOME/cdunn/repo/gh/FALCON-integrate/DALIGNER/LAmerge)
==10724==    by 0x401CDC: main (LAmerge.c:277)
==10724==
==10724== LEAK SUMMARY:
==10724==    definitely lost: 0 bytes in 0 blocks
==10724==    indirectly lost: 0 bytes in 0 blocks
==10724==      possibly lost: 0 bytes in 0 blocks
==10724==    still reachable: 149 bytes in 2 blocks
==10724==         suppressed: 0 bytes in 0 blocks