thegenemyers / DALIGNER

Find all significant local alignments between reads

DASqv error after merging and then splitting las files #70

Closed zkstewart closed 6 years ago

zkstewart commented 6 years ago

Hi,

I have recently been attempting to assemble a genome using HINGE, and due to some misunderstanding of how to best use this assembler, I have run through a pipeline using these commands:

DBsplit -x500 -s100 $PROJNAME
HPC.daligner -T$THREADS -t5 $PROJNAME | csh -v
LAmerge $PROJNAME.tmp.las1 $(find ./ -name "projname.[[:digit:]].las" | sort -V)
LAmerge $PROJNAME.tmp.las2 $(find ./ -name "projname.[[:digit:]][[:digit:]].las" | sort -V)
LAmerge $PROJNAME.tmp.las3 $(find ./ -name "projname.[[:digit:]][[:digit:]][[:digit:]].las" | sort -V)
LAmerge $PROJNAME.las $PROJNAME.tmp*

(I did the three intermediate LAmerge commands since I had more than 252 files, which is more than LAmerge accepts in one call).

After making this combined las file, I deleted the individual las files and tried running the assembler. However, HINGE currently attempts to read the whole las file into memory, and since the combined las file is 940GB, this wasn't working for me. So I then used LAsplit to split the file into more manageable pieces, since HINGE can read in las file chunks. I first tried to split the file into a fixed number of pieces (like below):

LAsplit -v projsplithinge.# 5 < $PROJNAME.las

But I would encounter a segmentation fault when the program was done. The individual file sizes roughly added up to the merged file size, so I'm not sure whether this error was actually a problem. After reading another issue, I decided to split the file based on the database instead (like below):

LAsplit -v projsplithinge.# $PROJNAME.db < $PROJNAME.las

This got me back the original 300 .las files produced by the first pipeline I ran. However, now when I run DASqv (like below):

DASqv -c100 $PROJNAME projsplithinge.*.las

I receive the following error:

DASqv: .las file overlaps don't correspond to reads in block 120 of DB

Many of the other .las files seem to complete successfully.

Apologies for the long story, but do you have any idea what this means, or will I need to re-run the HPC.daligner process to get the unadulterated .las files? It took 7.5 days to run (I have about 100x PacBio coverage of a 260-295Mb genome), so I'm hesitant to re-run the alignment if there is some sort of fix.

Regards, Zac.

zkstewart commented 6 years ago

I just noticed something important: it appears the problem is not with DASqv but with the splitting process. The projsplithinge.120.las file is empty for some reason. I should also mention that I am using the latest commit for the splitting/merging, but not for the alignment (that was performed with the version bundled with HINGE, i.e., 8f179db).

thegenemyers commented 6 years ago

Do you mean empty as in the file size is 0, or empty as in it has no overlaps in it (but does have the 12-byte prolog)?

It's nearly impossible to debug something like this from a description alone.
A smallish data set that exhibits the problem would be ideal; I am happy to debug given a concrete data set to work with.

-- Gene
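
For reference, a .las file begins with a 12-byte prolog: an overlap count stored as an int64 followed by the trace-point spacing stored as an int, exactly as the LAsplit.c attached further down reads it. The sketch below is only an illustration of how one might check whether a block file is header-only; the program name lascheck and its argument handling are hypothetical, not part of DALIGNER.

#include <stdio.h>
#include <stdlib.h>

//  Hypothetical checker: report whether a .las file contains any overlaps.
//  The 12-byte prolog is an int64 overlap count followed by an int trace spacing.

int main(int argc, char *argv[])
{ long long novl;
  int       tspace;
  FILE     *f;

  if (argc != 2)
    { fprintf(stderr,"Usage: lascheck <block.las>\n");
      exit (1);
    }
  f = fopen(argv[1],"r");
  if (f == NULL)
    { fprintf(stderr,"lascheck: cannot open %s\n",argv[1]);
      exit (1);
    }
  if (fread(&novl,sizeof(long long),1,f) != 1 || fread(&tspace,sizeof(int),1,f) != 1)
    { fprintf(stderr,"lascheck: %s is shorter than the 12-byte prolog\n",argv[1]);
      exit (1);
    }
  printf("%s: %lld overlaps, trace spacing %d\n",argv[1],novl,tspace);
  fclose(f);
  return (0);
}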

zkstewart commented 6 years ago

The file has no overlaps (it has the 12-byte prolog), which is unusual since all the files in the original output of HPC.daligner had overlaps. I'm honestly not sure I'll be able to make a data set that replicates this, but I'll play around with the program and see if I can. In the meantime I'll close the issue; if someone else encounters something similar they'll at least see that it has happened before (though it's probably somewhat rare that someone merges and then splits files again).

Zac.

thegenemyers commented 6 years ago

Actually, I think I may know what the problem is (basically an int quantity is overflowing; it needs to be 64-bit to accommodate extreme file sizes). So please try replacing LAsplit.c with the attached copy, remake the program, and let me know if that fixes the problem. If so, then I will update github with the patch.

Cheers,
   Gene
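
To make the failure mode concrete: a byte count or file offset for a 940GB .las file cannot be held in a 32-bit int, which tops out a little above 2GB. The snippet below is only an illustration of that truncation, not the actual patch; the file size is the one reported in this thread.

#include <stdio.h>
#include <stdint.h>

int main()
{ int64_t size64 = 940ll * 1000000000ll;   //  ~940 GB, the merged .las size reported above
  int32_t size32 = (int32_t) size64;       //  what survives if the same quantity is kept in a 32-bit int
                                           //  (implementation-defined truncation on typical platforms)

  printf("as 64-bit: %lld bytes\n",(long long) size64);
  printf("as 32-bit: %d bytes\n",(int) size32);
  return (0);
}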

/*  LAsplit.c: split a merged .las file into <parts> pieces, or according to the
    block partitioning of a .db or .dam                                           */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#include <sys/types.h>
#include <sys/stat.h>

#include "DB.h"
#include "align.h"

static char *Usage = "-v <align:las> (<parts:int> | <path:db|dam>) < <source>.las";

#define MEMORY 1000   //  How many megabytes for output buffer

int main(int argc, char *argv[])
{ char  *iblock, *oblock;
  FILE  *output, *dbvis;
  int64  novl, bsize, ovlsize, ptrsize;
  int    parts, tspace, tbytes;
  int    olast, blast;
  char  *pwd, *root, *root2;

int VERBOSE;

// Process options

{ int i, j, k; int flags[128];

ARG_INIT("LAsplit")

j = 1;
for (i = 1; i < argc; i++)
  if (argv[i][0] == '-')
    { ARG_FLAGS("v") }
  else
    argv[j++] = argv[i];
argc = j;

VERBOSE = flags['v'];

if (argc != 3)
  { fprintf(stderr,"Usage: %s %s\n",Prog_Name,Usage);
    exit (1);
  }

}

{ char *eptr;
  int   nfiles, cutoff, all;
  int64 size;
  char  buffer[2*MAX_NAME+100];

parts = strtol(argv[2],&eptr,10);
if (*eptr != '\0')
  { pwd   = PathTo(argv[2]);
    if (strcmp(argv[2]+(strlen(argv[2])-4),".dam") == 0)
      root  = Root(argv[2],".dam");
    else
      root  = Root(argv[2],".db");
    dbvis = fopen(Catenate(pwd,"/",root,".dam"),"r");
    if (dbvis == NULL)
      { dbvis = fopen(Catenate(pwd,"/",root,".db"),"r");
        if (dbvis == NULL)
          { fprintf(stderr,"%s: Second argument '%s' is not an integer or a DB\n",
                           Prog_Name,argv[2]);
            exit (1);
          }
      }
    free(pwd);
    free(root);

    if (fscanf(dbvis,DB_NFILE,&nfiles) != 1)
      SYSTEM_ERROR
    while (nfiles-- > 0)
      if (fgets(buffer,2*MAX_NAME+100,dbvis) == NULL)
        SYSTEM_ERROR
    parts = 0;
    if (fscanf(dbvis,DB_NBLOCK,&parts) != 1)
      { fprintf(stderr,"%s: DB %s has not been partitioned\n",Prog_Name,argv[2]);
        exit (1);
      }
    if (fscanf(dbvis,DB_PARAMS,&size,&cutoff,&all) != 3)
      SYSTEM_ERROR
    if (fscanf(dbvis,DB_BDATA,&olast,&blast) != 2)
      SYSTEM_ERROR
  }
else
  { dbvis = NULL;
    if (parts <= 0)
      { fprintf(stderr,"%s: Number of parts is not positive\n",Prog_Name);
        exit (1);
      }
  }

}

ptrsize = sizeof(void *);
ovlsize = sizeof(Overlap) - ptrsize;
bsize   = MEMORY * 1000000ll;
oblock  = (char *) Malloc(bsize,"Allocating output block");
iblock  = (char *) Malloc(bsize + ptrsize,"Allocating input block");
if (oblock == NULL || iblock == NULL)
  exit (1);
iblock += ptrsize;   //  so an Overlap record at iptr can be viewed at iptr-ptrsize (trace pointer slot)

pwd  = PathTo(argv[1]);
root = Root(argv[1],".las");

root2 = index(root,'#');
if (root2 == NULL)
  { fprintf(stderr,"%s: No #-sign in source name '%s'\n",Prog_Name,root);
    exit (1);
  }
if (index(root2+1,'#') != NULL)
  { fprintf(stderr,"%s: Two or more occurrences of #-sign in source name '%s'\n",Prog_Name,root);
    exit (1);
  }
*root2++ = '\0';

if (fread(&novl,sizeof(int64),1,stdin) != 1) SYSTEM_ERROR if (fread(&tspace,sizeof(int),1,stdin) != 1) SYSTEM_ERROR if (tspace <= TRACE_XOVR && tspace != 0) tbytes = sizeof(uint8); else tbytes = sizeof(uint16);

if (VERBOSE) fprintf(stderr," Distributing %lld la\'s\n",novl);

{ int      i;
  Overlap *w;
  int64    j, low, hgh, last;
  int64    tsize, povl;
  char    *iptr, *itop;
  char    *optr, *otop;

iptr = iblock;
itop = iblock + fread(iblock,1,bsize,stdin);

hgh = 0;
for (i = 0; i < parts; i++)
  { output = Fopen(Catenate(pwd,"/",Numbered_Suffix(root,i+1,root2),".las"),"w");
    if (output == NULL)
      exit (1);

    low = hgh;
    if (dbvis != NULL)
      { if (fscanf(dbvis,DB_BDATA,&olast,&blast) != 2)
          SYSTEM_ERROR
        last = blast-1;
        hgh  = 0;
      }
    else
      { last = 0;
        hgh  = (novl*(i+1))/parts;
      }

    povl = 0;
    fwrite(&povl,sizeof(int64),1,output);
    fwrite(&tspace,sizeof(int),1,output);

    optr = oblock;
    otop = oblock + bsize;

    for (j = low; j < novl; j++)
      { if (iptr + ovlsize > itop)
          { int64 remains = itop-iptr;
            if (remains > 0)
              memmove(iblock,iptr,remains);
            iptr  = iblock;
            itop  = iblock + remains;
            itop += fread(itop,1,bsize-remains,stdin);
          }

        w = (Overlap *) (iptr-ptrsize);
        if (dbvis == NULL)
          { if (j >= hgh && w->aread > last)
              break;
            last = w->aread;
          }
        else
          { if (w->aread > last)
              break;
          }

        tsize = w->path.tlen*tbytes;
        if (optr + ovlsize + tsize > otop)
          { fwrite(oblock,1,optr-oblock,output);
            optr = oblock;
          }

        memmove(optr,iptr,ovlsize);
        optr += ovlsize;
        iptr += ovlsize;

        if (iptr + tsize > itop)
          { int64 remains = itop-iptr;
            if (remains > 0)
              memmove(iblock,iptr,remains);
            iptr  = iblock;
            itop  = iblock + remains;
            itop += fread(itop,1,bsize-remains,stdin);
          }
        memmove(optr,iptr,tsize);
        optr += tsize;
        iptr += tsize;
      }
    hgh = j;

    if (optr > oblock)
      fwrite(oblock,1,optr-oblock,output);

    rewind(output);
    povl = hgh-low;
    fwrite(&povl,sizeof(int64),1,output);

    if (VERBOSE)
      fprintf(stderr,"  Split off %s: %lld la\'s\n",Numbered_Suffix(root,i+1,root2),povl);

    fclose(output);
  }

}

free(pwd);
free(root);
free(iblock-ptrsize);
free(oblock);

exit (0);
}

zkstewart commented 6 years ago

Thanks for looking into the issue further. It will be a while before I can test this patch, since the file I have has been damaged through the process of merging/splitting. For example, this is the stderr from merging the files back into a single file.

LAmerge: Did not write all records to projhinge.las (-5407558162)

Subsequent splitting of the file results in individual files approximately half the size of what they should be. I am currently running a new HPC.daligner job which should finish in approximately 100 hours. I will make a copy of that file to run some tests splitting and merging with your patch and let you know how it goes.

thegenemyers commented 6 years ago

Very good. Do let me know -- normally it's impossible to debug something like this without a test data set, as I requested, but then I thought about what you were describing and realized a simple explanation would be that some variable or other was overflowing because it was a 32-bit int when it should be a 64-bit int. So I reviewed the code and found one such variable. I'll keep my fingers crossed that this guess is indeed the problem. -- Gene

zkstewart commented 6 years ago

Hi again,

The alignment run completed and I was able to successfully merge the 300 individual .las files into a single file, then split it into 10 separate files using the patch you provided for LAsplit. Thanks for looking into the issue!

Zac.

thegenemyers commented 6 years ago

Yes, that won't work because DASqv expects the .las files to be split into blocks in the same way as the database. I would recommend you re-split the database into 10 pieces, instead of its current 300, then merge your 10 files together again, and then split them into 10 pieces according to the database partition (see the LAsplit documentation). Then DASqv will work on each (big) .las file. -- Gene
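
The invariant behind that error message is that every A-read in block i's .las file must be a read of block i of the database. The sketch below is only an illustration of that check; the program name blockcheck is hypothetical, and the block's first and last read indices are supplied by hand rather than parsed from the .db stub. The record layout follows the LAsplit.c posted above: each overlap is stored without its trace pointer, followed by path.tlen trace bytes.

#include <stdio.h>
#include <stdlib.h>

#include "DB.h"       //  int64, uint8, uint16
#include "align.h"    //  Overlap, Path, TRACE_XOVR

//  Hypothetical checker: verify every A-read in a block .las lies in [first,last],
//  the read index range of the corresponding DB block.

int main(int argc, char *argv[])
{ FILE   *input;
  Overlap ovl;
  int64   novl, i;
  int     tspace, tbytes, first, last;
  int     ptrsize = sizeof(void *);
  int     ovlsize = sizeof(Overlap) - ptrsize;

  if (argc != 4)
    { fprintf(stderr,"Usage: blockcheck <block:las> <first:int> <last:int>\n");
      exit (1);
    }
  input = fopen(argv[1],"r");
  if (input == NULL)
    { fprintf(stderr,"blockcheck: cannot open %s\n",argv[1]);
      exit (1);
    }
  first = atoi(argv[2]);
  last  = atoi(argv[3]);

  if (fread(&novl,sizeof(int64),1,input) != 1)      //  12-byte prolog, as in LAsplit.c
    exit (1);
  if (fread(&tspace,sizeof(int),1,input) != 1)
    exit (1);
  tbytes = (tspace <= TRACE_XOVR) ? sizeof(uint8) : sizeof(uint16);

  for (i = 0; i < novl; i++)
    { if (fread(((char *) &ovl)+ptrsize,ovlsize,1,input) != 1)   //  overlap sans trace pointer
        exit (1);
      if (ovl.aread < first || ovl.aread > last)
        { printf("overlap %lld: aread %d outside block range [%d,%d]\n",
                 (long long) i,ovl.aread,first,last);
          exit (1);
        }
      fseek(input,ovl.path.tlen*tbytes,SEEK_CUR);   //  skip the trace data
    }
  printf("%s: %lld overlaps, all A-reads within [%d,%d]\n",argv[1],(long long) novl,first,last);
  fclose(input);
  return (0);
}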
