thegenemyers / DAZZ_DB

The Dazzler Data Base

Pacbio header line name inconsisten #5

Closed homologus closed 9 years ago

homologus commented 9 years ago

Hi Gene,

I came across an error message while running fasta2DB; it originates from the following code block.

          if (strcmp(read+(rlen+1),prolog) != 0)
            { fprintf(stderr,"File %s.fasta, Line %d: Pacbio header line name inconsisten\n",
                             core,nline);
              goto error;
            }

Apparently it expects the PacBio IDs to be the same for all reads in a file. Is there any reason why I cannot mix PacBio reads from multiple sources?

thegenemyers commented 9 years ago

Yes, the program expects that the header part of each entry in a given file is the same. Recall that the DB has the property that it is invertible, i.e., calling DB2fasta on a DB gives back exactly the .fasta files that were put in. If each header line had a distinct ID, the DB would have to store the header line for every entry. This seemed like a waste of space. If you have 3 different sources, then just keep them in 3 different files. I'm not sure why you want to mix them first. Just give the DB each file in turn. Hope that helps. -- Gene
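
For readers who want to check a file before loading it, here is a minimal sketch of the kind of comparison involved. It assumes PacBio headers have the form `>prolog/well/beg_end ...` and treats everything before the first `/` as the shared part. The helper names `prolog_length` and `same_prolog` are hypothetical and are not taken from the fasta2DB source, which may parse headers differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the header text before the first '/', or of the whole
   string if it contains no '/'.  Assumes headers look like
   ">prolog/well/beg_end ..." (an assumption, not fasta2DB's parser). */
size_t prolog_length(const char *header)
{ const char *slash;

  assert(header != NULL);
  slash = strchr(header,'/');
  return (slash == NULL ? strlen(header) : (size_t) (slash - header));
}

/* Returns 1 if the two header lines share the same prolog, 0 otherwise. */
int same_prolog(const char *a, const char *b)
{ size_t la = prolog_length(a);
  size_t lb = prolog_length(b);

  return (la == lb && strncmp(a,b,la) == 0);
}
```

Comparing every header line of a .fasta file against the first one with `same_prolog` would flag a file that mixes reads from several runs before it is handed to fasta2DB.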

homologus commented 9 years ago

Thanks. I just found that out by modifying your code, running an alignment, and trying to recover the sequences with DB2fasta!

"I'm not sure why you want to mix them first. "

I got about 30 or so files from different runs. Putting them all together seemed like a bright idea (until it did not) :)

I presume HPCdaligner does not work, if I have 30 databases, right?

P. S. I guess you can close this issue.

thegenemyers commented 9 years ago

Just add each of the 30 files to the data base. Then split the DB however you want with DBsplit, and then run HPCdaligner on the split database. It should all be fine. -- Gene

thegenemyers commented 9 years ago

BTW, with the introduction of .dam's, one can input .fasta's with different headers to them, but this is not recommended for read data, only for smallish reference data sets.