thegenemyers / DALIGNER

Find all significant local alignments between reads

LAcheck: Too many alignment records #50

Closed pb-cdunn closed 7 years ago

pb-cdunn commented 7 years ago
$ LAcheck -v raw_reads raw_reads.195.raw_reads.337.C1.las
  raw_reads.195.raw_reads.337.C1: Too many alignment records

The subreads DB is 45GB. If you want it, I could try to make it available for download. Any guesses what could cause this? We are not yet using repeat-masking. Would that be your best recommendation?

pb-cdunn commented 7 years ago

This is still happening. We are now trying to skip .las files which fail LAcheck, since we so often have at least one failure.

thegenemyers commented 7 years ago

This error occurs when the file contains fewer overlaps than the header of the file says should be there.

Are you using the latest version of daligner which uses /tmp to do an initial sort and merge of all the thread files so that there is only one output file per block pair?

That's been the only substantive change. What could be happening is that the /tmp area on your nodes isn't big enough, so the later part of the write fails and you get a truncated .las file. I notice that daligner checks that it can open the /tmp files, but I did not put in checks for failed writes once the /tmp files are being written to. I should probably change this and put in the checks, especially if you report back that this is the problem.

Please advise, Gene
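
To make the failure mode above concrete, here is a minimal Python sketch of how a header count can disagree with what is actually on disk after a truncated write. The layout is a made-up toy (an 8-byte count followed by fixed-size 16-byte records), not the real .las format, and `write_records` and `check_records` are hypothetical helper names.

```python
import struct

# Assumption: a toy layout for illustration only, NOT the real .las format.
HEADER = struct.Struct("<q")      # 8-byte signed record count
RECORD = struct.Struct("<iiii")   # fixed-size 16-byte record


def write_records(path, records, declared=None):
    """Write a count header followed by fixed-size records.

    Passing `declared` larger than len(records) simulates a writer that
    ran out of space after the header was already written.
    """
    count = len(records) if declared is None else declared
    with open(path, "wb") as f:
        f.write(HEADER.pack(count))
        for rec in records:
            f.write(RECORD.pack(*rec))


def check_records(path):
    """Return (declared, found): header count vs. complete records on disk."""
    with open(path, "rb") as f:
        head = f.read(HEADER.size)
        if len(head) < HEADER.size:
            raise ValueError("file too short to hold a header")
        (declared,) = HEADER.unpack(head)
        body = f.read()
    found = len(body) // RECORD.size
    return declared, found
```

In this toy, `found < declared` corresponds to the truncation case described above: the header promises more records than the file holds.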

pb-cdunn commented 7 years ago

No, we are not running the very latest code. We are close to this:

commit 8f179db7fb0bf59f34c4a073b85137a592336d89
Author: thegenemyers <gene.myers@gmail.com>
Date:   Sat Aug 6 06:25:10 2016

    Bug fix in "Entwine" manifesting as trace point errors

    About a month ago introduced a bug when improving handling of redundant
    LAs within daligner.  Fixed.

That seems to differ from latest only in /tmp handling.

I'll pass along more info as I learn more.

thegenemyers commented 7 years ago

Hmm, OK. I'm more than willing to help/debug as datasets or information become available. -- Gene

thegenemyers commented 7 years ago

Chris, Did you ever resolve this? Going through issues and see this one is still open. -- Gene

pb-cdunn commented 7 years ago

I'm not sure of the status of our daligner crashes. I haven't been allowed to work on that for some time. I'll re-open if I see it again.

We do not use the /tmp code. If daligner relied on $TMPDIR, we might then be able to use it, since we have far more space in /scratch than in /tmp. But really, I like your preference for point-tools plus separate integration programs (e.g. HPC.daligner). We have been running in a temp-dir since well before your change. We parse the output of HPC.daligner and re-group the commands. We will probably replace HPC.daligner someday, but we would rather not have to replace the daligner executable itself.
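
As a rough illustration of the `$TMPDIR` idea, a wrapper could pick its scratch location along these lines. This is only a sketch in Python; `scratch_dir` is a hypothetical helper, and nothing here reflects what daligner itself does.

```python
import os


def scratch_dir(environ=None, default="/tmp"):
    """Return $TMPDIR if it is set and non-empty, otherwise the default.

    `environ` is injectable so the lookup can be exercised without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return env.get("TMPDIR") or default
```

On a node where `TMPDIR=/scratch` is exported, this returns `/scratch`; with the variable unset or empty it falls back to `/tmp`.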

thegenemyers commented 7 years ago

My understanding is that what you would prefer is that, instead of having modified daligner to use /tmp, I should have modified HPC.daligner to produce scripts that did so. I'll think about it. -- Gene

pb-cdunn commented 7 years ago

Yes, and only if a --tmpdir flag is passed (so we do not need to change our parser immediately).

Job-distribution is a very hard problem, highly dependent upon specific user environments, so we will probably never be able to use HPC.daligner exactly as intended. But I do understand the reason for running in /tmp, including the initial sort/merge.

I have another idea for you, but I will open a separate Issue for that presently...

thegenemyers commented 7 years ago

So what I will contemplate is backing out of having daligner use /tmp and instead having it take a directory path where it will write the .N# and .C# files. Then a new HPC.daligner script generator will generate daligner "jobs" that run daligner, directing the .N and .C files to a parameterizable path (by default /tmp), and then follow with the sort and merge of said files (in the same job), placing the resulting block pair .las back on the distributed file system.

If I did this would you use it thusly?

-- Gene
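
The shape of such a job (write intermediates to node-local scratch, combine them there, then move only the final result to shared storage) can be sketched in Python as follows. `run_job`, `produce`, and `combine` are hypothetical stand-ins for the alignment step and the sort/merge step; no real daligner options are implied.

```python
import os
import shutil
import tempfile


def run_job(produce, combine, final_path, scratch_root=None):
    """Run one job entirely in a scratch directory, then publish the result.

    produce(work_dir)        -> list of intermediate file paths (hypothetical)
    combine(parts, out_path) -> path of the single combined file (hypothetical)
    scratch_root             -> parent of the scratch dir; None means the
                                platform default (which honors $TMPDIR)
    """
    with tempfile.TemporaryDirectory(dir=scratch_root) as work:
        parts = produce(work)
        merged = combine(parts, os.path.join(work, "merged.out"))
        # shutil.move falls back to copy-and-delete across file systems.
        shutil.move(merged, final_path)
    # The scratch directory and all intermediates are gone at this point.
    return final_path
```

Only the combined file ever touches the shared file system; the intermediates are removed with the scratch directory when the job ends.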

pb-cdunn commented 7 years ago

We parse the output of HPC.daligner to create a script for each daligner "job", which we then distribute. We also parse to generate scripts for merging. I guess I can't say how practical it will be to parse your new output until I see it. What we have works for now.
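
For readers unfamiliar with this kind of re-grouping, a minimal Python sketch follows. It assumes the generated script is plain text in which `#` comment lines introduce each stage and every other non-empty line is one command; that layout is an assumption for illustration, and `group_by_stage` and `chunk` are hypothetical helpers, not our actual parser.

```python
def group_by_stage(script_text):
    """Group command lines under the most recent '#' comment line.

    Returns a list of (stage_title, [commands]) in file order. Commands
    that appear before any comment line are grouped under the title "".
    """
    stages = []
    current = None
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = (line.lstrip("# ").strip(), [])
            stages.append(current)
        else:
            if current is None:
                current = ("", [])
                stages.append(current)
            current[1].append(line)
    return stages


def chunk(commands, size):
    """Split a command list into batches of at most `size` commands."""
    return [commands[i:i + size] for i in range(0, len(commands), size)]
```

Each batch returned by `chunk` can then be written out as one script and handed to whatever job scheduler the site uses.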