marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

purgeOverlaps=aggressive trying to remove shared stageDirectory #1492

Closed BioNica closed 4 years ago

BioNica commented 5 years ago

Dear developers, (sorry for the long post)

I'm using Canu to assemble a 2.5 Gb mammalian genome. It is running on a cluster with Slurm, but at the moment I'm using a single fat node (144 CPUs and 3024 GB of memory). I'm still in the correction step.

After running out of space several times (with Canu release v1.8) and dying during 'cormhap' and 'ovB', I followed the suggestions from @skoren (https://github.com/marbl/canu/issues/1465), installed what I believe is version 1.9 (actually by cloning the git version, Canu snapshot v1.8 +358, because the latest release distributed as a .tar.gz is 1.8), and prepared the spec file as follows:

useGrid=false
stageDirectory=/scr/${USER}-${SLURM_JOB_ID}
gnuplotImageFormat=png
corMhapFilterThreshold=0.0000000002 
corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" 
mhapMemory=60g 
mhapBlockSize=500 
ovlMerDistinct=0.975
corMaxEvidenceErate=0.15 
purgeOverlaps=aggressive
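
For context, the command that consumes this spec file has roughly this form (a sketch; the run directory and raw-read file names are placeholders, while -p PacBio, genomeSize, and the spec file name are the ones used in this run):

canu -p PacBio -d <run_directory> genomeSize=2.5g \
  -pacbio-raw <raw_reads.fastq.gz> \
  -s specFile-canu.txt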

Now, it was able to finish 'cormhap', 'ovB', and 'ovS', using markedly less storage space, in less than 3 days. Then it failed at the 'cor' step, after 2 tries.

[...]
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'cor' concurrent execution on Mon Sep 23 10:59:35 2019 with 21376.635 GB free disk space (103 processes; 36 concurrently)

    cd correction/2-correction
    ./correctReads.sh 1 > ./correctReads.000001.out 2>&1
    ./correctReads.sh 2 > ./correctReads.000002.out 2>&1
[...]

-- Read correction jobs failed, tried 2 times, giving up.
--   job 2-correction/results/0001.cns FAILED.
--   job 2-correction/results/0002.cns FAILED.
--   job 2-correction/results/0003.cns FAILED.
[...]
--   job 2-correction/results/0103.cns FAILED.
--

ABORT:
ABORT: Canu snapshot v1.8 +358 changes (r9568 c478f9e441e76326ea1435adedf7f01997bbd32f)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

real    3883m7.381s
user    390986m14.224s
sys     20368m20.326s

This time the error was not related to the storage quota. The only error messages I could find were recorded in the correctReads output files and in correction/2-correction/results/.

Files correction/2-correction/results/0001.err up to 0067.err contain:

-- Opening seqStore '/scr/nicole-7015/PacBio.seqStore'.
sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
Failed to open '/scr/nicole-7015/PacBio.seqStore/reads-rawu' for reading: No such file or directory

while files from correction/2-correction/results/0068.err to 0103.err say they cannot even find the store:

-- Opening seqStore '/scr/nicole-7015/PacBio.seqStore'.
sqStore()--  failed to open '/scr/nicole-7015/PacBio.seqStore' for read-only access: store doesn't exist.

Files correction/2-correction/correctReads.000001.out (up to correctReads.000069.out) contain:

Found perl:
   /usr/bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_191"

Found canu:
   /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/canu
   Canu snapshot v1.8 +358 changes (r9568 c478f9e441e76326ea1435adedf7f01997bbd32f)

Running job 1 based on command line options.
Start copy at Seg Set 23 11:03:52 -03 2019
Finished at Seg Set 23 11:05:09 -03 2019
rm: cannot remove ‘/scr/nicole-7015/PacBio.seqStore’: Directory not empty
rmdir: failed to remove ‘/scr/nicole-7015’: Directory not empty

Files from correctReads.000070.out to correctReads.000103.out (the last) differ slightly, including copy errors such as these:

[...]
Running job 103 based on command line options.
Start copy at Seg Set 23 11:07:18 -03 2019
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0010’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0011’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0012’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0013’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0014’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0015’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0016’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0017’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7015/PacBio.seqStore/blobs.0018’: No such file or directory
Finished at Seg Set 23 11:07:46 -03 2019
rmdir: failed to remove ‘/scr/nicole-7015’: No such file or directory

I'm not sure whether this is enough to diagnose what is going wrong, but removing the stageDirectory itself does not seem like the desired action, given that it is shared by all the other concurrent processes.

I found exactly this action coded in the script correction/2-correction/correctReads.sh:

[...]
seqStore="../../PacBio.seqStore"

if [ ! -d /scr/${USER}-${SLURM_JOB_ID} ] ; then
  mkdir -p /scr/${USER}-${SLURM_JOB_ID}
fi

mkdir -p /scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore

echo Start copy at `date`
cp -p $seqStore/info      /scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore/info
cp -p $seqStore/libraries /scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore/libraries
cp -p $seqStore/reads     /scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore/reads
cp -p $seqStore/blobs.*   /scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore/
echo Finished   at `date`

seqStore="/scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore"

$bin/falconsense \
  -S $seqStore \
  -C ../PacBio.corStore \
  -R ./PacBio.readsToCorrect \
  -r $bgnid-$endid \
  -t  4 \
  -cc 4 \
  -cl 1000 \
  -oi 0.7 \
  -ol 500 \
  -p ./results/$jobid.WORKING \
  -cns \
  > ./results/$jobid.err 2>&1 \
&& \
mv ./results/$jobid.WORKING.cns ./results/$jobid.cns \

rm -rf /scr/${USER}-${SLURM_JOB_ID}/PacBio.seqStore
rmdir  /scr/${USER}-${SLURM_JOB_ID}

exit 0

Maybe this is a problem of running all concurrent processes on the same compute node. According to this code, each process attempts to create, write to, and remove the same single directory. If I had one process per node, I would have one stageDirectory per process, and it would be created and removed at each step without affecting the other processes.

One possible workaround could be to include the partition number (local variable $jobid) in the directory name, as follows:

if [ ! -d /scr/${USER}-${SLURM_JOB_ID}-${jobid} ] ; then
  mkdir -p /scr/${USER}-${SLURM_JOB_ID}-${jobid}
fi

mkdir -p /scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore

echo Start copy at `date`
cp -p $seqStore/info      /scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore/info
cp -p $seqStore/libraries /scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore/libraries
cp -p $seqStore/reads     /scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore/reads
cp -p $seqStore/blobs.*   /scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore/
echo Finished   at `date`

seqStore="/scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore"

$bin/falconsense \
  -S $seqStore \
  -C ../PacBio.corStore \
  -R ./PacBio.readsToCorrect \
  -r $bgnid-$endid \
  -t  4 \
  -cc 4 \
  -cl 1000 \
  -oi 0.7 \
  -ol 500 \
  -p ./results/$jobid.WORKING \
  -cns \
  > ./results/$jobid.err 2>&1 \
&& \
mv ./results/$jobid.WORKING.cns ./results/$jobid.cns \

rm -rf /scr/${USER}-${SLURM_JOB_ID}-${jobid}/PacBio.seqStore
rmdir  /scr/${USER}-${SLURM_JOB_ID}-${jobid}

I edited the script and ran canu -correct again, commenting out the removal lines so I could inspect the temporary files. Again, I found a similar error in correction/2-correction/results/0001.err:

-- Opening seqStore '/scr/nicole-7029-0001/PacBio.seqStore'.
sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
Failed to open '/scr/nicole-7029-0001/PacBio.seqStore/reads-rawu' for reading: No such file or directory

I looked at the files in this directory, and no file or directory named reads-rawu exists:

$ ls -hs /scr/nicole-7029-0001/PacBio.seqStore/
total 19G
1,1G blobs.0000  1,1G blobs.0006  1,1G blobs.0012  134M blobs.0018
1,1G blobs.0001  1,1G blobs.0007  1,1G blobs.0013  4,0K info
1,1G blobs.0002  1,1G blobs.0008  1,1G blobs.0014  4,0K libraries
1,1G blobs.0003  1,1G blobs.0009  1,1G blobs.0015   79M reads
1,1G blobs.0004  1,1G blobs.0010  1,1G blobs.0016
1,1G blobs.0005  1,1G blobs.0011  1,1G blobs.0017

It may be a typo somewhere else in the code.

I'm stuck here and do not know what to try next. It may be something that has nothing to do with anything I have tried.

If you have ever seen something similar, please let me know. Thank you in advance!

skoren commented 5 years ago

I don't think you're on the 1.9 branch; what does canu -version report? It should have 1.9 in the name; if not, you're on the tip branch instead. If you are on tip, make sure you run:

git checkout v1.9
rm -r Linux-amd64
make

and restart an assembly from scratch.

BioNica commented 5 years ago

Thank you @skoren for your observation. Indeed, I was using Canu snapshot v1.8 +358 instead of v1.9 (lack of experience with git). I have now installed v1.9 following your instructions. I will start it over again and see how far we get now.

BioNica commented 4 years ago

Dear @skoren,

As you suggested, I installed a new snapshot of v1.9 [Canu branch v1.9 +372 changes (r9582 aed1d9016cb37eff92303bbae85a729130ebe385)] and restarted the assembly from scratch with the same parameters. Again, it failed at the same step. I restarted it, and it still shows the same error messages.

[...]
-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'cor' concurrent execution on Tue Oct 15 17:33:24 2019 with 21082.496 GB free disk space (103 processes; 36 concurrently)

    cd correction/2-correction
    ./correctReads.sh 1 > ./correctReads.000001.out 2>&1
[...]
    ./correctReads.sh 103 > ./correctReads.000103.out 2>&1

-- Finished on Tue Oct 15 17:37:03 2019 (219 seconds) with 21082.496 GB free disk space
----------------------------------------
--
-- Read correction jobs failed, tried 2 times, giving up.
--   job 2-correction/results/0001.cns FAILED.
--   job 2-correction/results/0002.cns FAILED.
[...]
--   job 2-correction/results/0101.cns FAILED.
--   job 2-correction/results/0102.cns FAILED.
--   job 2-correction/results/0103.cns FAILED.
--

ABORT:
ABORT: Canu branch v1.9 +372 changes (r9582 aed1d9016cb37eff92303bbae85a729130ebe385)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

And the logs indicate the same problem with removing the temporary files.

$ cat correction/2-correction/correctReads.000001.out

Found perl:
   /usr/bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_191"

Found canu:
   /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/canu
   Canu branch v1.9 +372 changes (r9582 aed1d9016cb37eff92303bbae85a729130ebe385)

Running job 1 based on command line options.
Start copy at Ter Out 15 17:33:25 -03 2019
Finished at Ter Out 15 17:35:01 -03 2019
rm: cannot remove ‘/scr/nicole-7342/PacBio.seqStore’: Directory not empty
rmdir: failed to remove ‘/scr/nicole-7342’: Directory not empty

$ cat correction/2-correction/correctReads.000050.out 
[...]
Running job 50 based on command line options.
Start copy at Ter Out 15 17:34:57 -03 2019
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0012’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0013’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0014’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0015’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0016’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0017’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/blobs.0018’: No such file or directory
Finished at Ter Out 15 17:36:28 -03 2019
rmdir: failed to remove ‘/scr/nicole-7342’: No such file or directory

$ cat correction/2-correction/correctReads.000060.out 
[...]
Running job 60 based on command line options.
Start copy at Ter Out 15 17:35:22 -03 2019
cp: target ‘/scr/nicole-7342/PacBio.seqStore/’ is not a directory
Finished at Ter Out 15 17:35:22 -03 2019
rmdir: failed to remove ‘/scr/nicole-7342’: No such file or directory

$ cat correction/2-correction/correctReads.000103.out 
[...]
Running job 103 based on command line options.
Start copy at Ter Out 15 17:36:55 -03 2019
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/libraries’: No such file or directory
cp: cannot create regular file ‘/scr/nicole-7342/PacBio.seqStore/reads’: No such file or directory
cp: target ‘/scr/nicole-7342/PacBio.seqStore/’ is not a directory
Finished at Ter Out 15 17:36:55 -03 2019
rmdir: failed to remove ‘/scr/nicole-7342’: No such file or directory

As before, the script correctReads.sh removes the shared temporary directory, affecting the concurrent jobs on the same node. Maybe, if the directory tree included a temporary sub-directory for each process (say 000001 to 000103), and only that sub-directory were deleted at the end of each process, this error could be avoided.

skoren commented 4 years ago

Canu is designed assuming the scratch directory is per-process, not per-node, so each job creates and removes the folder itself. This is documented here: https://canu.readthedocs.io/en/latest/parameter-reference.html#file-staging.

The problem is that you're running locally because you've set useGrid=false. Staging is designed for multiple jobs running on different hosts, which avoids concurrent access to a slow NFS system. In your case, you should run without staging (you can run Canu completely within scratch on the node, or manually copy the seqStore to scratch and symlink to it). Better yet, don't use useGrid=false, or it will take a while (weeks) to assemble a large genome.
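
For instance, the copy-and-symlink option could be a couple of commands like these (a sketch only; /scr/$USER and /path/to/assembly are placeholders for your node-local scratch and your NFS assembly directory):

cp -rp /path/to/assembly/PacBio.seqStore /scr/$USER/PacBio.seqStore
mv     /path/to/assembly/PacBio.seqStore /path/to/assembly/PacBio.seqStore.nfs
ln -s  /scr/$USER/PacBio.seqStore        /path/to/assembly/PacBio.seqStore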

BioNica commented 4 years ago

Thank you for the explanation @skoren. Our cluster has a heterogeneous infrastructure and is shared by several groups. To avoid monopolizing all available nodes, I started the run on the fat node alone. To avoid the stage directory and run completely within scratch, I could enter the node with srun in an interactive shell and then start Canu directly, without sbatch, hoping the 3 TB of scratch is enough for the run. Would you recommend another method? Are there configuration options to limit the number of nodes Canu can recruit?
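
For example, something along these lines is what I have in mind (just a sketch; the node name and read file are placeholders, and the resource values simply match this fat node):

srun -w <fat_node_name> --nodes=1 --cpus-per-task=144 --mem=3000G --pty bash -i
cd /scr/nicole
canu -p PacBio -d <run_directory> genomeSize=2.5g \
  -pacbio-raw <raw_reads.fastq.gz> \
  -s specFile-canu.txt    # with stageDirectory removed from the spec file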

skoren commented 4 years ago

Re scratch: is there a reason you can't hold scratch through sbatch? If you have 3 TB of scratch it should be sufficient for the assembly to run. I'd just suggest running it in stages (-correct, -trim, -assemble) and saving the results from each at the end, so you don't lose the files when sbatch releases the node.
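
A rough sketch of that staged flow (directory and storage paths are illustrative; the intermediate file names follow Canu's usual outputs):

canu -correct  -p PacBio -d /scr/$USER/asm genomeSize=2.5g -pacbio-raw <raw_reads.fastq.gz> -s specFile-canu.txt
cp /scr/$USER/asm/PacBio.correctedReads.fasta.gz /permanent/storage/

canu -trim     -p PacBio -d /scr/$USER/asm genomeSize=2.5g -pacbio-corrected /permanent/storage/PacBio.correctedReads.fasta.gz -s specFile-canu.txt
cp /scr/$USER/asm/PacBio.trimmedReads.fasta.gz /permanent/storage/

canu -assemble -p PacBio -d /scr/$USER/asm genomeSize=2.5g -pacbio-corrected /permanent/storage/PacBio.trimmedReads.fasta.gz -s specFile-canu.txt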

Re limiting nodes: yes, you can add gridEngineResourceOption="--cpus-per-task=THREADS%XX --mem-per-cpu=MEMORY", which will limit concurrent jobs. You can also add maxThreads=XX to limit the max cores per job. If you set maxThreads=16 gridEngineResourceOption="--cpus-per-task=THREADS%10 --mem-per-cpu=MEMORY" you'd limit Canu to 160 cores.
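
In spec-file form that example would be roughly (16 and 10 are just the limits from the example above; the uppercase THREADS and MEMORY are templates Canu fills in when submitting jobs):

maxThreads=16
gridEngineResourceOption="--cpus-per-task=THREADS%10 --mem-per-cpu=MEMORY"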

BioNica commented 4 years ago

In my first try, canu produced more than 5 TB during correction before dying. At the moment (2-correction stage) my storage consumption is:

$ du -h --max-depth=2 
3,9M    ./canu-logs
0   ./canu-scripts
2,9G    ./correction/0-mercounts
72M ./correction/1-overlapper
1,5T    ./correction/PacBio.ovlStore
391M    ./correction/2-correction
6,3G    ./correction/PacBio.corStore
1,5T    ./correction
19G ./PacBio.seqStore
1,6T    .

If I understand your suggestion, I could copy ./PacBio.seqStore to scratch and rerun (directly in scratch) from the step where it stopped, right? All the other directories I could symlink into the scratch working directory:

PacBio.seqStore
PacBio.seqStore.err
PacBio.seqStore.ssi
PacBio.report -> /data/results/Canu_assembly/PacBio.report
canu-logs -> /data/results/Canu_assembly/canu-logs
canu-scripts -> /data/results/Canu_assembly/canu-scripts
correction -> /data/results/Canu_assembly/correction
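
A sketch of how I would create this layout on scratch (the paths are taken from the listing above; the scratch working directory name is just the one I intend to use):

mkdir -p /scr/nicole/PacBio_v1.9 && cd /scr/nicole/PacBio_v1.9
cp -rp /data/results/Canu_assembly/PacBio.seqStore     .
cp -p  /data/results/Canu_assembly/PacBio.seqStore.err .
cp -p  /data/results/Canu_assembly/PacBio.seqStore.ssi .
ln -s  /data/results/Canu_assembly/PacBio.report  PacBio.report
ln -s  /data/results/Canu_assembly/canu-logs      canu-logs
ln -s  /data/results/Canu_assembly/canu-scripts   canu-scripts
ln -s  /data/results/Canu_assembly/correction     correction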

I will try again.

skoren commented 4 years ago

Right, but your previous out-of-space error was before you added the purging and additional mhap options; it fits into 1.5 TB now, and this is the peak space usage. You could actually remove the ovlStore completely (replace it with an empty directory) and move everything to scratch, which will certainly fit in 3 TB of space.

Remove the correction.sh script in 2-results and re-launch in scratch with the same canu command but omit the staging options; that will write the correction results to scratch directly and generate a correctedReads.fasta.gz file.
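
A sketch of that clean-up, run from the assembly directory and assuming the layout from the du output above (in this run the generated script is correction/2-correction/correctReads.sh):

# per the suggestion above, the bulky overlap store can be replaced with an empty directory
rm -rf correction/PacBio.ovlStore && mkdir correction/PacBio.ovlStore
# remove the generated shell script so it is rebuilt without the staging commands
rm correction/2-correction/correctReads.sh
# then re-launch the same canu command from scratch, with stageDirectory removed from the spec file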

skoren commented 4 years ago

The issue is staging-related, not purging-related; staging requires unique folder names to work. Resolve it either by running in scratch space on a single node or by running with staging across cluster nodes.

The original issue of excessive disk usage seems resolved with the updated parameters and purge option (>5 TB down to <1.5 TB).

BioNica commented 4 years ago

Dear @skoren, sorry for reopening this thread; I wanted to give some feedback. Following your advice, I restarted Canu from inside the compute node, running directly on scratch. It was able to finish 'cor', but then it died on low read coverage, which I found really strange, given that coverage was 30x at the beginning.

-- In 'PacBio.seqStore', found PacBio reads:
--   Raw:        5159463
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'PacBio' in '/scr/nicole/BrachyPacBio_v1.9'
--
-- Parameters:
--
--  genomeSize        2500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
-- Set corMinCoverage=4 based on read coverage of 30.
-- Using overlaps no worse than 0.3 fraction error for correcting reads (from corErrorRate parameter).
--
-- Configuring correction jobs:
--   Jobs limited to 24 GB per job (via option corMemory).
--   Reads estimated to need at most 1.67 GB for computation.
--   Leaving 22.33 GB memory for read data.
----------------------------------------

[...]
    ./correctReads.sh 102 > ./correctReads.000102.out 2>&1
    ./correctReads.sh 103 > ./correctReads.000103.out 2>&1

-- Finished on Tue Oct 22 17:41:53 2019 (73764 seconds) with 22404.295 GB free disk space
----------------------------------------
-- Found 103 read correction output files.
-- No change in report.
-- Finished stage 'cor-generateCorrectedReadsCheck', reset canuIteration.
-- Found 103 read correction output files.
-- No change in report.
-- Finished stage 'cor-generateCorrectedReadsCheck', reset canuIteration.
--
-- Loading corrected reads into corStore and seqStore.
----------------------------------------
-- Starting command on Tue Oct 22 17:41:53 2019 with 22404.294 GB free disk space

    cd correction
    /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/loadCorrectedReads \
      -S ../PacBio.seqStore \
      -C ./PacBio.corStore \
      -L ./2-correction/corjob.files \
    >  ./PacBio.loadCorrectedReads.log \
    2> ./PacBio.loadCorrectedReads.err

-- Finished on Tue Oct 22 17:50:36 2019 (523 seconds) with 22359.684 GB free disk space
----------------------------------------
--
-- ERROR:  Read coverage (0) is too low to be useful.
-- ERROR:
-- ERROR:  This could be caused by an incorrect genomeSize or poor quality reads that could not
-- ERROR:  be sufficiently corrected.
-- ERROR:
-- ERROR:  You can force Canu to continue by decreasing parameter stopOnLowCoverage=10,
-- ERROR:  however, the quality of corrected reads and/or contiguity of contigs will be poor.
-- 

Could I have lost important information by deleting the previous ovlStore?

skoren commented 4 years ago

Nope, the ovlStore wouldn't affect this; it seems more like the corrected read loading failed, and the 0 coverage is indeed suspicious. Do you have the PacBio.loadCorrectedReads.* log files? If so, post them here. Do you still have the 2-correction/results folder from before you copied to scratch, and what are the file sizes in there? Also, what does sqStoreDumpMetaData -r 1-10 -S *.seqStore report?

BioNica commented 4 years ago

Hello @skoren ,

I still have the PacBio.loadCorrectedReads.* files. The PacBio.loadCorrectedReads.log is empty and the PacBio.loadCorrectedReads.err contains the line:

Failed to open './2-correction/corjob.files' for reading: No such file or directory

But there was a PacBio_v1.9/correction/2-correction/corjob.files, containing:

2-correction/results/0001.cns
2-correction/results/0002.cns
2-correction/results/0003.cns
2-correction/results/0004.cns
[...]
2-correction/results/0100.cns
2-correction/results/0101.cns
2-correction/results/0102.cns
2-correction/results/0103.cns

My 2-correction/results points to the storage and has been overwritten :-( Still, the files have been created again:

total 124G
1,2G 0001.cns
3,8M 0001.err
1,1G 0002.cns
3,8M 0002.err
[...]
1,2G 0101.cns
3,8M 0101.err
1,2G 0102.cns
3,8M 0102.err
336M 0103.cns
1,1M 0103.err

The contents of the files:

$ head  -20 PacBio_v1.9/correction/2-correction/results/0103.err 
-- Opening seqStore '../../PacBio.seqStore'.
sqCache: found 5159463 raw reads with 76446968024 bases.
-- Opening corStore '../PacBio.corStore' version 1.
-- Loading list of reads to process from './PacBio.readsToCorrect'.
    read    read evidence     corrected
      ID  length    reads       regions
-------- ------- -------- ------------- ...
 5146233   19467       16   4167-13971 ( 11970) memory act  156925952 est  700963148 act/est 22.39
 5146234   14023       33    883-12248 ( 12412) memory act    3522560 est  654148580 act/est 0.54
 5146235   11298       19    228-11219 ( 11054) memory act    1261568 est  628194188 act/est 0.20
 5146236    8308       28      1-8281  (  8294) memory act     462848 est  602031200 act/est 0.08
 5146237   28210       36    820-8127     8553-10678   13820-28783 ( 15031) memory act   78774272 est  783101822 act/est 10.06
 5146238   29644       65      0-29528 ( 29631) memory act   20582400 est  804353504 act/est 2.56
 5146239    2476        7      0-0     (   108) memory act      20480 est  546548000 act/est 0.00
 5146240   20878       41      0-20810 ( 20865) memory act    1671168 est  720499694 act/est 0.23
 5146241   41816       74      0-41250 ( 41803) memory act  113299456 est  917618840 act/est 12.35
 5146242   19822       27     30-19587 ( 19779) memory act     233472 est  705593318 act/est 0.03
 5146243    4329        6     41-907   (   875) memory act          0 est  563229956 act/est 0.00
 5146245    3857       17      0-0     (   143) memory act          0 est  558937250 act/est 0.00
 5146246   40051       51    943-36769 ( 39243) memory act    6021120 est  898384694 act/est 0.67
[...]

The .cns files are binary.

Does this information help?

skoren commented 4 years ago

That is certainly very strange. The missing file just lists the cns files. What does ls -lha of the PacBio_v1.9/PacBio.seqStore folder give?

Once we figure out the files there, you can try manually backing up and re-trying the step. It'll be something like the following (but don't run it until we determine, from the above, which files to remove):

ls -lha `pwd`/PacBio_v1.9/correction/2-correction/results/*.cns |awk '{print $NF}' > `pwd`/PacBio_v1.9/correction/2-correction/corjob.files
mv PacBio_v1.9/PacBio.seqStore/version1/* PacBio_v1.9/PacBio.seqStore/
rm  PacBio_v1.9/PacBio.seqStore/<extraneous files made by the previous attempt>
cd PacBio_v1.9/correction
    /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/loadCorrectedReads \
      -S ../PacBio.seqStore \
      -C ./PacBio.corStore \
      -L ./2-correction/corjob.files     >  ./PacBio.loadCorrectedReads.log \
    2> ./PacBio.loadCorrectedReads.err
 /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/sqStoreDumpMetaData -S  ../PacBio.seqStore -reads -r 1-10

BioNica commented 4 years ago

Dear @skoren , sorry for taking so long to answer. I was not here last week.

Apparently, it was not using the folder I copied to scratch, because that folder has not been updated:

$ ls -hla /scr/nicole/PacBio_v1.9/PacBio.seqStore
total 19G
drwxrwxr-x 2 nicole nicole 4,0K Out 11 19:33 .
drwxrwxr-x 3 nicole nicole  191 Out 22 17:41 ..
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:20 blobs.0000
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:21 blobs.0001
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:22 blobs.0002
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:22 blobs.0003
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:23 blobs.0004
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:24 blobs.0005
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:24 blobs.0006
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:25 blobs.0007
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:26 blobs.0008
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:26 blobs.0009
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:27 blobs.0010
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:28 blobs.0011
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:29 blobs.0012
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:29 blobs.0013
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:30 blobs.0014
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:31 blobs.0015
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:31 blobs.0016
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:32 blobs.0017
-rw-rw-r-- 1 nicole nicole 228M Out 11 19:32 blobs.0018
-rw-rw-r-- 1 nicole nicole  89M Out 11 19:32 errorLog
-rw-rw-r-- 1 nicole nicole   88 Out 11 19:32 info
-rw-rw-r-- 1 nicole nicole  459 Out 11 19:32 info.txt
-rw-rw-r-- 1 nicole nicole  336 Out 11 19:32 libraries
-rw-rw-r-- 1 nicole nicole  198 Out 11 19:32 libraries.txt
-rw-rw-r-- 1 nicole nicole  368 Out 11 19:32 load.dat
-rw-rw-r-- 1 nicole nicole  28M Out 11 19:33 readlengths-cor.dat
-rw-rw-r-- 1 nicole nicole  377 Out 11 19:33 readlengths-cor.gp
-rw-rw-r-- 1 nicole nicole  20K Out 11 19:33 readlengths-cor.png
-rw-rw-r-- 1 nicole nicole 240M Out 11 19:32 readNames.txt
-rw-rw-r-- 1 nicole nicole 197M Out 11 19:32 reads

Now I followed your instructions above and rebuilt corjob.files with absolute paths. I ran the command passing the absolute paths to the files on scratch, instead of relative paths:

 nohup /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/loadCorrectedReads  \
 -S /scr/nicole/PacBio_v1.9/PacBio.seqStore  \
 -C /scr/nicole/PacBio_v1.9/correction/PacBio.corStore  \
 -L /scr/nicole/PacBio_v1.9/correction/2-correction/corjob.files > /scr/nicole/PacBio_v1.9/correction/PacBio.loadCorrectedReads.log  \
 2> /scr/nicole/PacBio_v1.9/correction/PacBio.readsToCorrect.err  |& tee log_Canu_scratch_20191104.log 

As a result:

$  /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/sqStoreDumpMetaData -S  PacBio.seqStore -reads -r 1-10
    readID  libraryID     seqLen     rawLen     corLen   clearBgn   clearEnd  segm        byte  part      flags
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----- ----------- ----- ----------
         1          1          0       1058          0          0          0     0           0    0        -C-
         2          1          0       2255          0          0          0     0         352    0        -C-
         3          1          0       2119          0          0          0     0        1000    0        -C-
         4          1       2342       2411       2342          0          0    19           0    0        -C-
         5          1       3249       4274       3249          0          0    19        1300    0        -C-
         6          1          0       1396          0          0          0     0        3464    0        -C-
         7          1          0       2118          0          0          0     0        3900    0        -C-
         8          1       6329       6892       6329          0          0    19        3292    0        -C-
         9          1       3574       4069       3574          0          0    19        6704    0        -C-
        10          1       4571       5599       4571          0          0    19        8724    0        -C-

Is this information helpful for understanding what happened?

Thanks for your help and sorry for the delay.

skoren commented 4 years ago

That looks like corrected reads were successfully added to the store. What are the contents of /scr/nicole/PacBio_v1.9/PacBio.seqStore now? You should be able to restart Canu with the same command you used before, with the same folder, and it will resume where it left off.

BioNica commented 4 years ago

Hi @skoren ,

Now the folder contains files up to blobs.0050:

$ ls -lha PacBio.seqStore
total 50G
drwxrwxr-x 3 nicole nicole 4,0K Nov  4 19:46 .
drwxrwxr-x 3 nicole nicole  191 Out 22 17:41 ..
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:20 blobs.0000
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:21 blobs.0001
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:22 blobs.0002
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:22 blobs.0003
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:23 blobs.0004
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:24 blobs.0005
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:24 blobs.0006
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:25 blobs.0007
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:26 blobs.0008
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:26 blobs.0009
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:27 blobs.0010
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:28 blobs.0011
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:29 blobs.0012
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:29 blobs.0013
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:30 blobs.0014
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:31 blobs.0015
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:31 blobs.0016
-rw-rw-r-- 1 nicole nicole 1,1G Out 11 19:32 blobs.0017
-rw-rw-r-- 1 nicole nicole 228M Out 11 19:32 blobs.0018
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:41 blobs.0019
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:41 blobs.0020
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:41 blobs.0021
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:41 blobs.0022
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:41 blobs.0023
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:42 blobs.0024
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:42 blobs.0025
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:42 blobs.0026
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:42 blobs.0027
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:42 blobs.0028
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:42 blobs.0029
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:43 blobs.0030
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:43 blobs.0031
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:43 blobs.0032
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:43 blobs.0033
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:43 blobs.0034
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:43 blobs.0035
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:44 blobs.0036
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:44 blobs.0037
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:44 blobs.0038
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:44 blobs.0039
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:44 blobs.0040
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:44 blobs.0041
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:45 blobs.0042
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:45 blobs.0043
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:45 blobs.0044
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:45 blobs.0045
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:45 blobs.0046
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:46 blobs.0047
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:46 blobs.0048
-rw-rw-r-- 1 nicole nicole 1,1G Nov  4 19:46 blobs.0049
-rw-rw-r-- 1 nicole nicole 8,3M Nov  4 19:46 blobs.0050
-rw-rw-r-- 1 nicole nicole   88 Nov  4 19:46 info
-rw-rw-r-- 1 nicole nicole  475 Nov  4 19:46 info.txt
-rw-rw-r-- 1 nicole nicole  336 Nov  4 19:46 libraries
-rw-rw-r-- 1 nicole nicole    0 Nov  4 19:40 log_Canu_scratch_20191104.log
-rw-rw-r-- 1 nicole nicole 197M Nov  4 19:46 reads
drwxrwxr-x 2 nicole nicole   64 Nov  4 19:46 version.001

Given that this folder is on the compute node's scratch, while the others are symbolic links to the shared storage, the relative paths used in the scripts were pointing to the old folder (on the storage). Now I have a link on the storage pointing to my scratch (I know, it is weird), so I can be sure it will write to scratch.

It is running. Let's see if we get to the end of the script now ;-)

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_191' (from 'java') with -d64 support.
-- Detected gnuplot version '5.2 patchlevel 7   ' (from 'gnuplot') and image format 'png'.
-- Detected 144 CPUs and 3024 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /cm/shared/apps/slurm/17.11.8/bin/sinfo.
-- Grid engine disabled per useGrid=false option.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl     64 GB    8 CPUs x  18 jobs  1152 GB  144 CPUs  (k-mer counting)
-- Local: hap       16 GB   48 CPUs x   3 jobs    48 GB  144 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   60 GB   16 CPUs x   9 jobs   540 GB  144 CPUs  (overlap detection with mhap)
-- Local: obtovl    24 GB   16 CPUs x   9 jobs   216 GB  144 CPUs  (overlap detection)
-- Local: utgovl    24 GB   16 CPUs x   9 jobs   216 GB  144 CPUs  (overlap detection)
-- Local: cor       24 GB    4 CPUs x  36 jobs   864 GB  144 CPUs  (read correction)
-- Local: ovb        4 GB    1 CPU  x 144 jobs   576 GB  144 CPUs  (overlap store bucketizer)
-- Local: ovs       32 GB    1 CPU  x  94 jobs  3008 GB   94 CPUs  (overlap store sorting)
-- Local: red       48 GB    8 CPUs x  18 jobs   864 GB  144 CPUs  (read error detection)
-- Local: oea        8 GB    1 CPU  x 144 jobs  1152 GB  144 CPUs  (overlap error adjustment)
-- Local: bat      512 GB   32 CPUs x   1 job    512 GB   32 CPUs  (contig construction with bogart)
-- Local: cns      --- GB    8 CPUs x --- jobs   --- GB  --- CPUs  (consensus)
-- Local: gfa       64 GB   32 CPUs x   1 job     64 GB   32 CPUs  (GFA alignment and processing)
--
-- In 'PacBio.seqStore', found PacBio reads:
--   Raw:        5159463
--   Corrected:  3694042
--   Trimmed:    0
--
-- Generating assembly 'PacBio' in '/scr/nicole/PacBio_v1.9'
--
-- Parameters:
--
--  genomeSize        2500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
----------------------------------------
-- Starting command on Tue Nov  5 10:33:19 2019 with 2820.017 GB free disk space

    cd .
    /data04/tools/PacificBiosciences/canu-1.9/canu/Linux-amd64/bin/sqStoreDumpFASTQ \
      -corrected \
      -S ./PacBio.seqStore \
      -o ./PacBio.correctedReads.gz \
      -fasta \
      -nolibname \
    > PacBio.correctedReads.fasta.err 2>&1

-- Finished on Tue Nov  5 10:51:27 2019 (1088 seconds) with 2801.114 GB free disk space
----------------------------------------
--
-- Corrected reads saved in 'PacBio.correctedReads.fasta.gz'.
-- No change in report.
-- Finished stage 'cor-dumpCorrectedReads', reset canuIteration.
--
-- Bye.

[Update] It worked! 19G PacBio.correctedReads.fasta.gz

Canu -correct finished. Now it is running -trim

### executing:  canu  -trim -p PacBio -d PacBio_v1.9 genomeSize=2.5g -pacbio-corrected PacBio_v1.9/PacBio.correctedReads.fasta.gz  -s specFile-canu.txt

skoren commented 4 years ago

OK, I'd suggest either keeping everything on scratch or everything on the NFS system. I expect the I/O load in trimming will be lower.

BioNica commented 4 years ago

Dear @skoren, thanks for your help! After the adjustments, canu -trim and canu -assemble finished successfully and very quickly!