richelbilderbeek / bbbq_article

The Bianchi Bilderbeek Bogaart Question answered
GNU General Public License v3.0

bbbq_2: run on Peregrine on full proteome #63

Closed by richelbilderbeek 4 years ago

richelbilderbeek commented 4 years ago

Locally, I can run bbbq_2 on only part of the proteome because of memory errors. Get it to run on the full proteome on Peregrine by allocating enough memory.

richelbilderbeek commented 4 years ago
p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_ctc.R 
Submitted batch job 13476311
p230198@peregrine:bbbq_2 sbatch --dependency=afterany:13476311 ../../peregrine/scripts/email_me.sh 
Submitted batch job 13476312

1GB of memory:

p230198@peregrine:bbbq_2 cat run_r_script.sh 
#!/bin/bash
# Bash script to run an R script with sbatch
#
# Usage:
#
#   sbatch run_r_script.sh my_r_script.R
#
#SBATCH --time=10:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --job-name=run_r_script
#SBATCH --output=run_r_script_%j.log
module load R
echo "Rscript $@"
Rscript "$@"
richelbilderbeek commented 4 years ago

The downloaded proteomes need to be present. Try again:

p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_ctc.R 
Submitted batch job 13476343
p230198@peregrine:bbbq_2 sbatch --dependency=afterany:13476343 ../../peregrine/scripts/email_me.sh 
Submitted batch job 13476344
richelbilderbeek commented 4 years ago

10 GB is not enough, use 100 GB:

p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_ctc.R 
Submitted batch job 13477367
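
The memory request does not have to be edited inside run_r_script.sh for every attempt: options given on the sbatch command line take precedence over the #SBATCH lines in the script. A minimal sketch follows; the helper name submit_with_mem and the DRY_RUN switch are made up for illustration and are not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): submit run_r_script.sh
# with a memory request given on the command line. Command-line options
# override the "#SBATCH --mem=..." line inside the script.
submit_with_mem() {
  mem="$1"
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show the command that would be run
    echo "sbatch --mem=$mem run_r_script.sh $*"
  else
    sbatch --mem="$mem" run_r_script.sh "$@"
  fi
}

# Show the command for a 100 GB attempt without submitting anything
DRY_RUN=1 submit_with_mem 100G create_ctc.R
```

Without DRY_RUN=1 the helper submits the job for real, so it only works on a machine where sbatch is available.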
richelbilderbeek commented 4 years ago

1 TB of memory is not enough:

p230198@peregrine:bbbq_2 cat run_r_script_13477562.log
Rscript create_ctc.R
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

use default substitution matrix
Could not allocate a distance matrix for 49137 seqs. Need to terminate program.
Error in msaFun(inputSeqs = inputSeqs, cluster = cluster, gapOpening = gapOpening,  : 
  std::bad_alloc
Calls: <Anonymous> -> <Anonymous> -> msaFun
Execution halted

###############################################################################
Peregrine Cluster
Job 13477562 for user 'p230198'
Finished at: Mon Aug 31 23:38:43 CEST 2020

Job details:
============

Name                : run_r_script
User                : p230198
Partition           : himem
Nodes               : pg-memory05
Cores               : 1
State               : FAILED
Submit              : 2020-08-31T12:59:26
Start               : 2020-08-31T23:36:41
End                 : 2020-08-31T23:38:42
Reserved walltime   : 10:00:00
Used walltime       : 00:02:01
Used CPU time       : 00:01:54 (efficiency: 94.34%)
% User (Computation): 96.29%
% System (I/O)      :  3.71%
Mem reserved        : 1000G/node
Max Mem used        : 1.83G (pg-memory05)
Max Disk Write      : 102.40K (pg-memory05)
Max Disk Read       : 2.80M (pg-memory05)

Acknowledgements:
=================

Please see this page for information about acknowledging Peregrine in your publications:

https://wiki.hpc.rug.nl/peregrine/additional_information/scientific_output

################################################################################

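I suspect the std::bad_alloc comes from an allocation request with a too big (or negative) size rather than from a real shortage of memory: the job used only 1.83G of the 1000G it had reserved.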
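
A quick size check makes the "too big (or negative) value" idea plausible. The sketch below only does arithmetic on the 49137 sequences from the error message. That the distance matrix is a full n x n matrix of 8-byte cells, and that its size might be held in a signed 32-bit integer, are assumptions for illustration; the actual ClustalW internals were not inspected.

```shell
#!/bin/sh
# Back-of-the-envelope size of a distance matrix for 49137 sequences.
n=49137                  # number of sequences, from the error message
cells=$((n * n))         # assumption: a full n x n matrix
bytes=$((cells * 8))     # assumption: 8 bytes (one double) per cell
int32_max=2147483647     # largest signed 32-bit integer

echo "cells: $cells"
echo "size at 8 bytes per cell: about $((bytes / 1000000000)) GB"

if [ "$cells" -gt "$int32_max" ]; then
  # A signed 32-bit counter would wrap around to a negative number
  wrapped=$((cells - 4294967296))
  echo "cell count does not fit in a signed 32-bit int; it would wrap to $wrapped"
fi
```

Under these assumptions the matrix needs about 19 GB, far below the 1000G that was reserved, while the number of cells exceeds the 32-bit limit. That would fit an allocation failing on its requested size instead of on available memory.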

richelbilderbeek commented 4 years ago

Using 100 peptides per gene:

p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_all_ctcs.R 
Submitted batch job 13493111
richelbilderbeek commented 4 years ago

With 100 sampled sequences per gene, each job takes at most 8 minutes and uses at most 2 GB of memory:

p230198@peregrine:bbbq_2 egrep -R "Used walltime" --include=*.log
run_r_script_13493205.log:Used walltime       : 00:01:43
run_r_script_13493219.log:Used walltime       : 00:01:35
run_r_script_13493211.log:Used walltime       : 00:01:27
run_r_script_13493218.log:Used walltime       : 00:01:33
run_r_script_13493221.log:Used walltime       : 00:01:28
run_r_script_13493212.log:Used walltime       : 00:03:02
run_r_script_13493214.log:Used walltime       : 00:01:59
run_r_script_13493215.log:Used walltime       : 00:01:43
run_r_script_13493222.log:Used walltime       : 00:01:48
run_r_script_13493111.log:Used walltime       : 00:01:41
run_r_script_13493210.log:Used walltime       : 00:01:28
run_r_script_13493202.log:Used walltime       : 00:02:17
run_r_script_13493217.log:Used walltime       : 00:01:46
run_r_script_13493209.log:Used walltime       : 00:01:34
run_r_script_13493216.log:Used walltime       : 00:01:39
run_r_script_13493207.log:Used walltime       : 00:01:41
run_r_script_13493213.log:Used walltime       : 00:02:13
run_r_script_13493201.log:Used walltime       : 00:01:32
run_r_script_13493208.log:Used walltime       : 00:01:37
run_r_script_13493204.log:Used walltime       : 00:02:00
run_r_script_13493220.log:Used walltime       : 00:01:24
run_r_script_13493203.log:Used walltime       : 00:08:00
run_r_script_13493206.log:Used walltime       : 00:01:42
p230198@peregrine:bbbq_2 egrep -R "Max Mem used" --include=*.log
run_r_script_13493205.log:Max Mem used        : 1.78G (pg-node040)
run_r_script_13493219.log:Max Mem used        : 1.78G (pg-node188)
run_r_script_13493211.log:Max Mem used        : 1.03G (pg-node172)
run_r_script_13493218.log:Max Mem used        : 1.79G (pg-node007)
run_r_script_13493221.log:Max Mem used        : 1.05G (pg-node172)
run_r_script_13493212.log:Max Mem used        : 1.80G (pg-node128)
run_r_script_13493214.log:Max Mem used        : 1.78G (pg-node044)
run_r_script_13493215.log:Max Mem used        : 1.78G (pg-node040)
run_r_script_13493222.log:Max Mem used        : 1.78G (pg-node150)
run_r_script_13493111.log:Max Mem used        : 1.44G (pg-node176)
run_r_script_13493210.log:Max Mem used        : 1.05G (pg-node176)
run_r_script_13493202.log:Max Mem used        : 1.78G (pg-node057)
run_r_script_13493217.log:Max Mem used        : 1.35G (pg-node013)
run_r_script_13493209.log:Max Mem used        : 1.78G (pg-node188)
run_r_script_13493216.log:Max Mem used        : 1.78G (pg-node031)
run_r_script_13493207.log:Max Mem used        : 1.33G (pg-node013)
run_r_script_13493213.log:Max Mem used        : 1.78G (pg-node057)
run_r_script_13493201.log:Max Mem used        : 1.79G (pg-node176)
run_r_script_13493208.log:Max Mem used        : 1.78G (pg-node007)
run_r_script_13493204.log:Max Mem used        : 1.78G (pg-node044)
run_r_script_13493220.log:Max Mem used        : 1.07G (pg-node176)
run_r_script_13493203.log:Max Mem used        : 1.78G (pg-node054)
run_r_script_13493206.log:Max Mem used        : 1.78G (pg-node031)
richelbilderbeek commented 4 years ago

Now with 1k samples:

p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_all_ctcs.R 
Submitted batch job 13493526
richelbilderbeek commented 4 years ago

Results for 1k, 3GB: bbbq_2_20200902.zip

Used walltime: about 48 minutes on average (median about 14 minutes), with one job taking almost 10 hours:

p230198@peregrine:bbbq_2 egrep -R "Used walltime" --include=*.log
run_r_script_13493526.log:Used walltime       : 00:01:56
run_r_script_13493747.log:Used walltime       : 00:16:47
run_r_script_13493759.log:Used walltime       : 00:13:58
run_r_script_13493749.log:Used walltime       : 00:03:52
run_r_script_13493750.log:Used walltime       : 00:08:57
run_r_script_13493764.log:Used walltime       : 00:29:29
run_r_script_13493762.log:Used walltime       : 00:03:06
run_r_script_13493758.log:Used walltime       : 00:14:54
run_r_script_13493753.log:Used walltime       : 00:02:40
run_r_script_13493748.log:Used walltime       : 00:15:47
run_r_script_13493755.log:Used walltime       : 00:57:59
run_r_script_13493757.log:Used walltime       : 00:21:21
run_r_script_13493760.log:Used walltime       : 00:03:38
run_r_script_13493756.log:Used walltime       : 00:44:44
run_r_script_13493752.log:Used walltime       : 00:05:19
run_r_script_13493763.log:Used walltime       : 00:04:51
run_r_script_13493743.log:Used walltime       : 00:07:44
run_r_script_13493745.log:Used walltime       : 09:51:52
run_r_script_13493751.log:Used walltime       : 00:04:44
run_r_script_13493744.log:Used walltime       : 01:08:38
run_r_script_13493761.log:Used walltime       : 00:10:25
run_r_script_13493754.log:Used walltime       : 02:12:22
run_r_script_13493746.log:Used walltime       : 00:41:05
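
Summaries of such a walltime list can be computed instead of estimated by eye. The sketch below converts Slurm walltimes of the form [D-]HH:MM:SS to seconds and prints the count, mean and maximum. The function name summarise_walltimes is made up for illustration, and the three sample lines are walltimes that occur in this thread.

```shell
#!/bin/sh
# Summarise "Used walltime" lines as produced by grep on the job logs.
# Hypothetical helper, not part of the repository.
summarise_walltimes() {
  awk '
    {
      days = 0
      t = $NF                                  # last field: [D-]HH:MM:SS
      if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
      split(t, p, ":")
      s = days * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
      sum += s
      if (s > max) max = s
      n++
    }
    END { if (n > 0) printf "n=%d mean_s=%d max_s=%d\n", n, sum / n, max }
  '
}

# Three sample lines in the same format as the grep output
printf '%s\n' \
  'run_r_script_13493205.log:Used walltime       : 00:01:43' \
  'run_r_script_13493745.log:Used walltime       : 09:51:52' \
  'run_r_script_13548437.log:Used walltime       : 10-00:00:05' |
  summarise_walltimes
```

On the cluster the same function can be fed directly, for example with grep -R "Used walltime" --include='*.log' . | summarise_walltimes.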

Used memory:

p230198@peregrine:bbbq_2 egrep -R "Max Mem used" --include=*.log
run_r_script_13493526.log:Max Mem used        : 1.08G (pg-node131)
run_r_script_13493747.log:Max Mem used        : 1.78G (pg-node137)
run_r_script_13493759.log:Max Mem used        : 1.78G (pg-node124)
run_r_script_13493749.log:Max Mem used        : 1.85G (pg-node146)
run_r_script_13493750.log:Max Mem used        : 1.78G (pg-node146)
run_r_script_13493764.log:Max Mem used        : 1.78G (pg-node050)
run_r_script_13493762.log:Max Mem used        : 0.00  ()
run_r_script_13493758.log:Max Mem used        : 1.78G (pg-node131)
run_r_script_13493753.log:Max Mem used        : 1.78G (pg-node137)
run_r_script_13493748.log:Max Mem used        : 1.78G (pg-node137)
run_r_script_13493755.log:Max Mem used        : 1.78G (pg-node137)
run_r_script_13493757.log:Max Mem used        : 1.78G (pg-node146)
run_r_script_13493760.log:Max Mem used        : 1.78G (pg-node050)
run_r_script_13493756.log:Max Mem used        : 1.78G (pg-node127)
run_r_script_13493752.log:Max Mem used        : 1.78G (pg-node131)
run_r_script_13493763.log:Max Mem used        : 1.78G (pg-node050)
run_r_script_13493743.log:Max Mem used        : 1.78G (pg-node131)
run_r_script_13493745.log:Max Mem used        : 1.78G (pg-node131)
run_r_script_13493751.log:Max Mem used        : 1.78G (pg-node137)
run_r_script_13493744.log:Max Mem used        : 1.78G (pg-node131)
run_r_script_13493761.log:Max Mem used        : 1.78G (pg-node024)
run_r_script_13493754.log:Max Mem used        : 1.78G (pg-node137)
run_r_script_13493746.log:Max Mem used        : 1.85G (pg-node137)

All passed:

p230198@peregrine:bbbq_2 egrep -R "State" --include=*.log
run_r_script_13493526.log:State               : COMPLETED
run_r_script_13493747.log:State               : COMPLETED
run_r_script_13493759.log:State               : COMPLETED
run_r_script_13493749.log:State               : COMPLETED
run_r_script_13493750.log:State               : COMPLETED
run_r_script_13493764.log:State               : COMPLETED
run_r_script_13493762.log:State               : RUNNING [Note: it has finished]
run_r_script_13493758.log:State               : COMPLETED
run_r_script_13493753.log:State               : COMPLETED
run_r_script_13493748.log:State               : COMPLETED
run_r_script_13493755.log:State               : COMPLETED
run_r_script_13493757.log:State               : COMPLETED
run_r_script_13493760.log:State               : COMPLETED
run_r_script_13493756.log:State               : COMPLETED
run_r_script_13493752.log:State               : COMPLETED
run_r_script_13493763.log:State               : COMPLETED
run_r_script_13493743.log:State               : COMPLETED
run_r_script_13493745.log:State               : COMPLETED
run_r_script_13493751.log:State               : COMPLETED
run_r_script_13493744.log:State               : COMPLETED
run_r_script_13493761.log:State               : COMPLETED
run_r_script_13493754.log:State               : COMPLETED
run_r_script_13493746.log:State               : COMPLETED

The running job has finished:

p230198@peregrine:bbbq_2 head run_r_script_13493762.log
Rscript create_ctc.R NS6 1000 42
gene_name: NS6
max_n_sequences: 1000
rng_seed: 42
target_filename: NS6.csv
proteomes_filename: allprot0621.fasta

p230198@peregrine:bbbq_2 head NS6.csv 
aa,score,is_tmh
M,4952096,0
F,5918245,0
H,7898296,0
L,3960080,0
V,3960080,0
D,5954083,0
richelbilderbeek commented 4 years ago

@fransbianchi is right: just use the unique sequences, as it does give an idea on conservation.

richelbilderbeek commented 4 years ago

These are the results when sampling 1k sequences from the 45k sequences:

(Figure: scores)

(Figure: scores_boxplot)

Now, run only on unique sequences :+1:
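
Reducing a FASTA file to its unique sequences can be sketched with awk as below. This is an illustration only: the repository's R scripts presumably do this step in R, and the function name unique_fasta is made up. For each distinct sequence, the first record that carries it is kept.

```shell
#!/bin/sh
# Keep the first FASTA record for every distinct sequence.
# Hypothetical helper, not the implementation used in the repository.
unique_fasta() {
  awk '
    function emit() {
      # Print the pending record if its sequence has not been seen yet
      if (hdr != "" && !(seq in seen)) {
        seen[seq] = 1
        print hdr
        print seq
      }
    }
    /^>/ { emit(); hdr = $0; seq = ""; next }  # a header starts a new record
    { seq = seq $0 }                           # join wrapped sequence lines
    END { emit() }
  '
}

# Records a and b share a sequence; c is wrapped over two lines
printf '%s\n' '>a' 'MFHL' '>b' 'MFHL' '>c' 'MF' 'HV' | unique_fasta
```

The toy input yields two records: a with MFHL and c with MFHV. Counting the header lines of the output (grep -c '^>') gives the number of unique sequences.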

richelbilderbeek commented 4 years ago

Running the unique sequences:

p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_all_ctcs.R 
Submitted batch job 13507988

ETA: +4 hours

richelbilderbeek commented 4 years ago

There was a bug that caused only one gene (NSP1?) to be processed. Fixed and restarted:

p230198@peregrine:bbbq_2 sbatch run_r_script.sh create_all_ctcs.R 
Submitted batch job 13508016
richelbilderbeek commented 4 years ago

NSP14

>NSP14|hCoV-19/Australia/NT12/2020|2020-03-25|EPI_ISL_426900|Original|hCoV-19^^Northern territory|Human|Royal Darwin Hospital Pathology|Microbiological Diagnostic Unit Public Health Laboratory and Victorian Infectious Diseases Reference Laboratory|Schultz|Australia
AENVTGLFKDCSKVITGLHPTQAPTHLSVDTKFKTEGLCVDIPGIPKDMTYRRLISMMGFKMNYQVNGYPNMFITREEAIRHVRAWIGFDVEGCHATREAVGTNLPLQLGFSTGVNLVAVPTGYVDTPNNTDFSRVSAKPPPGDQFKHLIPLMYKGLPWNVVRIKIVQMLSDTLKNLSDRVVFVLWAHGFELTSMKYFVKIGPERTCCLCDRRATCFSTASDTYACWHHSIGFDYVYNPFMIDVQQWGFTGNLQSNHDLYCQVHGNAHVASCDAIMTRCLAVHECFVKRVDWTIEYPIIGDELKINAACRKVQHMVVKAALLADKFPVLHDIGNPKAIKCVPQADVEWKFYDAQPCSDKAYKIEELFYSYATHSDKFTDGVCLFWNCNVDRYPANSIVCRFDTRVLSNLNLPGCDGGSLYVNKHAFHTPAFDKSAFVNLKQLPFFYYSDSPCESHGKQVVSDIDYVPLKSATCITRCNLGGAVCRHHANEYRLYLDAYNMMISAGFSLWVYKQFDTYNLWNTFTRLQ

(5645 unique sequences) takes about 17.5 hours:

p230198@peregrine:bbbq_2 cat run_r_script_13548454.log
Rscript create_ctc.R NSP14
gene_name: NSP14
target_filename: NSP14.csv
proteomes_filename: allprot0621.fasta
number of unique sequences: 5645

###############################################################################
Peregrine Cluster
Job 13548454 for user 'p230198'
Finished at: Sun Sep  6 08:27:43 CEST 2020

Job details:
============

Name                : run_r_script
User                : p230198
Partition           : regular
Nodes               : pg-node184
Cores               : 1
State               : COMPLETED
Submit              : 2020-09-05T14:56:38
Start               : 2020-09-05T14:57:16
End                 : 2020-09-06T08:27:41
Reserved walltime   : 10-00:00:00
Used walltime       :    17:30:25
Used CPU time       :    17:30:09 (efficiency: 99.97%)
% User (Computation): 99.95%
% System (I/O)      :  0.05%
Mem reserved        : 30G/node
Max Mem used        : 1.97G (pg-node184)
Max Disk Write      : 81.92K (pg-node184)
Max Disk Read       : 1.07M (pg-node184)

Acknowledgements:
=================

Please see this page for information about acknowledging Peregrine in your publications:

https://wiki.hpc.rug.nl/peregrine/additional_information/scientific_output

################################################################################

Only NSP3 is running now, with 10427 unique sequences, but just as many AAs, ETA +72 hours.
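
A rough way to extrapolate such an ETA is to scale the measured runtime by the number of sequence pairs. The sketch below assumes that the runtime is dominated by an all-against-all pairwise stage, so that it grows with n(n-1)/2, and that the sequences have the same length in both runs. Both are assumptions, so the result is a lower bound at best.

```shell
#!/bin/sh
# Extrapolate runtime from 5645 to 10427 unique sequences, assuming the
# time is proportional to the number of sequence pairs.
n_small=5645      # unique sequences in the finished job
t_small=63025     # its used walltime, 17:30:25, in seconds
n_large=10427     # unique sequences in the running job

pairs_small=$((n_small * (n_small - 1) / 2))
pairs_large=$((n_large * (n_large - 1) / 2))
eta_s=$((t_small * pairs_large / pairs_small))

echo "pairs: $pairs_small versus $pairs_large"
echo "extrapolated runtime: about $((eta_s / 3600)) hours"
```

Under these assumptions the larger job needs roughly 3.4 times as long, about 60 hours, which is in the same range as the +72 hours mentioned above. Longer sequences or other stages of the pipeline would push the real runtime higher.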
richelbilderbeek commented 4 years ago

All except NSP3 (which uses obsolete values):

(Figures: scores, scores_all, scores_boxplot_all, scores_boxplot)

richelbilderbeek commented 4 years ago

After 8 days, NSP3 is still running:

Find the log file:

p230198@peregrine:bbbq_2 egrep -Rl "NSP3" --include=*.log | sort
run_r_script_13508016.log
run_r_script_13508020.log
run_r_script_13515364.log
run_r_script_13516318.log
run_r_script_13519793.log
run_r_script_13519799.log
run_r_script_13539948.log
run_r_script_13539951.log
run_r_script_13547460.log
run_r_script_13548437.log

Show the latest log file:

p230198@peregrine:bbbq_2 cat run_r_script_13548437.log
Rscript create_ctc.R NSP3
gene_name: NSP3
target_filename: NSP3.csv
proteomes_filename: allprot0621.fasta
number of unique sequences: 10427

Find the job:

p230198@peregrine:bbbq_2 q | egrep 13548437
          13548437   regular run_r_sc  p230198  R 8-19:21:24      1 pg-node184 
richelbilderbeek commented 4 years ago

Note that using ClustalOmega may increase the speed:

From https://link.springer.com/protocol/10.1007%2F978-1-62703-646-7_6:

"This [a Clustal Omega] algorithm allows very large alignment problems to be tackled very quickly, even on personal computers"

richelbilderbeek commented 4 years ago

It is still running:

p230198@peregrine:~ jobinfo 13548437
Name                : run_r_script
User                : p230198
Partition           : regular
Nodes               : pg-node184
Cores               : 1
State               : RUNNING
Submit              : 2020-09-05T14:56:36
Start               : 2020-09-05T14:57:16
End                 : --
Reserved walltime   : 10-00:00:00
Used walltime       : 9-19:33:32
Used CPU time       : --
% User (Computation): --
% System (I/O)      : --
Mem reserved        : 30G/node
Max Mem used        : 2.39G (pg-node184)
Max Disk Write      : 82.50K (pg-node184)
Max Disk Read       : 1.07M (pg-node184)
richelbilderbeek commented 4 years ago

Too bad: 10 days is not enough for the MSA with ClustalW and PureseqTM. The job was cancelled when it hit the walltime limit:

p230198@peregrine:~ jobinfo 13548437
Name                : run_r_script
User                : p230198
Partition           : regular
Nodes               : pg-node184
Cores               : 1
State               : CANCELLED,TIMEOUT
Submit              : 2020-09-05T14:56:36
Start               : 2020-09-05T14:57:16
End                 : 2020-09-15T14:57:21
Reserved walltime   : 10-00:00:00
Used walltime       : 10-00:00:05
Used CPU time       : 9-23:57:56 (efficiency: 99.99%)
% User (Computation): 99.97%
% System (I/O)      :  0.03%
Mem reserved        : 30G/node
Max Mem used        : 2.39G (pg-node184)
Max Disk Write      : 81.92K (pg-node184)
Max Disk Read       : 1.07M (pg-node184)