snayfach / MIDAS

An integrated pipeline for estimating strain-level genomic variation from metagenomic data
http://dx.doi.org/10.1101/gr.201863.115
GNU General Public License v3.0

Question regarding speed of execution #129

Open rdauria opened 2 years ago

rdauria commented 2 years ago

I manage applications on a research cluster and our researchers have been reporting issues with the execution speed of your software on our cluster.

I have just run through the first step of the tutorial (https://github.com/snayfach/MIDAS/blob/master/docs/tutorial.md) and I wonder whether you could let me know if the timings we are getting (see below) are exceedingly long.

I am running the code from HPC network storage on one core of an Intel(R) Xeon(R) Gold 6240 node (I can also provide timings for the subsequent steps in the tutorial if that would help). Moving the DB, the sample file and the output directory to local storage did not seem to affect the speed significantly.

Thanks and please see the timings for the first step of the tutorial below:

/usr/bin/time -p -v run_midas.py species midas_output/SAMPLE_1 -1 /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz --remove_temp 

MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.3.0; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

===========Parameters===========
Command: /u/local/apps/midas/1.3.2/MIDAS/scripts/run_midas.py species midas_output/SAMPLE_1 -1 /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz --remove_temp
Script: run_midas.py species
Database: /u/local/apps/midas/DB/midas_db_v1.2
Output directory: midas_output/SAMPLE_1
Input reads (unpaired): /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz
Remove temporary files: True
Word size for database search: 28
Minimum mapping alignment coverage: 0.75
Number of reads to use from input: use all
Number of threads for database search: 1
================================

Aligning reads to marker-genes database
  0.66 minutes
  0.75 Gb maximum memory

Classifying reads
  total alignments: 2916
  uniquely mapped reads: 1013
  ambiguously mapped reads: 47
  0.0 minutes
  0.76 Gb maximum memory

Estimating species abundance
  total marker-gene coverage: 10.637
  0.0 minutes
  0.76 Gb maximum memory
    Command being timed: "run_midas.py species midas_output/SAMPLE_1 -1 /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz --remove_temp"
    User time (seconds): 43.04
    System time (seconds): 6.53
    Percent of CPU this job got: 118%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.71
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 646160
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 83162
    Voluntary context switches: 18072
    Involuntary context switches: 1161
    Swaps: 0
    File system inputs: 391032
    File system outputs: 904
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
rdauria commented 2 years ago

I should add that I installed MIDAS using Python version 3.9.6. I have just noticed that when running the second part of the tutorial (snps) there is an issue with the multiprocessing package that produces this error:

TypeError: cannot pickle '_io.TextIOWrapper' object

Any idea? Which versions of Python do you support?
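In case it helps narrow this down, here is a minimal sketch (not MIDAS code; it just opens its own source file) that reproduces the same failure mode on Python 3.9: arguments passed to a multiprocessing.Pool are pickled before being sent to the worker processes, and open file objects (_io.TextIOWrapper) cannot be pickled.

# Minimal reproduction (not MIDAS code): Pool task arguments are pickled
# before being dispatched to workers, and open file objects are not picklable.
import multiprocessing as mp

def count_lines(handle):
    return sum(1 for _ in handle)

if __name__ == "__main__":
    handle = open(__file__)  # an _io.TextIOWrapper, standing in for any open file
    with mp.Pool(2) as pool:
        results = [pool.apply_async(count_lines, (handle,))]
        # raises: TypeError: cannot pickle '_io.TextIOWrapper' object
        print([r.get() for r in results])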

FYI, the error in context is:

/usr/bin/time -p -v run_midas.py snps midas_output/SAMPLE_1 -1 /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz -t 8

MIDAS: Metagenomic Intra-species Diversity Analysis System
version 1.3.0; github.com/snayfach/MIDAS
Copyright (C) 2015-2016 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)

===========Parameters===========
Command: /u/local/apps/midas/1.3.2/MIDAS/scripts/run_midas.py snps midas_output/SAMPLE_1 -1 /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz -t 8
Script: run_midas.py snps
Database: /u/local/apps/midas/DB/midas_db_v1.2
Output directory: midas_output/SAMPLE_1
Remove temporary files: False
Pipeline options:
  build bowtie2 database of genomes
  align reads to bowtie2 genome database
  use samtools to generate pileups and count variants
Database options:
  include all species with >=3.0X genome coverage
Read alignment options:
  input reads (unpaired): /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz
  alignment speed/sensitivity: very-sensitive
  alignment mode: global
  number of reads to use from input: use all
  number of threads for database search: 8
SNP calling options:
  minimum alignment percent identity: 94.0
  minimum mapping quality score: 20
  minimum base quality score: 30
  minimum read quality score: 20
  minimum alignment coverage of reads: 0.75
  trim 0 base-pairs from 3'/right end of read
================================

Reading reference data
  0.0 minutes
  0.1 Gb maximum memory

Building database of representative genomes
  total genomes: 1
  total contigs: 1
  total base-pairs: 5163189
  0.04 minutes
  0.26 Gb maximum memory

Mapping reads to representative genomes
  finished aligning
  checking bamfile integrity
  0.09 minutes
  0.44 Gb maximum memory

Indexing bamfile
  0.0 minutes
  0.44 Gb maximum memory

Counting alleles
Traceback (most recent call last):
  File "/u/local/apps/midas/1.3.2/MIDAS/scripts/run_midas.py", line 757, in <module>
    run_program(program, args)
  File "/u/local/apps/midas/1.3.2/MIDAS/scripts/run_midas.py", line 82, in run_program
    snps.run_pipeline(args)
  File "/u/local/apps/midas/1.3.2/MIDAS/midas/run/snps.py", line 301, in run_pipeline
    pysam_pileup(args, species, contigs)
  File "/u/local/apps/midas/1.3.2/MIDAS/midas/run/snps.py", line 228, in pysam_pileup
    aln_stats = utility.parallel(species_pileup, argument_list, args['threads'])
  File "/u/local/apps/midas/1.3.2/MIDAS/midas/utility.py", line 101, in parallel
    return [r.get() for r in results]
  File "/u/local/apps/midas/1.3.2/MIDAS/midas/utility.py", line 101, in <listcomp>
    return [r.get() for r in results]
  File "/u/local/apps/python/3.9.6/gcc-4.8.5/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/u/local/apps/python/3.9.6/gcc-4.8.5/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/u/local/apps/python/3.9.6/gcc-4.8.5/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/u/local/apps/python/3.9.6/gcc-4.8.5/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
Command exited with non-zero status 1
        Command being timed: "run_midas.py snps midas_output/SAMPLE_1 -1 /u/local/apps/midas/EXAMPLE/example/sample_1.fq.gz -t 8"
        User time (seconds): 55.55
        System time (seconds): 5.84
        Percent of CPU this job got: 669%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.16
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 358876
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 218678
        Voluntary context switches: 61387
        Involuntary context switches: 631
        Swaps: 0
        File system inputs: 79176
        File system outputs: 169336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1
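
For what it's worth, the usual pattern to avoid this class of error looks something like the sketch below (hypothetical helper names, not a patch against midas/utility.py): pass paths or other picklable values to the workers and open the files inside the worker function.

# Generic workaround sketch (hypothetical names, not MIDAS's actual code):
# keep only picklable values (paths, strings, numbers) in the Pool task
# arguments and create file handles inside the worker process itself.
import multiprocessing as mp

def count_lines_from_path(path):
    # the handle is opened inside the worker, so nothing unpicklable
    # ever crosses the process boundary
    with open(path) as handle:
        return sum(1 for _ in handle)

def parallel(function, argument_list, threads):
    # same shape as a Pool-based helper: submit tasks, collect results
    with mp.Pool(int(threads)) as pool:
        results = [pool.apply_async(function, args) for args in argument_list]
        return [r.get() for r in results]

if __name__ == "__main__":
    argument_list = [(__file__,), (__file__,)]  # illustrative inputs only
    print(parallel(count_lines_from_path, argument_list, threads=2))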