ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

Resource control/speed #6

Closed olekto closed 2 years ago

olekto commented 2 years ago

Hi, do you have an estimate of how long FCS-GX should run? I have started it on several genomes, all around 700 Mbp in size, and they have all run for 77 hours and have gone through only about 6-7% of the scaffolds. As far as I can see, it is running with only 1 thread. Is it possible to run it with multiple threads? How do you control that? I couldn't find any documentation on it, and running gx directly from the container does not show any threading or multiprocessing options.

Do you usually split up the genomes you are working with? How much processing power do you use for a human genome assembly, for instance?

Thank you.

Sincerely, Ole

pstrope commented 2 years ago

That's a really long time. How much memory are you using? We suggest 500 GB of memory. A genome of around 700 Mbp should only take about 3 minutes.
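The memory suggestion above can be checked before a job is started. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` is readable; the function names and the 500 GB threshold constant are illustrative and not part of FCS-GX. Note that on a scheduler such as SLURM, `MemTotal` reports the memory of the whole node, not the amount granted to the job, so the job's own memory request still has to be set separately.

```python
# Sketch only: compare total RAM reported by the kernel with the ~500 GB
# suggested in this thread. Names below are hypothetical helpers.
REQUIRED_GB = 500  # taken from the maintainers' suggestion; approximate


def parse_meminfo(text):
    """Return the numeric fields of /proc/meminfo as a dict (values in kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def has_enough_memory(meminfo_text, required_gb=REQUIRED_GB):
    """True if MemTotal is at least required_gb (decimal gigabytes)."""
    total_kib = parse_meminfo(meminfo_text).get("MemTotal", 0)
    return total_kib * 1024 / 1e9 >= required_gb


def check_host(path="/proc/meminfo"):
    """Read the host's meminfo file and apply the check (Linux only)."""
    with open(path) as handle:
        return has_enough_memory(handle.read())
```

Calling `check_host()` on the compute node returns `False` for a 64 GB machine, which is the situation that produced the multi-day runtime reported here.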

olekto commented 2 years ago

Ah, I should have read the manual more carefully. I just scheduled a job on our SLURM system for 64 GB to see how it performs. I'll restart with 500 GB memory.

Is it a single-threaded job?

Ole

On Thu, 11 Aug 2022 at 18:47, Pooja Strope wrote:

That's a really long time. How much memory are you using? We suggest 500 GB memory. Something like 700 Mbp genome should only take about 3 mins.

pstrope commented 2 years ago

FCS-GX will use up to 48 cores.

olekto commented 2 years ago

How can I control how many CPUs are being used? I could not see an option for this anywhere, and when I checked the job, it only used one. I might have looked at it when it was in parts where it could only use one.

When I run this on a shared cluster, it would be nice to have some control over the resources used. And I would prefer to run it with many CPUs rather than one.

Thank you.

Ole

On Thu, 11 Aug 2022 at 21:45, Pooja Strope wrote:

FCS-GX will use up to 48 cores.

olekto commented 2 years ago

Running with more RAM (512 GB) makes the job finish in 15 minutes and not multiple days. Thank you!

pstrope commented 2 years ago

The number of CPUs can be controlled by an environment variable. We are working on another release with this option added. Your feedback is appreciated!

etvedte commented 2 years ago

Hi Ole,

Our newest release, 0.2.3, lets you control the number of CPUs by passing a text file of environment variables via the --env-file parameter. See https://github.com/ncbi/fcs/wiki/FCS-GX#usage-examples for more information.

cat env.txt
GX_NUM_CORES=8

python3 ./run_fcsgx.py --fasta test.fna.gz --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/all" --gx-db-disk ./gxdb --split-fasta --tax-id 508771 --env-file env.txt

Please give it a try and let us know if you come across any issues.
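On a shared cluster, the env file and command shown above could be generated from the scheduler's allocation rather than typed by hand. This is a sketch under assumptions: the helper names are hypothetical, `SLURM_CPUS_PER_TASK` is the variable SLURM sets when `--cpus-per-task` is requested, and the 48-core cap is the figure quoted earlier in this thread. The flags are copied from the command above and nothing else about `run_fcsgx.py` is assumed.

```python
import os
from pathlib import Path

MAX_CORES = 48  # upper bound mentioned by the maintainers in this thread


def choose_cores(environ=os.environ, default=8):
    """Use SLURM_CPUS_PER_TASK when it is a positive integer, capped at MAX_CORES."""
    raw = environ.get("SLURM_CPUS_PER_TASK", "")
    cores = int(raw) if raw.isdigit() and int(raw) > 0 else default
    return min(cores, MAX_CORES)


def render_env_file(cores):
    """Contents of the env file, in the same form as the env.txt shown above."""
    return "GX_NUM_CORES=%d\n" % cores


def write_env_file(path, cores):
    """Write the env file to disk and return its path as a string."""
    Path(path).write_text(render_env_file(cores))
    return str(path)


def build_command(fasta, out_dir, gx_db, gx_db_disk, tax_id, env_file):
    """Argument list mirroring the run_fcsgx.py invocation quoted above."""
    return [
        "python3", "./run_fcsgx.py",
        "--fasta", fasta,
        "--out-dir", out_dir,
        "--gx-db", gx_db,
        "--gx-db-disk", gx_db_disk,
        "--split-fasta",
        "--tax-id", str(tax_id),
        "--env-file", str(env_file),
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing arguments as a list avoids shell quoting problems with paths such as the shared-memory database location.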

olekto commented 2 years ago

Great! Thank you.

Closing this then.

Ole

weilu1998 commented 2 years ago


Hi @etvedte ,

I have a related question. How do I specify the number of CPUs in a Singularity environment? I tried the env option, but it is not available.

Thanks, Wei

etvedte commented 2 years ago

Hi Wei,

The --env-file parameter should work with Singularity. My guess is that you followed the Quickstart and downloaded an old singularity image. We need to update that part of the documentation.

Can you try downloading the newest singularity image and retry? If it still doesn't work, please send me the full command you used.

curl https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/releases/0.2.3/fcs-gx.0.2.3.sif -Lo fcsgx.sif

weilu1998 commented 2 years ago

Hi @etvedte ,

Thank you! You are right, I was not using the newest version. When I ran the job with 500G of memory and 48 cores for a 280Mb insect genome, it still took a long time. Did I set something wrong? Here is the output.

size      : 281.7 MiB
split-fa  : True
BLAST-div :
gx-div    : anml:insects
w/same-tax: True
bin-dir   : /app/bin
gx-db     : /app/db/gxdb/all
output    :

Prefetched memory-mapped pages in 444.672s; 0.717454 GB/s.
Collecting masking statistics...
Using GX_NUM_CORES=48
Collected masking stats: 0.295401 Gbp; 3.73636s; 79.0611 Mbp/s. Baseline: 1.35005

Prefetched memory-mapped pages in 250.476s; 0.707011 GB/s.
Using GX_NUM_CORES=48
18.4MiB 1:01:25 [ 0 B/s] [5.11kiB/s] [========> ] 6% ETA 14:39:52
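The numbers in the progress line can be cross-checked with a little arithmetic. The sketch below assumes that the 18.4 MiB figure is input consumed out of the 281.7 MiB reported as `size`; that reading is an assumption, though it agrees with the displayed 6%. The function names are illustrative.

```python
def parse_hms(text):
    """Convert an H:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Convert seconds back to H:MM:SS, rounding to the nearest second."""
    total = int(round(total_seconds))
    return "%d:%02d:%02d" % (total // 3600, total % 3600 // 60, total % 60)


def average_rate_kib_s(done_mib, elapsed_s):
    """Average throughput so far, in KiB per second."""
    return done_mib * 1024 / elapsed_s


def remaining_seconds(total_mib, done_mib, elapsed_s):
    """Time left if the average rate so far is sustained."""
    rate = average_rate_kib_s(done_mib, elapsed_s)
    return (total_mib - done_mib) * 1024 / rate


# Values copied from the log above.
elapsed = parse_hms("1:01:25")
rate = average_rate_kib_s(18.4, elapsed)
eta = remaining_seconds(281.7, 18.4, elapsed)
print("average rate: %.2f KiB/s" % rate)
print("projected remaining time: " + format_hms(eta))
```

Under that assumption the average rate comes out close to the 5.11 kiB/s in the log, and the projected remaining time lands within about a minute of the reported ETA. That is several orders of magnitude slower than the masking pass, which processed about 79 Mbp/s.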

etvedte commented 2 years ago

Hi @weilu1998 ,

We will try to diagnose the issue. Can you try a few things?

It appears that the database files are being swapped out by the OS, and processing slows down due to thrashing. To confirm that the database files are in memory, please provide the output of vmtouch, as in the example below.

>time vmtouch -m1000G -v /my/path/to/gxdb/all.gx*
/dev/shm/gxdb/all.gxi
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 77888542/77888542
/dev/shm/gxdb/all.gxs
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 43234624/43234624

           Files: 2
     Directories: 0
  Resident Pages: 121123166/121123166  462G/462G  100%
         Elapsed: 0.28313 seconds

real    0m0.289s
user    0m0.213s
sys     0m0.076s
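When this check is repeated across nodes or jobs, the `Resident Pages` line can be read programmatically. This is a minimal sketch that assumes the vmtouch summary format shown above; the function name is hypothetical.

```python
import re


def resident_fraction(vmtouch_output):
    """Fraction of pages resident in memory, taken from vmtouch's summary line."""
    match = re.search(r"Resident Pages:\s*(\d+)/(\d+)", vmtouch_output)
    if match is None:
        raise ValueError("no 'Resident Pages' line found")
    resident, total = int(match.group(1)), int(match.group(2))
    # Guard against an empty file set, where total would be zero.
    return resident / total if total else 0.0
```

A value below 1.0 would mean that part of the database is not in RAM, which is the condition being tested for here.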

Please run GX until the prefetching stages are done and it starts to process the input sequence. Then terminate it with CTRL-C. At that point, can you provide the output of the free -h command?

Lastly, while GX is running, can you open a separate terminal, or a separate session in screen or tmux, and run the top command with the SWAP column added? Instructions here: https://www.thegeekdiary.com/how-to-check-swap-usage-live-via-the-top-command-in-linux/
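As a non-interactive alternative to watching top, per-process swap usage can be read from procfs. The sketch below assumes a Linux kernel that exposes the `VmSwap` field in `/proc/<pid>/status`; the helper names are hypothetical, and the PID of the gx process has to be looked up separately. `VmSwap` covers swapped-out anonymous memory only and excludes shared-memory swap, so it complements the vmtouch check rather than replacing it.

```python
def vm_swap_kib(status_text):
    """Return the VmSwap value (kB) from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return None


def read_vm_swap_kib(pid):
    """Read VmSwap for a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return vm_swap_kib(handle.read())
```

Polling `read_vm_swap_kib(pid)` every few seconds while GX processes the input would show whether the value grows over time.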