ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

Performance differences between 0.2.2 and 0.3.0 #29

olekto closed this issue 1 year ago

olekto commented 1 year ago

Hi.

I have to admit I don't completely understand the technical aspects of holding the database in memory. I understand that it is much quicker, of course, but I'm not sure how to set it up properly.

For instance, we use a cluster where each job we submit can land on a different node, so setting up permanent shared memory is not suitable, and I don't think we could do it anyway since the nodes are shared between different users.

With 0.2.2 we set SHM_LOC= and it worked quite well; FCS-GX finished in minutes even for larger genomes. With 0.3.0-beta, this text is shown if I configure it similarly:

    Page-fault rate for accessing /app/db/gxdb/all.gxs is 106% (should be 0).
    This means that the in-memory GX database is either not on RAM-backed filesystem, or swapped-out.
    GX requires the database to be entirely in RAM; otherwise it will run extremely slow.
    Consider placing the database files in a non-swappable ramfs.
    Or `vmtouch -l -v -m 1000G /path/to/gxdb/all.gx{i,s}` to lock the database pages in RAM.

    Will prefetch (vmtouch) the database pages to have the OS cache them in main memory.
    export GX_PREFETCH=0 to turn off prefetching; =1 - auto(default); =2 - always-on.

What has changed compared to 0.2.2? This process seems to take much longer. For instance, FCS-GX finished in 3 minutes for a 180 Mbp insect genome with 0.2.2, while with 0.3.0-beta it has been running for more than 20 minutes for the same genome.

20-30 minutes is still quick, but much slower than before.

Thank you.

Ole

murphyte commented 1 year ago

Hi Ole -- could you elaborate on how you were running 0.2.2, and how you're trying to run 0.3.0?

Logistically, you have two choices for how to get the db into memory:

  1. copy it into a ramdisk space, either /dev/shm or tmpfs
  2. skip copying, and leave it to GX to prefetch the db into memory
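For option 1, a minimal sketch might look like the following; the paths are hypothetical, and the exact screening invocation depends on your FCS version:

    # Option 1 (sketch): stage the GX database on a RAM-backed filesystem.
    GXDB_SRC=/path/to/gxdb            # on-disk copy of the database (hypothetical path)
    GXDB_SHM=/dev/shm/gxdb            # RAM-backed destination; needs free RAM >= db size

    mkdir -p "$GXDB_SHM"
    cp "$GXDB_SRC"/all.* "$GXDB_SHM"/

    # run the screen with --gx-db "$GXDB_SHM/all", then clean up on shared nodes:
    # rm -rf "$GXDB_SHM"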

Option 2 involves setting SHM_LOC=<disk path> and passing the --gx-db "${SHM_LOC}/gxdb/all" parameter. You'll see that page-fault message from GX, at which point it will automatically run the vmtouch command to cache the database in RAM before screening. The copy or prefetch speed depends on your file system; for us it's either 8 or 35 minutes, depending on where we're reading from (newer or older tech).
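In shell terms, option 2 might look like this sketch; SHM_LOC, --gx-db and GX_PREFETCH come from this thread and the message quoted above, the rest is illustrative:

    # Option 2 (sketch): leave the db on disk and let GX prefetch it.
    export SHM_LOC=/path/to/fast/disk      # hypothetical disk path containing gxdb/
    export GX_PREFETCH=1                   # 1 = auto (default); 2 = always prefetch; 0 = off

    # pass --gx-db "${SHM_LOC}/gxdb/all" to the screening command;
    # GX prints the page-fault warning, then runs vmtouch to pull the db into memory.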

Are you saying you were previously getting fast runs (3 minutes for 180 Mbp, which is about 1/4th of what we get with 48 cores and the db already in /dev/shm) WITHOUT an explicit copy into ramdisk or waiting for prefetching? We have one other user whose HPC file system may have a sizable cache that can serve random access to the db much faster than SSD, though not quite as fast as having the db already in local memory. That only works if the db has been read recently (i.e. it is still in the file system cache). We're not positive that's the explanation for their results, but it seemed plausible.

Could that be the case for your setup? Try copying all the db files anywhere (cp all* /dev/null/ might do it?) and see if that speeds up your subsequent run; one way to do that warm-up is sketched below.
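A sketch of that cache-warming test; the paths are hypothetical, and it only helps if the file system cache is large enough to keep the db resident:

    # Sketch: read the db files once so the file system / page cache holds them.
    time cat /path/to/gxdb/all.gx{i,s} > /dev/null

    # if vmtouch is available, it can load and report on cached pages explicitly:
    # vmtouch -t /path/to/gxdb/all.gx{i,s}    # touch (load) pages into the cache
    # vmtouch -v /path/to/gxdb/all.gx{i,s}    # report how much is currently resident

If a screening run started right after that read is fast, a file-system cache is likely what made your 0.2.2 runs so quick.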

We're revamping the commands a bit more in a new version coming soon, which will hopefully make the logistics of working with the db easier to understand.