Measuring pipeline's RAM usage with another method

leoisl commented 1 year ago

This PR adds code to watch the system's memory instead of trying to get the RAM usage from the pipeline process itself. The bad thing about this implementation is that to get precise measurements, we should not run anything else together with the pipeline. In my machine, I got better measurements with this method with small test data. Running on my benchmarking machine for the EBI plasmids query, the difference was not big, but my config was not ideal to test it, because the highest COBS job takes 10.35 GB of RAM (this is what the old method reports), but my RAM limit for the pipeline was 12 GB, so there is not much room to use more RAM. Anyway, this new RAM measurement is appended to the pipeline benchmark logs upon running it. For example this is what I have for the match and map steps now for EBI plasmids:

match

old method: 10.35 GB new method: 11.24 GB

cat match_2022_12_06T12_45_03.txt 
# Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=12288 -- match
real(s) sys(s)  user(s) percent_CPU max_RAM(kb) FS_inputs   FS_outputs  elapsed_time_alt(s)
28356.15    1601.52 132002.67   471%    10854992    143254656   1516440 28356.171665
RAM usage:
Initial RAM usage (GB): 0.55
Max RAM usage (GB): 11.80
mof-search predicted RAM usage (GB): 11.24

map

old method: 0.84 GB new method: 2.82 GB

cat map_2022_12_06T20_37_39.txt 
# Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=12288 -- map
real(s) sys(s)  user(s) percent_CPU max_RAM(kb) FS_inputs   FS_outputs  elapsed_time_alt(s)
5309.85 2339.90 24025.16    496%    882980  67777896    14248400    5309.870436
RAM usage:
Initial RAM usage (GB): 0.54
Max RAM usage (GB): 3.37
mof-search predicted RAM usage (GB): 2.82

I am curious what you will get in your machine with the same configs given here: https://github.com/karel-brinda/mof-search/issues/219

karel-brinda commented 1 year ago

Should I test this with my plasmid search experiment?

Also, how can I checkout your branches from a different repository (your own fork)?

leoisl commented 1 year ago

Should I test this with my plasmid search experiment?

Yeah, could you rerun https://github.com/karel-brinda/mof-search/issues/219 but with this code?

Also, how can I checkout your branches from a different repository (your own fork)?

I thought you could add my fork and checkout the branch, but I am not sure... I am pushing to your fork too

leoisl commented 1 year ago

Is here: https://github.com/karel-brinda/mof-search/tree/fix/219

karel-brinda commented 1 year ago

Thanks a lot! It definitely must be somehow possible as the decentralization was one core ideas behind git, but I didn't manage to find out how to do it easily. Going to run this know; thanks a lot!

karel-brinda commented 1 year ago

$ l logs/benchmarks/
total 16
drwxr-xr-x    6 pseudokarel  staff   192B Dec  8 02:16 .
drwxr-xr-x    4 pseudokarel  staff   128B Dec  8 02:16 ..
-rw-r--r--    1 pseudokarel  staff   336B Dec  8 02:17 match_2022_12_08T00_26_23.txt
-rw-r--r--    1 pseudokarel  staff    45B Dec  8 02:17 match_2022_12_08T00_26_23.txt.tmp
drwxr-xr-x  307 pseudokarel  staff   9.6K Dec  8 02:16 run_cobs
drwxr-xr-x    3 pseudokarel  staff    96B Dec  8 02:17 translate_matches

leoisl commented 1 year ago

Ahm, that is weird, it should be there, we create it here: https://github.com/karel-brinda/mof-search/blob/c16d45ab99bbfe8172f95dc350881ccb2ff5c73c/scripts/benchmark.py#L48-L50

Unless for some reason https://github.com/karel-brinda/mof-search/blob/fix/219/scripts/get_RAM_usage.py does not work on Mac. Does free --bytes command work for you?

karel-brinda commented 1 year ago

$ free --bytes
-bash: free: command not found

leoisl commented 1 year ago

Oh, it seems is not that simple to get this info in Mac (but there are several options): https://apple.stackexchange.com/questions/4286/is-there-a-mac-os-x-terminal-version-of-the-free-command-in-linux-systems

I get this when I run free -h on Linux:

$ free --bytes
              total        used        free      shared  buff/cache   available
Mem:    16529178624  8188702720   465960960  2002505728  7874514944  5997555712
Swap:    1027600384   899420160   128180224

The field I am interested on is the total used RAM (in this case is the 3rd field of the Mem line: 8188702720). If you could make a script that returns the total used RAM and that works on Mac, then I can do the rest (invoke the correct script in Linux or Mac)...

karel-brinda commented 1 year ago

We shouldn't bother with this too much. In particular, I'm a bit worried that this might be adding additional non-essential layers of complexity into the software.

I'll try to run it on the HMS cluster, which has Linux, and then save the results. In the future, we can use this type of measurement as optional, but we shouldn't be risking that the whole pipeline sometimes fail because of the RAM measurements.

leoisl commented 1 year ago

We shouldn't bother with this too much. In particular, I'm a bit worried that this might be adding additional non-essential layers of complexity into the software.

I'll try to run it on the HMS cluster, which has Linux, and then save the results. In the future, we can use this type of measurement as optional, but we shouldn't be risking that the whole pipeline sometimes fail because of the RAM measurements.

I vote for us to have the make benchmark command back, and just benchmark the run if this command is invoked. I am not so keen on running this on clusters because it can have several jobs from other users, which will make the benchmark inaccurate. In particular this RAM measurement method requires that only mof-search pipeline is running in the machine...

karel-brinda commented 1 year ago

In fact, the "non-ram" measurements are extremely valuable!! I'm using them all the time, and I also think we will need them in the future from users (given how fragile some of the tradeoffs can be).

Maybe we could have make benchmark that adds "additional non-standard" benchmarks that are useful only in certain situations and might be eg failing?

karel-brinda commented 1 year ago

because it can have several jobs from other users

This is great to know! I'll try to get an entire node.

karel-brinda commented 1 year ago

Another error message I'm getting:

scripts/benchmark.py --log logs/benchmarks/test_match_2022_12_08T13_28_29.txt "snakemake --jobs all --rerun-incomplete --p
rintshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=61440 --config
 batches=data/batches_small.txt nb_best_hits=1 -- match"
  File "scripts/get_RAM_usage.py", line 19
    fout.write(f"Initial RAM usage (GB): {initial_RAM:.2f}\n")
                                                            ^                                                             
SyntaxError: invalid syntax
Batches: ['actinobacillus_pleuropneumoniae__01', 'aeromonas_salmonicida__01', 'bacillus_anthracis__01']
Query files: ['queries/reads_4.fa', 'queries/reads_3.fasta', 'queries/reads_2.fq', 'queries/reads_1.fastq']

karel-brinda commented 1 year ago

And after this, again the previous error:

[Thu Dec  8 08:32:58 2022]
Finished job 0.
10 of 10 steps (100%) done
Complete log: /n/scratch3/users/k/kb219/mof-search/.snakemake/log/2022-12-08T082832.515610.snakemake.log
Traceback (most recent call last):
  File "/n/scratch3/users/k/kb219/mof-search/scripts/benchmark.py", line 72, in <module>
    main()
  File "/n/scratch3/users/k/kb219/mof-search/scripts/benchmark.py", line 63, in main
    with open(RAM_tmp_log_file) as log_fh_tmp, open(log_file, "a") as log_fh:
FileNotFoundError: [Errno 2] No such file or directory: 'logs/benchmarks/test_match_2022_12_08T13_28_29.txt.RAM.tmp'
make: *** [Makefile:23: test] Error 1

leoisl commented 1 year ago

We can do the same with a python cross-platform package: https://github.com/giampaolo/psutil , will fix this soon

karel-brinda commented 1 year ago

But I think the issue is the following: SyntaxError: invalid syntax. Btw. I think psutil doesn't work very well. They tried it in Snakemake several years ago and some abandoned.

leoisl commented 1 year ago

But I think the issue is the following: SyntaxError: invalid syntax.

This is an issue I introduced, I should not be using python here but instead using sys.executable

I think psutil doesn't work very well.

Yeah, I had some issues with it as well, but we can use psutil.virtual_memory().used to get the overall used memory in the system, and it works in linux just as well as free... I am hoping it will work on Mac too

karel-brinda commented 1 year ago

Is this branch supposed to work now?

leoisl commented 1 year ago

No, working on it... PRing soon..

karel-brinda commented 1 year ago

Perfect, thanks for the info!

leoisl commented 1 year ago

Hey @karel-brinda , does this work now in Mac? I mirrored this branch in your repo: https://github.com/karel-brinda/mof-search/tree/fix/219

Basically now we use psutil to query for the system memory, which should be portable. I failed to use psutil to precisely get the memory usage of a process (I had same results as what time gives us), but it could retrieve system memory usage correctly. Now, at every 0.1 second we query the system memory when we are running the pipeline (i.e. we just do this if the command we are running start with snakemake). For the benchmark logs, this PR adds a column, max_delta_system_RAM(kb), which denotes the maximum difference of RAM usage between the start of the pipeline execution and any point of the execution (e.g. if at the start, the system was using 6 GB of RAM, and the maximum RAM usage during the execution was 17 GB of RAM, then max_delta_system_RAM will be 9 GB). This should give us precise RAM measure only if the pipeline is being run alone in the machine. For example, this is benchmark log for the match part in the sample example:

# Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=12288 --config batches=data/batches_small.txt nb_best_hits=1 -- match
real(s) sys(s)  user(s) percent_CPU max_RAM(kb) FS_inputs   FS_outputs  elapsed_time_alt(s) max_delta_system_RAM(kb)
5.58    3.84    8.44    220%    1343268 912 608 5.588691    2669164.00

i.e. while time reports 1.28 GB of RAM usage (1343268 kb), this new method reports 2.54 GB (2669164.00 kb), and I think is actually what we use.

Could you please run this and check if it fixes https://github.com/karel-brinda/mof-search/issues/219?

leoisl commented 1 year ago

I don't like adding psutil as a dependency though... Mac users will need to install gtime and psutil to run the pipeline, but most of them will just be interested on running it, not benchmarking it. If this works for you, I'd vote for us to put back the benchmark commands: make benchmark and make test_benchmark.

karel-brinda commented 1 year ago

Agree about the installation of psutil and gtime being annoying and impractical if necessary only for auxiliary measurements. On the other, after several months of testing mof-search for different types of queries, the benchmarking logs are actually quite important.

I suggest the following solution:

Using the psutil benchmark only when psutil is present (try import psutil except ImportError)
On Mac replacing gtime by \time -l (I tested it and the output quantities are nearly identical, only the output format is slightly different)

karel-brinda commented 1 year ago

Thank you very much for implementing the new mem measurement method. I'm testing it right now.

leoisl commented 1 year ago

Closing this to reopen a new PR

karel-brinda / Phylign

Measuring pipeline's RAM usage with another method #220

match

map