Closed leoisl closed 1 year ago
Should I test this with my plasmid search experiment?
Also, how can I checkout your branches from a different repository (your own fork)?
Should I test this with my plasmid search experiment?
Yeah, could you rerun https://github.com/karel-brinda/mof-search/issues/219 but with this code?
Also, how can I checkout your branches from a different repository (your own fork)?
I thought you could add my fork and checkout the branch, but I am not sure... I am pushing to your fork too
Thanks a lot! It definitely must be somehow possible as the decentralization was one core ideas behind git, but I didn't manage to find out how to do it easily. Going to run this know; thanks a lot!
$ l logs/benchmarks/
total 16
drwxr-xr-x 6 pseudokarel staff 192B Dec 8 02:16 .
drwxr-xr-x 4 pseudokarel staff 128B Dec 8 02:16 ..
-rw-r--r-- 1 pseudokarel staff 336B Dec 8 02:17 match_2022_12_08T00_26_23.txt
-rw-r--r-- 1 pseudokarel staff 45B Dec 8 02:17 match_2022_12_08T00_26_23.txt.tmp
drwxr-xr-x 307 pseudokarel staff 9.6K Dec 8 02:16 run_cobs
drwxr-xr-x 3 pseudokarel staff 96B Dec 8 02:17 translate_matches
Ahm, that is weird, it should be there, we create it here: https://github.com/karel-brinda/mof-search/blob/c16d45ab99bbfe8172f95dc350881ccb2ff5c73c/scripts/benchmark.py#L48-L50
Unless for some reason https://github.com/karel-brinda/mof-search/blob/fix/219/scripts/get_RAM_usage.py does not work on Mac. Does free --bytes
command work for you?
$ free --bytes
-bash: free: command not found
Oh, it seems is not that simple to get this info in Mac (but there are several options): https://apple.stackexchange.com/questions/4286/is-there-a-mac-os-x-terminal-version-of-the-free-command-in-linux-systems
I get this when I run free -h
on Linux:
$ free --bytes
total used free shared buff/cache available
Mem: 16529178624 8188702720 465960960 2002505728 7874514944 5997555712
Swap: 1027600384 899420160 128180224
The field I am interested on is the total used RAM (in this case is the 3rd field of the Mem
line: 8188702720
). If you could make a script that returns the total used RAM and that works on Mac, then I can do the rest (invoke the correct script in Linux or Mac)...
We shouldn't bother with this too much. In particular, I'm a bit worried that this might be adding additional non-essential layers of complexity into the software.
I'll try to run it on the HMS cluster, which has Linux, and then save the results. In the future, we can use this type of measurement as optional, but we shouldn't be risking that the whole pipeline sometimes fail because of the RAM measurements.
We shouldn't bother with this too much. In particular, I'm a bit worried that this might be adding additional non-essential layers of complexity into the software.
I'll try to run it on the HMS cluster, which has Linux, and then save the results. In the future, we can use this type of measurement as optional, but we shouldn't be risking that the whole pipeline sometimes fail because of the RAM measurements.
I vote for us to have the make benchmark
command back, and just benchmark the run if this command is invoked. I am not so keen on running this on clusters because it can have several jobs from other users, which will make the benchmark inaccurate. In particular this RAM measurement method requires that only mof-search pipeline is running in the machine...
In fact, the "non-ram" measurements are extremely valuable!! I'm using them all the time, and I also think we will need them in the future from users (given how fragile some of the tradeoffs can be).
Maybe we could have make benchmark
that adds "additional non-standard" benchmarks that are useful only in certain situations and might be eg failing?
because it can have several jobs from other users
This is great to know! I'll try to get an entire node.
Another error message I'm getting:
scripts/benchmark.py --log logs/benchmarks/test_match_2022_12_08T13_28_29.txt "snakemake --jobs all --rerun-incomplete --p
rintshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=61440 --config
batches=data/batches_small.txt nb_best_hits=1 -- match"
File "scripts/get_RAM_usage.py", line 19
fout.write(f"Initial RAM usage (GB): {initial_RAM:.2f}\n")
^
SyntaxError: invalid syntax
Batches: ['actinobacillus_pleuropneumoniae__01', 'aeromonas_salmonicida__01', 'bacillus_anthracis__01']
Query files: ['queries/reads_4.fa', 'queries/reads_3.fasta', 'queries/reads_2.fq', 'queries/reads_1.fastq']
And after this, again the previous error:
[Thu Dec 8 08:32:58 2022]
Finished job 0.
10 of 10 steps (100%) done
Complete log: /n/scratch3/users/k/kb219/mof-search/.snakemake/log/2022-12-08T082832.515610.snakemake.log
Traceback (most recent call last):
File "/n/scratch3/users/k/kb219/mof-search/scripts/benchmark.py", line 72, in <module>
main()
File "/n/scratch3/users/k/kb219/mof-search/scripts/benchmark.py", line 63, in main
with open(RAM_tmp_log_file) as log_fh_tmp, open(log_file, "a") as log_fh:
FileNotFoundError: [Errno 2] No such file or directory: 'logs/benchmarks/test_match_2022_12_08T13_28_29.txt.RAM.tmp'
make: *** [Makefile:23: test] Error 1
We can do the same with a python cross-platform package: https://github.com/giampaolo/psutil , will fix this soon
But I think the issue is the following: SyntaxError: invalid syntax
. Btw. I think psutil doesn't work very well. They tried it in Snakemake several years ago and some abandoned.
But I think the issue is the following:
SyntaxError: invalid syntax
.
This is an issue I introduced, I should not be using python
here but instead using sys.executable
I think psutil doesn't work very well.
Yeah, I had some issues with it as well, but we can use psutil.virtual_memory().used
to get the overall used memory in the system, and it works in linux
just as well as free
... I am hoping it will work on Mac too
Is this branch supposed to work now?
No, working on it... PRing soon..
Perfect, thanks for the info!
Hey @karel-brinda , does this work now in Mac? I mirrored this branch in your repo: https://github.com/karel-brinda/mof-search/tree/fix/219
Basically now we use psutil
to query for the system memory, which should be portable. I failed to use psutil
to precisely get the memory usage of a process (I had same results as what time
gives us), but it could retrieve system memory usage correctly. Now, at every 0.1 second we query the system memory when we are running the pipeline (i.e. we just do this if the command we are running start with snakemake
). For the benchmark logs, this PR adds a column, max_delta_system_RAM(kb)
, which denotes the maximum difference of RAM usage between the start of the pipeline execution and any point of the execution (e.g. if at the start, the system was using 6 GB of RAM, and the maximum RAM usage during the execution was 17 GB of RAM, then max_delta_system_RAM
will be 9 GB). This should give us precise RAM measure only if the pipeline is being run alone in the machine. For example, this is benchmark log for the match part in the sample example:
# Benchmarking command: snakemake --jobs all --rerun-incomplete --printshellcmds --keep-going --use-conda --resources max_download_threads=8 max_io_heavy_threads=8 max_ram_mb=12288 --config batches=data/batches_small.txt nb_best_hits=1 -- match
real(s) sys(s) user(s) percent_CPU max_RAM(kb) FS_inputs FS_outputs elapsed_time_alt(s) max_delta_system_RAM(kb)
5.58 3.84 8.44 220% 1343268 912 608 5.588691 2669164.00
i.e. while time
reports 1.28 GB of RAM usage (1343268 kb), this new method reports 2.54 GB (2669164.00 kb), and I think is actually what we use.
Could you please run this and check if it fixes https://github.com/karel-brinda/mof-search/issues/219?
I don't like adding psutil
as a dependency though... Mac users will need to install gtime
and psutil
to run the pipeline, but most of them will just be interested on running it, not benchmarking it. If this works for you, I'd vote for us to put back the benchmark commands: make benchmark
and make test_benchmark
.
Agree about the installation of psutil
and gtime
being annoying and impractical if necessary only for auxiliary measurements. On the other, after several months of testing mof-search for different types of queries, the benchmarking logs are actually quite important.
I suggest the following solution:
gtime
by \time -l
(I tested it and the output quantities are nearly identical, only the output format is slightly different)Thank you very much for implementing the new mem measurement method. I'm testing it right now.
Closing this to reopen a new PR
This PR adds code to watch the system's memory instead of trying to get the RAM usage from the pipeline process itself. The bad thing about this implementation is that to get precise measurements, we should not run anything else together with the pipeline. In my machine, I got better measurements with this method with small test data. Running on my benchmarking machine for the EBI plasmids query, the difference was not big, but my config was not ideal to test it, because the highest COBS job takes 10.35 GB of RAM (this is what the old method reports), but my RAM limit for the pipeline was 12 GB, so there is not much room to use more RAM. Anyway, this new RAM measurement is appended to the pipeline benchmark logs upon running it. For example this is what I have for the match and map steps now for EBI plasmids:
match
old method: 10.35 GB new method: 11.24 GB
map
old method: 0.84 GB new method: 2.82 GB
I am curious what you will get in your machine with the same configs given here: https://github.com/karel-brinda/mof-search/issues/219