Inconsistency of memory usage values within the report and timeline

phupe commented 5 years ago

Bug report

Expected behavior and actual behavior

expected behavior

Whatever we look at the Tasks table in Raw values mode or Human readable mode we expect to read the exact same values for the memory usage if the decimal system is used. We also expect these values to be the same in the Timeline.

actual behavior

in the Memory usage plot from the report, the memory value reported is 1051574272 (raw value extracted from Plotly) and displayed as 1.052G in the boxplot
in the Tasks table from the report in the Raw values mode, the memory value reported in the field vmem is 1051574272 which is consistent the previous value in the plot
in the Tasks table from the report in the Human readable mode, the memory value reported in the field vmem is 1.1GB seems to be ~ 1051574272 x (1.024 x 1.024). This value should be 0.97935GB = 1051574272/(1024 x 1024 x 1024) if the binary unit is used. Same behavior holds for peak_vmem and allocated memory fields but not for rss and peak_rss.
in the Timeline, the memory displayed is 1002.9 MB which is ~ 1051574272/(1024 x 1024) = 1002.859375. This is correct if the binary unit is used.

It should be mentioned whether you use decimal or binary unit to display the memory usage.

Steps to reproduce the problem

Install the program stress: sudo apt-get install stress or sudo yum install stress

Create the following nextflow.nf script:

#!/usr/bin/env nextflow

process TwoCpus4mn {

    cpus 1
    memory '1 GB'

    """
    /usr/bin/time -v stress -c 2 -t 15 -m 1 --vm-bytes 1000000000
    """
}

Launch:

nextflow nextflow.nf -with-report report.html -with-timeline timeline.html

Environment

Nextflow version: [18.10.1]
Java version: [OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)]
Operating system: [Linux 4.15.0-43-generic x86_64]

phupe commented 5 years ago

reply to myself:

Actually, the memory value reported in the field vmem with the value 1.1GB is due to the rounding performed in line 154 of the file modules/nextflow/src/main/resources/nextflow/trace/assets/ReportTemplate.js: return bytes.toFixed(1)+' '+units[u];

Changing this line to return bytes.toFixed(3)+' '+units[u]; gives 1.052 GB which is consistent with what is displayed in the boxplot. I suggest to apply this modification in ReportTemplate.js.

It seems that the report uses decimal unit for the memory while the timeline uses the binary unit. Same unit should be used for both.

pditommaso commented 5 years ago

Related to this I've also noted that the fields read_bytes and write_bytes are not consistent with the activity of stress, it should be used rchar and wchar instead. http://man7.org/linux/man-pages/man5/proc.5.html

stevekm commented 5 years ago

This is interesting, I started seeing issues in the reported memory usage on our new cluster as well (RHEL 7, Singularity 2.5.2), was not sure if its related. Example:

Obviously, nothing was using 1.5TB of RAM, but I have been seeing this in the reports. The trace from these tasks:

task_id hash    process %mem    rss vmem    peak_rss    peak_vmem   rchar   wchar   syscr   syscw
1129    89/416c95   custom_sample_report    0.0 82096128    1664303104  82231296    1576106508288   9044907 44077   1315    190
1135    bd/ea7a7c   custom_sample_report    0.0 82849792    1674637312  82984960    1576106508288   9567085 20826   1194    158
1137    1c/b5b8bc   custom_sample_report    0.0 84549632    1674637312  84684800    1576106508288   9566701 20809   1198    158
1120    9f/202012   custom_sample_report    0.0 82272256    1674637312  82407424    1576106508288   9567187 20977   1194    160
1124    7e/00070c   custom_sample_report    0.0 81670144    1675296768  81805312    1576106512384   9567045 20743   1191    158
1127    33/c31a21   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9566915 20805   1196    158
1123    dc/d4da94   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9567153 20807   1194    158
1131    fb/eaaaf4   custom_sample_report    0.0 81702912    1666248704  81838080    1576106508288   9566945 20656   1194    156
1148    8c/5b4a56   custom_sample_report    0.0 81666048    1666101248  81801216    1576106512384   9566807 20591   1195    156
1136    b3/c74a1e   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9566947 20809   1194    158
1138    80/677109   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9566767 20805   1195    158
1125    8c/795dc5   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9567329 20810   1194    158
1126    48/c6dae0   custom_sample_report    0.0 82714624    1674625024  82849792    1576106508288   9566919 20703   1194    158
1128    f0/309dc5   custom_sample_report    0.0 80650240    1664303104  80785408    1576106508288   9044899 44081   1312    190
1139    a9/3a2d55   custom_sample_report    0.0 81702912    1666248704  81838080    1576106508288   9566733 20658   1195    156
1130    d5/f3bbb7   custom_sample_report    0.0 81702912    1666248704  81838080    1576106508288   9567065 20657   1195    156
1132    43/19700d   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9566913 20813   1195    158
1133    0f/e99f8b   custom_sample_report    0.0 81702912    1666248704  81838080    1576106508288   9567085 20661   1195    156
1134    4b/800aee   custom_sample_report    0.0 81707008    1674637312  81842176    1576106508288   9566921 20809   1198    158
1142    be/6004f2   custom_sample_report    0.0 73023488    1576062324736   73158656    1576106508288   8943090 10795   1085    124
1143    f5/254075   custom_sample_report    0.0 72916992    1576068993024   73052160    1576106508288   8943104 10805   1085    124
1144    2e/a49717   custom_sample_report    0.0 74350592    1576019144704   74485760    1576106508288   8943098 10795   1087    124
1141    e5/c12f4d   custom_sample_report    0.0 72916992    1576062324736   73052160    1576106508288   8943090 10795   1085    124
1145    20/78905c   custom_sample_report    0.0 72916992    1576068997120   73052160    1576106508288   8943090 10795   1085    124
1140    99/234a34   custom_sample_report    0.0 72916992    1576073678848   73052160    1576106508288   8943090 10795   1085    124
1149    59/feb997   custom_sample_report    0.0 72912896    1576065224704   73048064    1576106512384   8942978 10715   1085    124
1146    e7/bc0bca   custom_sample_report    0.0 72916992    1576064217088   73052160    1576106508288   8943090 10795   1085    124
1147    59/70ae9f   custom_sample_report    0.0 72916992    1576073678848   73052160    1576106508288   8943090 10795   1085    124

edit: this might be more relevant for #980 actually

pditommaso commented 5 years ago

@stevekm Work in progress, stay in touch.

@phupe I've slightly modified the RSS figure, adding as a third tab in the Memory chart instead of having a separate chart for RSS memory. I think this a bit more consistent, also because I don't think it has much sense to report the RSS as percentage of memory request. My understanding is that RSS is much smaller than allocated (virtual) memory.

pditommaso commented 5 years ago

Included in version 19.01.0-edge.

phupe commented 5 years ago

@pditommaso:

This disccusion is actually related to Issue #980.

I see your point but let me clarified why it is important to have the percentage of memory requested for the peak RSS. Thus I really think that 2 different charts are needed for both vmem and peak_rss.

The total amount of memory used be a processs is the Virtual memory (vmem). The vmem contains all memory areas whether they are in the physical memory (RAM), in the Swap space, on the disk or shared with other processes.

The resident set size (RSS) is the amount of space of physical memory (RAM) held by a process.

The relationship is: vmem >= RSS + Swap.

The following page gives an interesting example (https://docs.hpc.qmul.ac.uk/using/memory/) to illustrate the difference of vmem and RSS.

Let's assume you have a newtflow process with the memory directive set to memory '1 GB'.

Let's assume we submit a job using PBS/Torque. According to the PbsExecutor.groovy file in nextflow source code, the job will be submitted using the following command: qsub -l mem=1gb.

The mem option of PBS/Torque corresponds to the maximum amount of physical memory used by the job, in other words, it is the maximum RSS value allowed by the process. PBS/Torque is generally set to kill the job if the process uses more memory than the job requested in the mem option.

Therefore, it is really important to figure out what is the peak RSS of your process for 2 main objectives:

you don't want your job to be killed for exceeding the requested RSS, thus you need a security threshold
you want to optimize the memory ressources booked on your node to allow the submission of other jobs since this is the value of the mem option that PBS/Torque will consider to submit or queued a job.

A boxplot with the percentage of peak RSS / requested RSS is definitively important to calibrate what amount of memory in the nextflow directive to meet these 2 objectives.

Maybe this is the terminology % Allocated which is misleading since it refers to memory allocation. I suggest maybe to simply use % of requested or % of booked in the legend.

pditommaso commented 5 years ago

I see your point. The main issue is that, it's not defined what kind of memory the directory memory should specify: virtual - or - physical. PBS is physical, SGE is virtual, K8s? Batch? frankly I don't. I guess it depends on how docker limit the memory. Looking the docker docs it seems suggesting that's the physical memory.

If we decide that NF memory represents the physical memory, then we can use it to represent the % Allocated chart instead of the virtual mem.

Does make sense ?

stevekm commented 5 years ago

To be honest I've never really understood what "% Allocated" means, not sure what allocation it's referring to. In all cases in my experience the physical memory in RAM is what you would expect to report and is the useful value, as @phupe describes, since it more directly impacts the success of your tasks and your workflow configuration. Whatever solutions you all come to, I hope it includes making this a more prominent part of the report

phupe commented 5 years ago

I see your point. The main issue is that, it's not defined what kind of memory the directory memory should specify: virtual - or - physical.

Indeed.

If we decide that NF memory represents the physical memory, then we can use it to represent the % Allocated chart instead of the virtual mem.

Does make sense ?

In my opinion, it really makes sense that memory represents the physical memory, or at least this should be the default. This means that the right memory option for all executors must identified. This would be the scenario1 (By the way, virtual_free seems to represent the amount of unused virtual memory according to http://www.univa.com/resources/files/univa_user_guide_univa__grid_engine_854.pdf, but I am not familiar with Grid Engine, other options are h_rss/s_rss for resident set size, or h_vmem/s_vmem for virtual memory size). In this scenario the only one chart for memory resource usage would be displayed with the peak_rss.

We could imagine a scenario2 where a memory.type directive is offered by nextflow. The value could be either rss or vmem, the option of all executors would selected according to this directive. Only one chart for the memory resource usage in the report would be displayed using either the vmem or rss depending on the type of memory set by the user. This scenario may take some time to be developed and I have absolutely no idea how many users would be interested to use it.

As a straightforward scenario0, I would display the 2 charts for both vmem and peak_rss as proposed in my pull request #985. By the way, I would suggest that, even for vmem, you use the peak value used by the process. Only peak values seem interested to me. I think that only 2 lines should be modified as follows in the file ReportSummary.groovy:

mappers.mem = { TraceRecord record -> record.get('peak_vmem') as Double }

final vmem = record.get('peak_vmem') as Double

pditommaso commented 5 years ago

I think the best solution would be scenario 1.

phupe commented 5 years ago

sounds good! Thanks

phupe commented 5 years ago

I have tested the following release: 845e6d1 [origin/master] [release 19.01.0-edge]

That's great you have added the RSS in the timeline html report.

However, I have still noticed some issues:

The memory reported in the timeline is in binary unit (power of 1024) while in the report it is in decimal unit (power of 1000)
The memory report in binary unit is 10 times bigger as it should be. For example, the raw value in the html report says that the peak_rss was 201965568 for task OneCpu200MBRAM. It corresponds to 201965568/1024/1024 = 192.6094 MiB, while in the timeline, the value is 1926 MB. Same thing hold for vmem for the same task. For the task OneCpu2GBRAM, the vmem value in the timeline is correct in binary unit while for peak_rss it is written 19GB. The is the same problem in the trace.csv file. See attached files below.

To reproduce the issue, use the following NF script:

#!/usr/bin/env nextflow

process OneCpu2GBRAM {

    cpus 1

    """
    stress --vm 1 --vm-bytes 2000000000 -t 120
    """
}

process OneCpu200MBRAM {

    cpus 1

    """
    stress --vm 1 --vm-bytes 200000000 -t 120
    """
}

trace.csv.txt report.html.txt timeline.html.txt

pditommaso commented 5 years ago

Umm, I'm not able to replicate this. Using your script I'm getting these numbers

task_id  hash       native_id  name            status     exit  submit                   duration  realtime  %cpu    rss       peak_rss  peak_vmem  vmem      rchar  wchar  read_bytes  write_bytes
2        b1/f21e08  13597      OneCpu2GBRAM    COMPLETED  0     2019-01-14 11:22:28.266  2m 3s     2m        102.2%  1.9 GB    1.9 GB    1.9 GB     1.9 GB    39 KB  329 B  284 KB      0
1        a3/ecbc0c  13596      OneCpu200MBRAM  COMPLETED  0     2019-01-14 11:22:28.256  2m 3s     2m        101.8%  194.3 MB  194.5 MB  220.8 MB   220.8 MB  39 KB  332 B  96 KB       0

Once executed, could you change in the work dir of one task and run NXF_DEBUG=1 bash .command.run and attach here the output (including the file .command.trace)?

phupe commented 5 years ago

find attached the requested information.


>> cat .command.trace
nextflow.trace/v2
realtime=120062
%cpu=1000
rchar=53284
wchar=324
syscr=120
syscw=14
read_bytes=36864
write_bytes=0
%mem=1
vmem=323108
rss=194096
peak_vmem=323204
peak_rss=197224

output-debug.txt

pditommaso commented 5 years ago

This was the 200 MB process, right? The peak_rss=197224 looks fine.

phupe commented 5 years ago

fyi, I reproduce the same problem on my side.

phupe commented 5 years ago

This was the 200 MB process, right? The peak_rss=197224 looks fine.

sorry for not mentioning it, but Yes, it was the 200MB process.

pditommaso commented 5 years ago

therefore where's the problem? in the timeline/html report files? In my case are fine (except the power of 10 vs power of 2 problem).

phupe commented 5 years ago

In the timeline html, I have : OneCpu200MBRAM 2m 1s / 3156 MB (1926 MB)

but I should have instead: OneCpu200MBRAM 2m 1s / 315.6 MB (192.6 MB)`

The values are 10 times bigger as they should be.

pditommaso commented 5 years ago

Also for the run reporting peak_rss=197224 in the trace file ?

phupe commented 5 years ago

I have the same problem with version 18.10.1, I obtained OneCpu200MBRAM 2m 14s / 3155 MB instead of 315.5 MB.

I removed the $HOME/.nextflow dir and restarted with 19.01.0-edge but still the same problem.

Could this be related to java? Here is what I have:

>> java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

pditommaso commented 5 years ago

I would exclude that's a Java error. Also I don't see where it can fail. The mem values are read from the .command.trace here, test here.

Then it's included in the json payload already formatted as a memory unit, test here.

phupe commented 5 years ago

I launch nextflow using the following command:

nextflow-19.01.0-edge-all run nextflow-cpu.nf -with-report report.html -with-dag dag.pdf -with-trace trace.csv -with-timeline timeline.html

I have noticed that you have much more fields in the trace file you posted:

task_id  hash       native_id  name            status     exit  submit                   duration  realtime  %cpu    rss       peak_rss  peak_vmem  vmem      rchar  wchar  read_bytes  write_bytes

I have only:

task_id hash    native_id   name    status  exit    submit  duration    realtime    %cpu rss    vmem    rchar   wchar

Have you used a specific option to get all these outpus in the trace file?

pditommaso commented 5 years ago

Yes, add these fields in your config file:

trace {
  enabled = true
  fields =  ['task_id',
            'hash',
            'native_id',
            'name',
            'status',
            'exit',
            'submit',
            'duration',
            'realtime',
            '%cpu',
            'rss',
            'peak_rss',
            'peak_vmem',
            'vmem',
            'rchar',
            'wchar',
            'read_bytes',
            'write_bytes']
}

You can use any field declared in the TraceRecord class

phupe commented 5 years ago

Thanks.

I have added the lines in my config file.

Still have the same issue.

pditommaso commented 5 years ago

Are you using a containerised execution?

phupe commented 5 years ago

I use local executor without container.

pditommaso commented 5 years ago

Unfortunately I'm not able to replicate this issue. However there are two main possibilities: 1) the mem usage is erroneously computed by the background script, in this case it should be reported an wrong value in the .command.trace file or 2) the mem value is correct in this file and erroneously reported in the timeline/html reports.

phupe commented 5 years ago

I had the issue on centos7. I tried on ubuntu and I am not able to reproduce the issue either. Then I tried to run nextflow within a centos7 singularity container and I don't reproduce the issue either. Very weird.

pditommaso commented 5 years ago

Can you confirm if the .commad.trace contains the correct data when the issue is raised ?

phupe commented 5 years ago

The content of .command.trace is:

>> cat .command.trace 
nextflow.trace/v2
realtime=12021
%cpu=997
rchar=50838
wchar=317
syscr=120
syscw=14
read_bytes=0
write_bytes=0
%mem=9
vmem=323104
rss=146536
peak_vmem=323252
peak_rss=197136

The content of the trace.csv genererated with the option -with-trace trace.csv is:

>> cat trace.csv 
task_id  hash       native_id  name            status     exit  submit                   duration  realtime  %cpu   rss      vmem     rchar   wchar
1        71/e6839d  15547      OneCpu2GBRAM    COMPLETED  0     2019-01-15 11:58:04.949  12.2s     12.1s     99,3%  18 GB    2 GB     496 KB  317 B
2        18/35c53b  15583      OneCpu200MBRAM  COMPLETED  0     2019-01-15 11:58:05.026  12.2s     12s       99,7%  1431 MB  3155 MB  496 KB  317 B

The content of the timeline.html genererated with the option -with-timeline timeline.html is:

OneCpu2GBRAM 12.2s / 2 GB (19 GB)
OneCpu200MBRAM 12.2s / 3155 MB (1925 MB)

The content of source code of the report.html genererated with the option -with-report report.html is:

{"task_id":"1","hash":"71\/e6839d","native_id":"15547","process":"OneCpu2GBRAM","module":"-","container":"-","tag":"-","name":"OneCpu2GBRAM","status":"COMPLETED","exit":"0","submit":"1547549884949","start":"1547549885051","complete":"1547549897142","duration":"12193","realtime":"12110","%cpu":"99.3","%mem":"11.8","rss":"1946517504","vmem":"2130857984","peak_rss":"2001915904","peak_vmem":"2131009536","rchar":"50840","wchar":"317","syscr":"120","syscw":"14","read_bytes":"0","write_bytes":"0","attempt":"1","workdir":"\/tmp\/work\/71\/e6839db705f7df4f06d146a4924b3f","script":"\n    stress --vm 1 --vm-bytes 2000000000 -t 12\n    ","scratch":"-","queue":"-","cpus":"1","memory":"-","disk":"-","time":"-","env":"-","error_action":"-"}

{"task_id":"2","hash":"18\/35c53b","native_id":"15583","process":"OneCpu200MBRAM","module":"-","container":"-","tag":"-","name":"OneCpu200MBRAM","status":"COMPLETED","exit":"0","submit":"1547549885026","start":"1547549885060","complete":"1547549897210","duration":"12184","realtime":"12021","%cpu":"99.7","%mem":"0.9","rss":"150052864","vmem":"330858496","peak_rss":"201867264","peak_vmem":"331010048","rchar":"50838","wchar":"317","syscr":"120","syscw":"14","read_bytes":"0","write_bytes":"0","attempt":"1","workdir":"\/tmp\/work\/18\/35c53be7dbb940c2cceaf5a6b952ef","script":"\n    stress --vm 1 --vm-bytes 200000000 -t 12\n

We see that for vmem for the process OneCpu200MBRAM it is:

vmem=323104 in the .commamd.trace
3155 MB in the trace.csv
3155 MB in the timeline.html
"vmem":"330858496" in the report.html

We are performing other tests on other computers.

Thank you.

pditommaso commented 5 years ago

I think I've understood what is happening here. You have for rss 1431 MB while it should be 143.1 MB and for vmem 3155 MB instead of 315.5 MB ... basically it's missing the dot.

I guess it depends on your local config but surely NF should not care take of local setting and always use the dot a decimal separator. Good one!

pditommaso commented 5 years ago

I've opened a separate issue for this #999.

phupe commented 5 years ago

I have found the problem thanks to your comment! It is a locale issue.

My guess is that you should be able to reproduce the issue if you set export LANG=fr_FR.UTF-8 in the bash environment before launching nextflow.

The decimal point magically appears whether I set export LANG=C, export LANG=, or export LANG=en_US.UTF-8

Happy to see we have an explanation!

pditommaso commented 5 years ago

There's always an explanation ;) The only problem that remains is that execution report are rendered using plotty.js which uses a power 10 scale. Not sure it's possible to fix.

pditommaso commented 5 years ago

I've reviewed this and actually yes, all backends use physical memory to limit the job allocation (expect SGE for which is quite a mess..). Therefore I agree that would be better to use to compute the % Allocated tab in the execution report.

To summarise currently the execution report shows three tab in the Memory Usage chart: 1) Raw usage (that's virtual mem) 2) % Allocated (ratio virtual mem / requested mem) 3) RSS (physical mem used)

It makes sense to update it as follow: 1) Virtual mem 2) Physical mem (rss) 3) % Allocate (ration rss / requested mem)

@phupe @emi80 @ewels @nf-core/core Thoughts?

ewels commented 5 years ago

Yes, sounds good! That was my intention with % Allocated - how much of the mem: requirement was being used. I wasn't aware of the difference between Virtual mem and RSS. I say go for it 👍 - but can I request that we add a sentence in the report explaining what the tabs mean? Or perhaps change % Allocated label to % Physical Allocated...?

Regarding Java locale issues - I feel your pain! I've hit similar bugs for lots of Java programs in MultiQC (I'm looking at you Picard). It's extremely difficult to detect and fix reliably, leading all kinds of nasty hacks.. 😞

pditommaso commented 5 years ago

Yes, it would make sense to add a sentence at the bottom of the chart to explaining it.

Regarding Java locale it's trivial to fix, when you realise that exists more than one language out there :D

ewels commented 5 years ago

Not trivial when the programs generating a file and ingesting the file are separate and can exist on different systems 😉 I think we should all speak the same language, use the same numeric notation and all use the metric system 😆 SI units only!

phupe commented 5 years ago

Thanks guys for your feedback!

Sounds good to me!

Maybe % Allocated could be labelled % RAM Allocated as most of the people are familiar with RAM terminology.

I suggest to use #980 Issue for this discussion about rss/vmem in the report.

Best,

pditommaso commented 5 years ago

Closing this in favour of #980.

pditommaso commented 5 years ago

Included in version 19.02.0-edge.

stevekm commented 5 years ago

Hey I tried it out on the 19.02.0-edge and it looks great. However I am seeing crazy values still for RAM usage; here it says I used upwards of 12TB RAM in some steps, which is obviously not right. Not sure if its related or if I should file a bug report on it? It only started happening after switching to our new RHEL 7 system, using Singularity containers.

pditommaso commented 5 years ago

This sounds weird. if you are able to replicate the issue with this test script open an issue reporting the host and container OS version you are using.

nextflow-io / nextflow