Open nuxwin opened 3 months ago
Hi @nuxwin!
Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.
@nuxwin Does this happen without the Monitor stage?
@mdemoret-nv Of course, yes.
@mdemoret-nv This doesn't seem directly related to the Kafka source stage anyway. I'm wondering if this is due to the asyncio loop. I get the same problem with the pipeline below. For us, this looks like a big problem for production use.
```python
#!/opt/conda/envs/morpheus/bin/python
import logging
import time
from typing import Generator

import click
import pandas as pd

from morpheus.config import Config, CppConfig, PipelineModes
from morpheus.messages.message_meta import MessageMeta
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.pipeline.stage_decorator import source, stage
from morpheus.utils.logger import configure_logging

logger = logging.getLogger(f"morpheus.{__name__}")


@source
def source_generator() -> Generator[MessageMeta, None, None]:
    while True:
        time.sleep(5)
        yield MessageMeta(df=pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}))


@stage
def simple_stage(message: MessageMeta) -> MessageMeta:
    logger.debug(f"simple_stage:\n\n{message.df.to_string()}")
    return message


@click.command()
@click.option(
    "--num_threads",
    default=1,
    type=click.IntRange(min=1),
    help="Number of internal pipeline threads to use.",
)
@click.option(
    "--pipeline_batch_size",
    default=1,
    type=click.IntRange(min=1),
    help="Internal batch size for the pipeline. Can be much larger than the model batch size.",
)
def run_pipeline(num_threads, pipeline_batch_size):
    configure_logging(log_level=logging.DEBUG)
    CppConfig.set_should_use_cpp(False)

    config = Config()
    config.mode = PipelineModes.OTHER
    config.num_threads = num_threads
    config.pipeline_batch_size = pipeline_batch_size

    pipeline = LinearPipeline(config)
    pipeline.set_source(source_generator(config))
    pipeline.add_stage(simple_stage(config))
    pipeline.run()


if __name__ == "__main__":
    run_pipeline()
```
Any news?
> I'm wondering if this is not due to the asyncio loop
I'm suspecting the same thing. It's possible the asyncio loop is just polling for changes, seeing nothing scheduled, and repeating the process until there is some work.
> For us, this looks like a big problem for production use.
Can you elaborate more on why this would be a big problem for production use for you? If the high CPU usage is due to the asyncio loop, it likely is not impacting performance of the pipeline. The loop is only spinning because there is no other work to do. Once messages are in the pipeline, they will occupy the CPU instead of the asyncio loop.
@mdemoret-nv
The problem is the high CPU (core) usage, >=100% all the time. The machine's fans speed up because of this, and I wonder if it could reduce the lifespan of the CPU. I don't think that having a CPU core at 100% is normal, especially when nothing is being processed other than a poll. There should be a sleep or something similar between each poll, assuming the problem comes from the loop. High CPU usage is often encountered in `while(true)` loops that have no sleep, especially when no work is being done.
Hope you get my English.
Yes I understand what you are saying. I agree that the pipeline should not be utilizing 100% of the CPU if there is no work to be processed. We will need to look into why the asyncio loop is consistently spinning. A simple solution could be to schedule a small sleep in the loop when there is no more work.
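The "small sleep when there is no work" idea can be sketched in plain Python. This is only an illustration, not Morpheus code: the `drain` helper and the `idle_sleep` knob are hypothetical names invented for the example.

```python
import queue
import time


def drain(work_queue: "queue.Queue[int]", idle_sleep: float = 0.01) -> list[int]:
    """Poll a queue for work, backing off briefly when idle instead of spinning.

    `idle_sleep` is a hypothetical knob, not a real Morpheus setting; it only
    illustrates scheduling a short sleep when the loop finds nothing to do.
    """
    results = []
    while True:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            # No work scheduled: yield the CPU instead of busy-spinning.
            # Without this sleep, the loop would pin a core at ~100%.
            time.sleep(idle_sleep)
            if work_queue.empty():
                break  # for the demo, stop once the queue stays empty
            continue
        results.append(item * 2)
    return results


q = queue.Queue()
for i in range(3):
    q.put(i)
print(drain(q))  # -> [0, 2, 4]
```

The trade-off is latency: a fixed `idle_sleep` adds up to that much delay before the first new message is noticed, which is why event loops usually prefer blocking on a readiness notification over sleeping.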
I was wondering if there was anything specific to your deployment where 100% CPU utilization would cause problems beyond the added energy use and wear and tear. For example, some environments utilize the CPU utilization to scale their system. If the CPU was always at 100%, then it would scale infinitely which would be a problem. And the solution I suggested above may not work in that environment.
@nuxwin Also note that `top` by default shows the sum of utilization across all CPU cores, so if you have 12 cores, the maximum utilization would be 1200%. You can check the number of cores with `nproc --all`. You can also press `Shift+i` while in `top` to see the average utilization per core.
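The arithmetic behind the two `top` display modes is simple enough to show directly. The `solaris_mode` helper below is a made-up name for illustration, not a real tool:

```python
def solaris_mode(irix_percent: float, num_cores: int) -> float:
    """Convert top's default summed (Irix-mode) CPU percentage to the
    per-core average that Shift+i (Solaris mode) displays."""
    return irix_percent / num_cores


# One fully busy core on a 12-core machine: top's default view shows 100%
# (out of a possible 1200%), while the per-core average is only ~8.3%.
print(round(solaris_mode(100.0, 12), 1))  # -> 8.3
```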
> I was wondering if there was anything specific to your deployment where 100% CPU utilization would cause problems beyond the added energy use and wear and tear. For example, some environments utilize the CPU utilization to scale their system. If the CPU was always at 100%, then it would scale infinitely which would be a problem. And the solution I suggested above may not work in that environment.
We are developing for financial entities, among others. Our clients use ESXi VMs (with NVIDIA vGPUs). They won't accept such CPU usage on an "idle" pipeline.
Thank you for your time. That's much appreciated.
> @nuxwin Also note that `top` by default shows the sum of utilization across all CPU cores, so if you have 12 cores, the maximum utilization would be 1200%. You can check the number of cores with `nproc --all`. You can also press `Shift+i` while in `top` to see the average utilization per core.
I'm talking about a single CPU core that is 100% used, not about the average CPU usage ;) So yes, of course, on a machine with 10 cores the average usage would drop to 10%. But the problem remains: one core is 100% used, all the time.
Version
24.3
Which installation method(s) does this occur on?
Docker, Source
Describe the bug.
100% CPU (core) usage where normal CPU usage is expected.
Increasing the value of `poll_interval` doesn't change anything, even when set to 2s.
Minimum reproducible example
Relevant log output
Full env printout
Other/Misc.
Code of Conduct