B-Step62 opened this issue 3 weeks ago
Hi @B-Step62, let me work on this feature. Thank you!
@B-Step62 I'm currently encountering an error related to `tensorflow-cpu`. It seems https://github.com/mlflow/mlflow/pull/10998 tried to address the issue, but that PR hasn't been merged yet. I'll follow the solution provided by that PR, but just wanted to let you know that there is a bug in the installation of `tensorflow-cpu`, and we hit that bug when following CONTRIBUTING.md.

As that PR is almost a year old, I can raise a follow-up PR when this task finishes, if necessary.
```
ERROR: Could not find a version that satisfies the requirement tensorflow-cpu<=2.12.0 (from versions: none)
ERROR: No matching distribution found for tensorflow-cpu<=2.12.0
```
[Update] That error seems to be related to Apple Silicon. Only TensorFlow 2.13.0 or higher supports Apple Silicon (article).
@y-okt Thanks for the heads-up! Yes, the TensorFlow installation error on Apple Silicon has been around for a while. It would be highly appreciated if you could find a good fix, but please go ahead with excluding the dependency for now and prioritize this FR🙂
Got it, thank you for your advice! @B-Step62
@B-Step62 Hello, I created a design doc. Could you please review it? If there is no problem, I'll proceed with the implementation. Thanks!
@y-okt Thank you so much for writing it up! The design choice sounds reasonable to me. Just a bit of additional intro about the library would be appreciated, so that other folks can learn the context quickly. Anyway, please feel free to start the implementation with the recommended option😄
Thank you @B-Step62 for reviewing! Got it, I'll add more details to the documentation as well as start the implementation!
@B-Step62 It seems that smolagents doesn't support Python 3.9 (reference). I also tested locally and confirmed there is no installable version. From a quick look at this repository, the Python version is set to 3.9, which means we can't install `smolagents`. Is there any way to use Python 3.10 in this repository? If not, one option is to make MLflow support Python 3.10, but I think this needs to be done carefully so that it won't cause any breaking changes due to incompatibility. Could you share your opinion?
```
ERROR: Ignored the following versions that require a different python version: 0.1.0 Requires-Python >=3.10; 0.1.2 Requires-Python >=3.10; 0.1.3 Requires-Python >=3.10; 1.0.0 Requires-Python >=3.10; 1.1.0 Requires-Python >=3.10; 1.10.0 Requires-Python >=3.10; 1.11.0 Requires-Python >=3.10; 1.12.0 Requires-Python >=3.10; 1.13.0 Requires-Python >=3.10; 1.14.0 Requires-Python >=3.10; 1.2.0 Requires-Python >=3.10; 1.2.1 Requires-Python >=3.10; 1.2.2 Requires-Python >=3.10; 1.3.0 Requires-Python >=3.10; 1.4.0 Requires-Python >=3.10; 1.4.1 Requires-Python >=3.10; 1.5.0 Requires-Python >=3.10; 1.5.1 Requires-Python >=3.10; 1.6.0 Requires-Python >=3.10; 1.7.0 Requires-Python >=3.10; 1.8.0 Requires-Python >=3.10; 1.8.1 Requires-Python >=3.10; 1.9.0 Requires-Python >=3.10; 1.9.1 Requires-Python >=3.10; 1.9.2 Requires-Python >=3.10
ERROR: Could not find a version that satisfies the requirement smolagents (from versions: none)
```
> Is there any way to use Python 3.10 in this repository? If not, one option is to make MLflow support Python 3.10, but I think this needs to be done carefully so that it won't cause any breaking changes due to incompatibility. Could you share your opinion?
@y-okt MLflow supports Python 3.10; 3.9 is just the minimum version (ref). You can set up a Python 3.10-based environment locally.

Later you will also need to configure the CI tests accordingly. The majority of CI jobs run on 3.9, but you can configure specific jobs to run on Python 3.10. For example, this line configures the OpenAI autologging tests to run with Python 3.10 when the openai SDK is >= 1.33.
@B-Step62 Got it, thank you for your advice!
@B-Step62 Hi, I'm still encountering the following errors, and I suspect there is something wrong with my logic in `on_start` and `on_end`. Could you please review it? (code) I suspect the bug is related to how the `request_id` is populated in `end_span`.

Error 1: this line raising `TypeError: the JSON object must be str, bytes or bytearray, not NoneType`.

Error 2: the error around `end_span`: `mlflow.exceptions.MlflowException: Span with ID 12036168015234431773 is not found or already finished.`
@y-okt Is it possible to share the full stack trace?
On the surface, I suspect this is not the proper way to retrieve the request ID:
```python
if span._parent is None:
    request_id = str(span.context.trace_id)  # Use otel-generated trace_id as request_id
else:
    request_id = self._trace_manager.get_request_id_from_trace_id(span.context.trace_id)
```
The span returned from `start_span` / `start_trace` inside `on_start` should have the request ID, and we should use that here. This means that we probably need to store the mapping from OTel trace ID to request ID in the span processor as well, similarly to the token.
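Something like this minimal sketch is what I mean (the dict and method bodies are illustrative, not existing MLflow internals; `MlflowClient.start_trace` is the API mentioned above):

```python
from mlflow import MlflowClient


class SmolagentsSpanProcessor:
    """Illustrative sketch: remember the MLflow request ID per OTel trace."""

    def __init__(self):
        self._client = MlflowClient()
        # New state: OTel trace_id -> MLflow request_id, filled in on_start.
        self._trace_id_to_request_id = {}

    def on_start(self, otel_span, parent_context=None):
        if otel_span.parent is None:
            # The root span starts the MLflow trace; record its request ID.
            root = self._client.start_trace(name=otel_span.name)
            self._trace_id_to_request_id[otel_span.context.trace_id] = root.request_id

    def on_end(self, otel_span):
        # Later callbacks look the request ID up instead of deriving it
        # from the OTel trace_id.
        request_id = self._trace_id_to_request_id.get(otel_span.context.trace_id)
```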
@B-Step62 Thank you for your suggestion, I'll try that. This is the full log; it's extremely long, so I pasted it in a Google doc. Thanks!
After changing that, the first error still exists, and it looks like it's due to the lack of `SpanAttributeKey.REQUEST_ID` in the attributes of the otel_span (code). This is triggered by `start_trace` (code). I also checked the contents of the attributes (pasted below), and there is no `request_id` field.

One solution seems to be setting the `request_id` attribute directly on the otel_span, but I found a comment in the file stating that the user shouldn't write directly into the attributes of the otel_span (comment).
```
attributes {'input.value': '{"task": "Could you give me the 118th number in the Fibonacci sequence?", "stream": false, "reset": true, "images": null, "additional_args": null, "max_steps": null}', 'smolagents.max_steps': 20, 'smolagents.tools_names': ('web_search', 'visit_webpage', 'final_answer')}
```
Hmm, I see.

> This is triggered by `start_trace` (code).

This actually surfaces another issue. Our assumption was that `mlflow.get_current_active_span` returns `None` when `on_start` is first triggered. However, it seems that assumption was wrong: OpenInference has already started an OTel span and set it as the active span. Therefore, `otel_span = trace_api.get_current_span()` returns that span, which doesn't have a request ID.

It seems we shouldn't rely on `mlflow.get_current_active_span`; instead, let OpenInference construct the parent-child relationship:
```python
def on_start(self, span):
    parent = span.parent
    if parent is not None:
        parent_mlflow_span = self._otel_span_id_to_mlflow_span[parent.span_id]
        mlflow_span = ...  # Call client.start_span() with that parent
    else:
        mlflow_span = ...  # Call client.start_trace()
    self._otel_span_id_to_mlflow_span[span.context.span_id] = mlflow_span
```
Can we try whether this approach works?
If this doesn't work, we may want to consider a patch-based approach too.... Looking at `SmolagentsInstrumentor`, there aren't too many things we need to patch.
@B-Step62 I changed the logic (code), but for some reason the test result is wrong, reporting the number of traces as 0 🤔. Also, the values of `parent_mlflow_span.request_id` and `parent_mlflow_span.parent_id` are `MLFLOW_NO_OP_SPAN_REQUEST_ID` and `None`. This is probably the cause. I'm not sure why these values always end up the same, but they are presumably set by `MlflowClient().start_span` and `MlflowClient().start_trace`.
```python
def test_smolagents_invoke_simple(monkeypatch, autolog):
    # with patch.object(InferenceClientModel, "__call__", return_value=DUMMY_OUTPUT):
    autolog()
    monkeypatch.setattr(
        "smolagents.InferenceClientModel.__call__",
        lambda self, *args, **kwargs: DUMMY_OUTPUT,
    )
    model = InferenceClientModel(model_id="gpt-3.5-turbo", token="test_id")
    agent = CodeAgent(tools=[], model=model, add_base_tools=True)
    agent.run("Could you give me the 118th number in the Fibonacci sequence?")
    print("finished agent run")
    traces = get_traces()
>   assert len(traces) == 1
E   assert 0 == 1
E    +  where 0 = len([])

agent       = <smolagents.agents.CodeAgent object at 0x144a1ae30>
autolog     = <function autolog at 0x144a1d120>
model       = <smolagents.models.InferenceClientModel object at 0x144a1a110>
monkeypatch = <tests.conftest.ExtendedMonkeyPatch object at 0x144a1abf0>
traces      = []
```
> If this doesn't work, we may want to consider a patch-based approach too....

Thank you for your suggestion, I'll proceed with the patch-based approach instead.
@B-Step62 Given this huge design change, I'm not sure I can finish this by today. May I submit the PR tomorrow? (I'll try my best, but I'm not sure I can finish it by tomorrow either, because I'll be working tomorrow....)
> Also, the values of `parent_mlflow_span.request_id` and `parent_mlflow_span.parent_id` are `MLFLOW_NO_OP_SPAN_REQUEST_ID` and `None`.
I see. `MLFLOW_NO_OP_SPAN_REQUEST_ID` indicates something went wrong in the OTel span creation process.
> @B-Step62 Given this huge design change, I'm not sure I can finish this by today. May I submit the PR tomorrow? (I'll try my best, but I'm not sure I can finish it by tomorrow either, because I'll be working tomorrow....)
Sure, no problem. I understand this is eating into your free time; thank you so much for dedicating effort to this! Do you think you can file the PR this week?
For the patch-based approach, you can refer to the crewai tracing logic.
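Roughly, the shape would be something like the sketch below. `FLAVOR_NAME` and the choice of `MultiStepAgent.run` as the patch target are assumptions, not a worked-out design; `safe_patch` and `mlflow.start_span` are the helpers the existing integrations use:

```python
import mlflow
from mlflow.utils.autologging_utils import safe_patch

FLAVOR_NAME = "smolagents"  # assumed integration name


def _patched_run(original, self, *args, **kwargs):
    # Wrap the agent run in an MLflow span instead of relying on
    # OpenInference's OTel spans.
    with mlflow.start_span(name=self.__class__.__name__) as span:
        span.set_inputs({"args": args, "kwargs": kwargs})
        result = original(self, *args, **kwargs)
        span.set_outputs(result)
        return result


def autolog():
    # Registration details (autologging_integration decorator, config
    # flags, etc.) are omitted for brevity.
    from smolagents import MultiStepAgent  # assumed common base of the agents

    safe_patch(FLAVOR_NAME, MultiStepAgent, "run", _patched_run)
```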
Thank you for your understanding and advice! Yes, I think I can file a PR within this week. @B-Step62
@B-Step62 Created a PR. Could you please review it? I'll add documentation later.
Moreover, when I installed mlflow locally via `pip install -e <local path>/mlflow`, ran autologging (I tested with sklearn's autolog), then ran `pip install mlflow` (either v2 or v3), and finally ran `mlflow ui`, it failed with the following error. Since the function I tested isn't smolagents-related at all (just sklearn), and everything works when I skip the local install and use plain `pip install mlflow`, I suspect this is a bug in the local development setup. Because of this, I can't verify anything locally in the UI beyond unit tests. Has anyone else observed this in their environment? I couldn't find anything online.
```
2025/05/01 12:10:40 ERROR mlflow.server: Exception on /ajax-api/2.0/mlflow/runs/search [POST]
Traceback (most recent call last):
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/server/handlers.py", line 590, in wrapper
    return func(*args, **kwargs)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/server/handlers.py", line 631, in wrapper
    return func(*args, **kwargs)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/server/handlers.py", line 1053, in _search_runs
    response_message = search_runs_impl(request_message)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/server/handlers.py", line 1069, in search_runs_impl
    run_entities = _get_tracking_store().search_runs(
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/store/tracking/abstract_store.py", line 576, in search_runs
    runs, token = self._search_runs(
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 1010, in _search_runs
    run_infos = self._list_run_infos(experiment_id, run_view_type)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 967, in _list_run_infos
    run_info = self._get_run_info_from_dir(r_dir)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 758, in _get_run_info_from_dir
    return _read_persisted_run_info_dict(meta)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/store/tracking/file_store.py", line 167, in _read_persisted_run_info_dict
    return RunInfo.from_dictionary(dict_copy)
  File "/Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/entities/_mlflow_object.py", line 27, in from_dictionary
    return cls(**filtered_dict)
TypeError: RunInfo.__init__() missing 1 required positional argument: 'run_uuid'
```
@B-Step62 I created a completely fresh clone of the repository (without any changes) locally and used that for sklearn autologging, but the same error happened...
@y-okt Do you have an `mlruns` directory in the location where you run `mlflow ui`? I haven't seen this error before, but it indicates that a persisted run record is in a malformed format. We probably need to clean up all those artifacts, not just re-install mlflow.
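For example, something like this (assuming the default `./mlruns` file store in the working directory) would wipe the stale records before re-running the UI:

```python
import shutil
from pathlib import Path

# Irreversibly deletes all locally persisted runs/experiments, so the UI
# starts from a clean file store afterwards.
mlruns = Path("mlruns")
if mlruns.exists():
    shutil.rmtree(mlruns)
```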
@B-Step62 Yes, I have an `mlruns` directory, and even after cleaning up `mlruns` -> installing locally and running the sklearn autolog script -> reinstalling mlflow@3 -> running the MLflow UI, it didn't succeed.
@y-okt Can you verify your local branch is based on the latest `master`? The `run_uuid` field in the error message was removed from the `RunInfo` object in a recent push to the 3.0 branch, so I suspect that change is not reflected in your local branch.
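A quick way to check which mlflow build the virtualenv is actually importing:

```python
import mlflow

print(mlflow.__version__)  # should show the dev version from your branch
print(mlflow.__file__)     # with `pip install -e`, this should point into your checkout
```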
@B-Step62 Yes, I pulled the latest `master`, but still no luck. However, https://github.com/mlflow/mlflow/pull/15342 doesn't seem to be included in `v3.0.0rc0`, and thus:

1. With `v3.0.0rc0`, it fails.
2. When I run `yarn start`, this error happens (https://github.com/mlflow/mlflow/issues/10942).

As 1 seems unsolvable as long as #15342 hasn't been released in any version yet, I think I need to solve 2.
@y-okt What does `cat /Users/y-okt/ghq/github.com/y-okt/mlflow/.venvs/mlflow-dev/lib/python3.10/site-packages/mlflow/entities/run_info.py` output?
> As 1 seems unsolvable as long as https://github.com/mlflow/mlflow/pull/15342 hasn't been released in any version yet, I think I need to solve 2.
@B-Step62 Thank you! Upgrading to 3.0.0rc1 made it succeed!
### Willingness to contribute

No. I cannot contribute this feature at this time.

### Proposal Summary

#### Summary

Expand the auto-tracing integration of MLflow Tracing to smolagents by Hugging Face.

#### Required Changes

Please refer to "how to add a new integration to MLflow Tracing" for the actual steps of adding a new auto-tracing integration. MLflow maintainers will provide attentive support for designing and implementing the change.

#### Expected Behavior
- `mlflow.smolagents.autolog()`: Enable auto-tracing
- `mlflow.smolagents.tracing()`
```python
from smolagents import (
    CodeAgent,
    ToolCallingAgent,
    DuckDuckGoSearchTool,
    VisitWebpageTool,
    InferenceClientModel,
)

model = InferenceClientModel()

search_agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=model,
    name="search_agent",
    description="This is an agent that can do web search.",
)

manager_agent = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[search_agent],
)
manager_agent.run(
    "If the US keeps its 2024 growth rate, how many years will it take for the GDP to double?"
)
```
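With the proposed API, the snippet above would then be traced by a single call beforehand (this API does not exist yet; it is what this FR proposes):

```python
import mlflow

mlflow.smolagents.autolog()  # proposed API: traces agent, tool, and model calls automatically
```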