kijai / ComfyUI-LivePortraitKJ

ComfyUI nodes for LivePortrait
MIT License
744 stars 51 forks

Got it working on MacBook (MPS) #33

Open Grant-CP opened 1 week ago

Grant-CP commented 1 week ago

I have a version of these nodes working via MPS for those with macbooks. On my M1 Pro 32GB it took 60 seconds for 32 frames and 650 seconds for 600 frames. So about 1 second per animation frame.

Repo is here. I welcome all pull requests from other macbook users. https://github.com/Grant-CP/ComfyUI-LivePortraitKJ-MPS

For @kijai: I'm not sure these changes can be merged into the main repo, as there are a few places where I changed numerical precision, so I would expect my version to have slightly worse performance. I am more than happy to talk about my changes if you want to start supporting MPS as a backend when you make nodes. Sorry for not forking correctly; I'm still learning how to use git.

tryx78 commented 1 week ago

It works for me now with @Grant-CP's repo, thanks! I'm using an M3 Pro 14-inch with 18 GB of RAM.

Grant-CP commented 1 week ago

@tryx78 Glad it's working for you! Can you tell me how long it takes? Just the overall time for prompt evaluation and how many frames you set in the video import node.

Also, can you confirm whether it runs fine without launching ComfyUI as PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py to enable the CPU fallback? I believe that was necessary on my M1, but I forgot to write it in the repo originally. @melMass I'm not sure if setting an environment variable like this fits into your patch solution.

@melMass Thanks, good idea with the patch. A change I didn't make is that the code is still using CUDA as the execution provider for the onnx networks in that file. I think it will just fall back to CPU, but it would be nice to not have the error message anymore. I know onnxruntime also has a CoreML execution provider, which I believe is fairly new but would be great to use if it works.
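
For what it's worth, onnxruntime takes an ordered provider list, so a sketch like this (the model path is made up) should prefer CoreML and fall back to CPU without the CUDA error message:

import onnxruntime

# Ordered by preference; onnxruntime falls back down the list if a provider is unavailable.
session = onnxruntime.InferenceSession(
    "landmark.onnx",  # hypothetical model path
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)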

tryx78 commented 1 week ago

@Grant-CP Now I get this error:

Error occurred when executing LivePortraitProcess: The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

How do I fix this?

tryx78 commented 1 week ago

@Grant-CP 48 frames in 38.80 seconds with PYTORCH_ENABLE_MPS_FALLBACK=1, and 48 frames in 53.46 seconds without PYTORCH_ENABLE_MPS_FALLBACK=1.

Grant-CP commented 1 week ago

@tryx78 Thanks so much! I added the Pytorch fallback to the README of my repo.

Good to see that your M3 is so much faster. Let me know if you run into any other issues!

kijai commented 1 week ago

I changed all the hardcoded CUDA stuff to use the comfy detection, and put that one tensor operation in a try/except block as it probably fails on MPS, but I can't test whether that's enough.
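
Roughly this pattern (just a sketch with stand-in shapes; the real op is the 3D grid_sample in dense_motion.py):

import torch
import torch.nn.functional as F
import comfy.model_management as mm

device = mm.get_torch_device()  # cuda, mps or cpu depending on the machine

# Stand-in tensors with the 5D shapes that hit aten::grid_sampler_3d
feature = torch.randn(1, 4, 16, 64, 64, device=device)
grid = torch.rand(1, 16, 64, 64, 3, device=device) * 2 - 1

try:
    out = F.grid_sample(feature, grid, align_corners=False)
except NotImplementedError:
    # no MPS kernel for the 3D grid sampler: run on CPU and move back
    out = F.grid_sample(feature.cpu(), grid.cpu(), align_corners=False).to(device)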

Grant-CP commented 1 week ago

@kijai Great, I've looked through the changes and I will test them out later.

In line 62 of liveportrait/modules/dense_motion.py, if you want to change the code for the assertion error I believe the error is: AttributeError: module 'torch.mps' has no attribute 'FloatTensor'.

It's a silly error that's been in PyTorch for multiple years. Presumably it will change from AttributeError to something more reasonable in a future version of PyTorch.
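
If the line in question is the usual FOMM-style .type(other.type()) call (an assumption on my part, I haven't double-checked the exact line), a device-agnostic rewrite avoids the torch.mps.FloatTensor lookup entirely:

import torch

device = 'mps' if torch.backends.mps.is_available() else 'cpu'
heatmap = torch.randn(1, 21, 16, 64, 64, device=device)  # stand-in for the real tensor

# before (breaks on MPS because .type() resolves to 'torch.mps.FloatTensor', which doesn't exist):
# zeros = torch.zeros(1, 1, 16, 64, 64).type(heatmap.type())

# after (works on cuda, mps and cpu):
zeros = torch.zeros(1, 1, 16, 64, 64, dtype=heatmap.dtype, device=heatmap.device)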

Grant-CP commented 1 week ago

@kijai I had a chance to test it and I get a torch autocast error. This might be a bug on comfy's end as it seems to me like their get_autocast_device should handle mps not being supported. See the error here:

Traceback (most recent call last):
  File "/Users/grant/Documents/Repos/ComfyUI/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/nodes.py", line 283, in process
    cropped_frames, full_frame = pipeline.execute(img, driving_images_np)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_pipeline.py", line 53, in execute
    x_s_info = self.live_portrait_wrapper.get_kp_info(I_s)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_wrapper.py", line 95, in get_kp_info
    with torch.autocast(device_type=get_autocast_device(self.device_id), dtype=torch.float16, enabled=self.cfg.flag_use_half_precision):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'

In my repo I was only able to get it working by disabling autocast entirely (not even setting device to cpu works). For example:

with torch.no_grad():
            #HACK
            # with torch.autocast(device_type='cpu', dtype=torch.float16, enabled=self.cfg.flag_use_half_precision):
            #     feature_3d = self.appearance_feature_extractor(x)
            feature_3d = self.appearance_feature_extractor(x)

Again, I'm not sure if there's a way to fix this with comfy's autocast manager. I won't have time to look into that today.
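
If we did want to keep autocast for CUDA users, a sketch like this (the helper name is made up) would drop to a no-op context only on devices that can't autocast:

from contextlib import nullcontext
import torch

def maybe_autocast(device_type, enabled):
    # hypothetical helper: use real autocast only where it's supported
    if enabled and device_type == 'cuda':
        return torch.autocast(device_type='cuda', dtype=torch.float16)
    return nullcontext()

# usage, roughly:
# with torch.no_grad(), maybe_autocast(device_type, self.cfg.flag_use_half_precision):
#     feature_3d = self.appearance_feature_extractor(x)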

kijai commented 1 week ago

I've done it like that before too, yeah. I can just use the model manager's MPS detection to skip the whole autocast conditionally.

kijai commented 1 week ago

@Grant-CP Can you try now? Skipping the whole autocast based on the dtype.

Grant-CP commented 1 week ago

@kijai I'm getting the same error. Truncated error is below. I don't believe I've set any half_precision flags so I assume that comes from elsewhere in the code. I am using the fp16 models though, as I assume most people would.

Do you think try: ... except RuntimeError: would be bad? Another option would be to use torch.backends.mps.is_available(), which should return True only on Macs. Not sure if there's another issue you are trying to solve by moving the half_precision flag out of the autocast() call, though.

File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_pipeline.py", line 53, in execute
    x_s_info = self.live_portrait_wrapper.get_kp_info(I_s)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/live_portrait_wrapper.py", line 94, in get_kp_info
    with torch.autocast(get_autocast_device(self.device_id), dtype=torch.float16) if self.cfg.flag_use_half_precision else nullcontext():
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 241, in __init__
    raise RuntimeError(
RuntimeError: User specified an unsupported autocast device_type 'mps'

kijai commented 1 week ago

This is with fp32 selected though? I should probably automate that.

Grant-CP commented 1 week ago

@kijai My apologies, I had the model loader pipeline set to fp16. So the idea of this code is to force MPS users to use fp32 for these particular models, because MPS doesn't support autocast. My mistake! I think some other parts of ComfyUI will print a message like "mixed precision is not supported on this device, reverting to full precision". If you wanted to be nice to Mac users, you could check cfg.flag_use_half_precision and, if it's true, throw a descriptive error message.
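
Something like this early in the pipeline would do it (a sketch; the helper name is made up, the flag is the same cfg.flag_use_half_precision from the tracebacks above):

import torch

def check_mps_precision(flag_use_half_precision):
    # fail early with a readable message instead of the autocast RuntimeError
    if flag_use_half_precision and torch.backends.mps.is_available():
        raise ValueError(
            "fp16 is not supported on MPS because torch.autocast has no MPS backend; "
            "please select fp32 in the model loader on Apple Silicon."
        )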

It will be interesting to test whether setting it to fp32 will hurt performance. I'm not sure what the default was in the original repo, but I assume it was fp16 since I was also running into an error on this line originally?

Anyways, we are down to the last error, which is:

ile "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/modules/dense_motion.py", line 81, in forward
    deformed_feature = self.create_deformed_feature(feature, sparse_motion)  # (bs, 1+num_kp, c=4, d=16, h=64, w=64)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/modules/dense_motion.py", line 50, in create_deformed_feature
    sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/torch/nn/functional.py", line 4353, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

I solved this one by running my ComfyUI with PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py. One user near the top of this thread sounded like they may have gotten it to work, more slowly, without this flag on their M3 MacBook, but they weren't totally clear.

An option might be to move both feature_repeat and sparse_motion to cpu before this operation? I assume that's what the fallback flag does. Unfortunately the pytorch function grid_sample looks like just a wrapper for non-python code.

Grant-CP commented 1 week ago

@kijai I can confirm that the node works with the env flag set. So your node is just as functional. I do think it is marginally slower, maybe because of the fp32 setting on the model loader? For example 57 seconds originally vs 60 seconds now for 32 frames.

I think it would be good to skip the autocast context on MPS, rather than relying on MPS users loading everything in fp32. I'm going to check and see if I can skip the fallback call, though.

kijai commented 1 week ago

I added an auto option as a precision choice to the loader; all it does is change the flag. With fp32 the autocast context should be skipped.

kijai commented 1 week ago

I have always been confused about whether MPS supports fp16 at all. From what I understand it should, and it's just torch autocast that doesn't? If we can't use autocast, it would probably take more work overall to manually cast things.
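
Manual casting would look roughly like this (an untested sketch, with a toy layer standing in for the real networks):

import torch
import torch.nn as nn

net = nn.Conv2d(3, 8, 3, padding=1).to('mps').half()  # cast the weights to fp16
x = torch.randn(1, 3, 64, 64, device='mps', dtype=torch.float16)  # inputs must match the weight dtype

with torch.no_grad():
    y = net(x)
print(y.dtype)  # torch.float16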

Also I think fp32 is just default in this code until the autocasts anyway. I'm not really an expert in all that.

Grant-CP commented 1 week ago

@kijai I'm pretty sure mps supports float16. See the following code:

import torch

a = torch.Tensor([[1., 1.]]).to('mps')
a = a.to(torch.float16)
print(a.dtype)   # torch.float16
a = a.to(torch.bfloat16)
print(a.dtype)   # torch.bfloat16
print(a.device)  # mps:0

I'm pretty sure some stuff was being done in float16 before, since your node is slightly slower, which I confirmed at a few more sizes. It could also be related to other parts of the code, though.

Grant-CP commented 1 week ago

@kijai I got it working without the flag. Again it's a little bit slower than running my repo with the fallback flag, but I suspect it is because of the precision. For example, originally 600 frames was 650 seconds, now it is 692 seconds. Not the worst drop though. See the following code:

In dense_motion.py line 50

def create_deformed_feature(self, feature, sparse_motions):
        bs, _, d, h, w = feature.shape
        feature_repeat = feature.unsqueeze(1).unsqueeze(1).repeat(1, self.num_kp+1, 1, 1, 1, 1, 1)      # (bs, num_kp+1, 1, c, d, h, w)
        feature_repeat = feature_repeat.view(bs * (self.num_kp+1), -1, d, h, w)                         # (bs*(num_kp+1), c, d, h, w)
        sparse_motions = sparse_motions.view((bs * (self.num_kp+1), d, h, w, -1))                       # (bs*(num_kp+1), d, h, w, 3)
        #HACK
        if torch.backends.mps.is_available():
            print('converting mps tensors to cpu for grid_sample')
            feature_repeat = feature_repeat.to('cpu')
            sparse_motions = sparse_motions.to('cpu')
            sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False).to('mps')
        else:
        #HACK END
            sparse_deformed = F.grid_sample(feature_repeat, sparse_motions, align_corners=False)
        sparse_deformed = sparse_deformed.view((bs, self.num_kp+1, -1, d, h, w)) 

In util.py line 157

def forward(self, x):
        out = self.conv(x)
        out = self.norm(out)
        out = F.relu(out)
        #HACK
        try:
            out = self.pool(out)
        except NotImplementedError:
            out = self.pool(out.to('cpu')).to('mps')
        #HACK End
        return out

In warping_network.py line 12, then line 49

#HACK
import torch.backends.mps as mps

#line 49
def deform_input(self, inp, deformation):
        #HACK
        if mps.is_available():
            return F.grid_sample(inp.to('cpu'), deformation.to('cpu'), align_corners=False).to('mps')
        #HACK END
        return F.grid_sample(inp, deformation, align_corners=False)

I assume replacing the mps.is_available() call with a comfyui alternative would be good. From brief testing this seems about the same speed as the pytorch fallback flag?

A better code option might be to create a wrapper for F.grid_sample that puts the inputs on the CPU when on MPS and then returns an MPS tensor, or to define an mps_grid_sample higher up that does the same thing and import and use it in the two places where we need it. The other necessary unsupported operation is nn.AvgPool3d, which is pretty silly not to have support for. I don't think I'll have time in the next few weeks to rewrite the code to avoid the 3D pool, and I don't understand the grid_sample well enough to replace it.
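
Concretely, the grid_sample wrapper I have in mind might look something like this (a sketch; it isn't in my repo yet):

import torch
import torch.nn.functional as F

def mps_grid_sample(inp, grid, **kwargs):
    # detour through the CPU only when the input lives on MPS, where the 3D kernel is missing
    if inp.device.type == 'mps':
        return F.grid_sample(inp.to('cpu'), grid.to('cpu'), **kwargs).to(inp.device)
    return F.grid_sample(inp, grid, **kwargs)

# dense_motion.py and warping_network.py would then both call:
# sparse_deformed = mps_grid_sample(feature_repeat, sparse_motions, align_corners=False)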

Grant-CP commented 1 week ago

@kijai If there's a good way to set that fallback environment variable just for the execution of this node, then that's probably a better way to make this future-proof. My manual swapping of tensors doesn't seem to be faster (or, surprisingly, much slower).

Another, better way to write my code would be to have all three blocks use the try/except NotImplementedError pattern. I think I like that the best, and it would set us up well for the future.
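
Something like one shared helper instead of three separate hacks (a sketch; the function name is made up):

def run_with_cpu_fallback(op, *tensors, **kwargs):
    # try on the tensors' own device; retry on CPU if the MPS kernel is missing
    try:
        return op(*tensors, **kwargs)
    except NotImplementedError:
        device = tensors[0].device
        return op(*[t.to('cpu') for t in tensors], **kwargs).to(device)

# e.g. run_with_cpu_fallback(F.grid_sample, inp, deformation, align_corners=False)
#      run_with_cpu_fallback(self.pool, out)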

cchance27 commented 1 week ago

Silly question, but given the small size of the insightface and LivePortrait models, has anyone tried converting them over to CoreML? The ANE would likely be the fastest way to run things, no?

Grant-CP commented 1 week ago

@cchance27 Can you point to any other places where you've seen that done? I have no idea where to start, but I had thought the same thing. CoreML requires using Swift right?

x4080 commented 1 week ago

@Grant-CP No, Python can call the C code for CoreML, I think.

Edit: using coremltools

Grant-CP commented 6 days ago

@x4080 Thanks I read through that a bit. It looks like converting from onnx to CoreML is a little annoying at the moment but definitely very possible.

I also see in https://github.com/deepinsight/insightface/issues/2238 that insightface onnx seems to be supported with CoreML as the execution provider on recent versions of the onnx runtime. I'll probably try that as a first option. I doubt it's as fast as full conversion and compilation, but I bet it's way better than CPUExecutionProvider.

Grant-CP commented 6 days ago

Looks like the CoreMLExecutionProvider can work for some parts, but not for the main FaceAnalysisDIY call. I get the error below. It sounds to me like CoreML expects statically sized tensors at each step of the way. Where the CoreML runtime is getting the idea for what size this tensor should be, I have no idea. I tried adjusting the size of the input image, the input video, and the number of video frames, and none of them changed the size of this tensor. I also made sure I had the latest onnx and onnxruntime.

So the moral of the story is that I'm not going to work on CoreML for now. While the cropper can work on CoreML, the main meat of the process cannot. I think the main changes we will make with @kijai are to try to support fp16 on Mac and to make inference work without the fallback flag, both of which have implementations described in this thread.

File "/Users/grant/Documents/Repos/ComfyUI/custom_nodes/ComfyUI-LivePortraitKJ/liveportrait/utils/face_analysis_diy.py", line 47, in get
    bboxes, kpss = self.det_model.detect(img_bgr, max_num=max_num, metric='default')
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py", line 224, in detect
    scores_list, bboxes_list, kpss_list = self.forward(det_img, self.det_thresh)
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/insightface/model_zoo/retinaface.py", line 152, in forward
    net_outs = self.session.run(self.output_names, {self.input_name : blob})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/comfyui/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running CoreML_1312385619456144913_6 node. Name:'CoreMLExecutionProvider_CoreML_1312385619456144913_6_6' Status Message: Exception: /Users/runner/work/1/s/onnxruntime/core/providers/coreml/model/model.mm:71 InlinedVector<int64_t> (anonymous namespace)::GetStaticOutputShape(gsl::span<const int64_t>, gsl::span<const int64_t>, const logging::Logger &) inferred_shape.size() == coreml_static_shape.size() was false. CoreML static output shape ({1,1,1,2048,1}) and inferred shape ({3200,1}) have different ranks.

kijai commented 6 days ago

So the onnx provider option should include CoreML then? The version currently in the dev branch separates the process into the cropper and the rest anyway.

Grant-CP commented 6 days ago

@kijai My mistake, the cropper is the one with FaceAnalysisDIY, and that fails with CoreML. It's the LandmarkRunner that can handle CoreML, although there's no speedup, and it might just be falling back to CPU anyway. So I would not suggest making CoreML an option at this time. It might be reducing heat production on my MacBook, but that's hard for me to measure, and there's no speedup.

If I set the onnx provider to CoreML (screenshot below), that is when I get the error. I can hardcode the LandmarkRunner to CoreML (regardless of the node's provider choice) and that works, but either choosing CoreML in the node or hardcoding FaceAnalysisDIY to use CoreML results in the same error message above.

(screenshot: the node's onnx provider set to CoreML)

class LandmarkRunner(object):
    """landmark runner"""
    def __init__(self, **kwargs):
        ckpt_path = kwargs.get('ckpt_path')
        onnx_provider = kwargs.get('onnx_provider', 'cuda')  # defaults to cuda
        device_id = kwargs.get('device_id', 0)
        self.dsize = kwargs.get('dsize', 224)
        self.timer = Timer()

        #HACK
        self.session = onnxruntime.InferenceSession(
                ckpt_path, providers=[
                    ('CoreMLExecutionProvider', {'device_id': device_id})
                ]
            )

        # if onnx_provider.lower() == 'cuda':
        #     self.session = onnxruntime.InferenceSession(

Grant-CP commented 6 days ago

Here's the piece of the cropper that errors. If I select "CoreML" in the node or if I hard code it (commented line) I believe the same exact code gets run.

self.face_analysis_wrapper = FaceAnalysisDIY(
            name='buffalo_l',
            root=os.path.join(folder_paths.models_dir, 'insightface'),
            #HACK
            #providers = ['CoreMLExecutionProvider']
            providers=[provider + 'ExecutionProvider',]
        )

cchance27 commented 5 days ago

Sorry, I haven't been around to respond since I opened the ANE/CoreML rabbit hole :)

If you want to monitor ANE and GPU usage, you can use asitop while the run is happening to see what it's executing on during inference.

I get the error below. It sounds to me like CoreML expects statically sized tensors at each step of the way.

Yes, CoreML is normally statically sized. CoreML models can be compiled with a set of supported tensor sizes, or compiled to support a ranged set, but I imagine that might not work through onnx without a pre-conversion.
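
For reference, a ranged-shape conversion in coremltools looks roughly like this (a sketch with a toy model standing in for the real ones, converted directly from torch rather than through onnx; the shapes are made up):

import coremltools as ct
import torch
import torch.nn as nn

net = nn.Conv2d(3, 8, 3, padding=1).eval()                  # stand-in for the real model
traced = torch.jit.trace(net, torch.randn(1, 3, 256, 256))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(
        name="image",
        # accept a range of spatial sizes instead of one static shape
        shape=ct.Shape(shape=(1, 3, ct.RangeDim(64, 1024), ct.RangeDim(64, 1024))),
    )],
)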

Grant-CP commented 5 days ago

@cchance27 Thanks for letting me know about asitop. Looks like it might not totally support sonoma, but I'll give it a try.

I wasn't able to find info about the interaction between static sizes and onnxruntime. My instinct would be to look at the onnx graph/metadata itself, edit it, and see how the error message changes.

I also imagine that the onnx-runtime doesn't get to use the ANE since it sounds like CoreML programs have to be specifically compiled with support for it?

Creative-comfyUI commented 14 hours ago

Is there a solution for the bug "The operator 'aten::grid_sampler_3d' is not currently implemented for the MPS device" without affecting the speed? (Mac M2) Thanks.