neeek2303 / EMOPortraits

Official implementation of EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars

missing files for training / confusion on va forward pass #27

Open johndpope opened 1 week ago

johndpope commented 1 week ago
    image_list = [f'{args.project_dir}/data/one.png', f'{args.project_dir}/data/ton_512.png', f'{args.project_dir}/data/two.png',
                  f'{args.project_dir}/data/asim_512.png']
    mask_list = [f'{args.project_dir}/data/j1_mask.png', f'{args.project_dir}/data/j1_mask.png', f'{args.project_dir}/data/j1_mask.png',
                 f'{args.project_dir}/data/j1_mask.png']

do these have alpha channels?
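For my own sanity I'm checking the alpha channel question with something like this (assumes Pillow is installed, and reuses image_list / mask_list from the snippet above):

    from PIL import Image

    # print the mode of each image; 'RGBA' / 'LA' / 'PA' means an alpha channel is present
    for path in image_list + mask_list:
        img = Image.open(path)
        has_alpha = img.mode in ('RGBA', 'LA', 'PA')
        print(path, img.mode, 'alpha' if has_alpha else 'no alpha')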

I spent yesterday going through infer.py + va.py, and I'm confused about why the inference code never actually calls def forward -> G_forward anywhere on the va.py model. Did someone else write this inference code? It seems overcomplicated...

These are the interactions with the model from infer.py:

It seems like G_forward_old was an attempt to consolidate this logic - I've put a rough sketch of what that consolidation might look like after the list below.

    face_mask_source, _, _, cloth_s = self.model.face_idt.forward(source_img_crop)
    self.idt_embed = self.model.idt_embedder_nw.forward_image(source_img_crop * source_img_mask)
    source_latents = self.model.local_encoder_nw(source_img_crop * source_img_mask)
    pred_source_theta = self.model.head_pose_regressor.forward(source_img_crop)
    grid = self.model.identity_grid_3d.repeat_interleave(1, dim=0)
    source_warp_embed_dict, _, _, embed_dict = self.model.predict_embed(data_dict)
    xy_gen_outputs = self.model.xy_generator_nw(source_warp_embed_dict)
    pred_target_theta, scale, rotation, translation = self.model.head_pose_regressor.forward(driver_img_crop, True)
    source_xy_warp_resize = self.model.resize_warp_func(
    target_latent_volume = self.model.grid_sample(
    self.target_latent_volume = self.model.volume_process_nw(self.target_latent_volume_1, embed_dict)
    grid = self.model.identity_grid_3d.repeat_interleave(1, dim=0)
    data_dict = self.model.expression_embedder_nw(data_dict, True, False)
    _, target_warp_embed_dict, _, embed_dict = self.model.predict_embed(data_dict)
    target_uv_warp, data_dict['target_delta_uv'] = self.model.uv_generator_nw(target_warp_embed_dict)
    target_uv_warp_resize = self.model.resize_warp_func(target_uv_warp_resize)
    aligned_target_volume = self.model.grid_sample(
    img, _, deep_f, img_f = self.model.decoder_nw(data_dict, embed_dict, target_latent_feats, False,
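For what it's worth, this is a rough sketch of how I'd expect that consolidation to look - just the calls from the list above split into a one-time source pass and a per-frame driver pass. The truncated calls are left as comment placeholders because I don't know their full argument lists, so this is not the actual implementation:

    # Sketch only: group the infer.py calls into a cached source pass and a per-frame driver pass.
    class ConsolidatedForward:
        def __init__(self, model):
            self.model = model

        def process_source(self, source_img_crop, source_img_mask, data_dict):
            # run once per source image, cache the results
            face_mask_source, _, _, cloth_s = self.model.face_idt.forward(source_img_crop)
            self.idt_embed = self.model.idt_embedder_nw.forward_image(source_img_crop * source_img_mask)
            source_latents = self.model.local_encoder_nw(source_img_crop * source_img_mask)
            pred_source_theta = self.model.head_pose_regressor.forward(source_img_crop)
            source_warp_embed_dict, _, _, embed_dict = self.model.predict_embed(data_dict)
            xy_gen_outputs = self.model.xy_generator_nw(source_warp_embed_dict)
            # ... resize_warp_func / grid_sample / volume_process_nw as in infer.py
            ...

        def drive_frame(self, driver_img_crop, data_dict):
            # run per driver frame against the cached source volume
            pred_target_theta, scale, rotation, translation = self.model.head_pose_regressor.forward(driver_img_crop, True)
            data_dict = self.model.expression_embedder_nw(data_dict, True, False)
            _, target_warp_embed_dict, _, embed_dict = self.model.predict_embed(data_dict)
            target_uv_warp, data_dict['target_delta_uv'] = self.model.uv_generator_nw(target_warp_embed_dict)
            # ... resize_warp_func / grid_sample / decoder_nw as in infer.py
            ...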

The other thing I'm not certain about is the MegaPortraits implementation:

"These losses are calculated using only foreground regions in both predictions and the ground truth."

I'm attempting to achieve high FPS for recreating the VASA paper. infer.py seems to hit around 14 fps.
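For reference, this is roughly how I'm measuring it - a minimal timing loop, where run_one_frame is a placeholder for whatever pushes a single driver frame through the model:

    import time

    def benchmark_fps(run_one_frame, n_frames=100, warmup=10):
        # warm up so model loading / CUDA init doesn't skew the numbers
        for _ in range(warmup):
            run_one_frame()
        start = time.perf_counter()
        for _ in range(n_frames):
            run_one_frame()
        # on GPU, call torch.cuda.synchronize() here before reading the timer
        return n_frames / (time.perf_counter() - start)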

Is the Gbase supposed to have MODNet baked in so it's always extracting the masks? Did EMO add the face parsing? Could it be slowing things down a lot? UPDATE - I did find the ModNet reference in the paper - https://github.com/johndpope/MegaPortrait-hack/issues/59
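To check whether the mask extraction is actually the bottleneck, I'm planning to wrap the individual calls in a simple timer like this (just a sketch - the commented usage shows the kind of call I'd wrap, not a confirmed API):

    import time
    from contextlib import contextmanager

    @contextmanager
    def timed(name):
        # prints wall-clock time for the wrapped block; on GPU, also call
        # torch.cuda.synchronize() before reading the timer for accurate numbers
        start = time.perf_counter()
        yield
        print(f'{name}: {(time.perf_counter() - start) * 1000:.1f} ms')

    # usage (hypothetical - wrap whichever call does the mask / face parsing):
    # with timed('face_idt / mask extraction'):
    #     face_mask_source, _, _, cloth_s = self.model.face_idt.forward(source_img_crop)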

Was there ever any MegaPortraits FPS benchmarking? I thought it could do inference in real time - or maybe that's just VASA.