joonson / syncnet_python

Out of time: automated lip sync in the wild
MIT License

Multi-speaker, multi-shot, movie-like videos #38

Closed hvishal512 closed 2 years ago

hvishal512 commented 3 years ago

Hi, @joonson

Thanks for open-sourcing this amazing work. I'm able to test the pre-trained SyncNet model on a single-speaker, single-shot video. However, on a video with two or more speakers and multiple scenes, run_pipeline.py extracts the frames into the REFERENCE folder, but the pycrop folder stays empty. The empty pycrop folder is probably why the SyncNet model loads but produces no output when run_syncnet.py is run.

I came across an earlier issue in this repo about multi-speaker detection, and it was clarified there that the model does work on multi-speaker frames. But when I run run_pipeline.py on my video, it is not able to detect multiple speakers and keep track of them across scenes (pycrop stays empty). Can you please share some insight on what I might do to fix this? And first of all, is it even possible to predict the AV offset using SyncNet on movie-like videos of this kind? Thank you.

hrzisme commented 3 years ago

This is a hint that some of your parameters have not been adjusted properly.

sunotsue commented 2 years ago

Did you figure this out? My pycrop folder is also empty, and the offset.txt file doesn't get created either.

hvishal512 commented 2 years ago

Hi @sunotsue, as @hrzisme suggested, setting the right parameters results in the pycrop folder having output. If I remember correctly, it was a parameter that controls the length of the cropped video clips. Try setting it higher (5 s+, or whichever value works for you).
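If you want to experiment with those parameters without editing the script, you can pass them on the command line. A minimal sketch, assuming the flag names from the argparse definitions quoted later in this thread; the video path and reference name here are placeholders, and depending on your footage you may need to raise or lower the values:

```python
# Sketch: building a run_pipeline.py invocation with adjusted parameters.
# Flag names come from the script's argparse block; "myvideo.mp4" and
# "myvideo" are placeholder values, not taken from this thread.
import subprocess

cmd = [
    "python", "run_pipeline.py",
    "--videofile", "myvideo.mp4",  # placeholder input video
    "--reference", "myvideo",      # placeholder reference name
    "--min_track", "50",           # minimum facetrack duration (default 100)
    "--min_face_size", "50",       # minimum face size in pixels (default 100)
]
# subprocess.call(cmd)  # uncomment to actually run the pipeline
print(" ".join(cmd))
```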

dizhenx commented 1 year ago

How to solve it?

dizhenx commented 1 year ago

my pycrop folder is also empty and the offset.txt file also doesn't get created

```python
parser.add_argument('--data_dir',       type=str,   default='data/work', help='Output directory')
parser.add_argument('--videofile',      type=str,   default='',   help='Input video file')
parser.add_argument('--reference',      type=str,   default='',   help='Video reference')
parser.add_argument('--facedet_scale',  type=float, default=0.25, help='Scale factor for face detection')
parser.add_argument('--crop_scale',     type=float, default=0.40, help='Scale bounding box')
parser.add_argument('--min_track',      type=int,   default=100,  help='Minimum facetrack duration')
parser.add_argument('--frame_rate',     type=int,   default=25,   help='Frame rate')
parser.add_argument('--num_failed_det', type=int,   default=25,   help='Number of missed detections allowed before tracking is stopped')
parser.add_argument('--min_face_size',  type=int,   default=100,  help='Minimum face size in pixels')
```

I don't see a parameter that controls the length of the cropped video clips. Also, my video was no longer than 5 seconds and the issue still occurred, so it may not be related to the length of the video. How can this be solved?
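For what it's worth, the quoted defaults do imply a minimum clip length. If `--min_track` (help text: "Minimum facetrack duration") is counted in frames, then at the default `--frame_rate` of 25 fps a face track must last at least 100 / 25 = 4 seconds to survive, which would explain an empty pycrop on short videos or heavily cut scenes. A small sketch of that arithmetic; the frames interpretation is an assumption from the help text, not verified against the source:

```python
# Assumption: --min_track counts frames, so the minimum surviving
# face-track duration in seconds is min_track / frame_rate.
min_track = 100   # default from the argparse block above
frame_rate = 25   # default --frame_rate

min_duration_s = min_track / frame_rate
print(min_duration_s)  # 4.0 seconds

# A 3-second continuous face track is only 75 frames, so it would be
# dropped and nothing would be written to pycrop.
track_frames = 3 * frame_rate
print(track_frames >= min_track)  # False
```

Under that reading, lowering `--min_track` (and possibly `--min_face_size`) should let shorter or smaller face tracks through.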