Closed ekiwi111 closed 1 year ago
Hi, maybe ffmpeg is not installed. From the README: "Note that in order to display the video properly in HTML, you need to compile ffmpeg manually with H.264." If ffmpeg is not properly installed, video files in H.264 format will not be generated.
I don't believe that's the issue. When the command LD_LIBRARY_PATH=/usr/local/lib /usr/local/bin/ffmpeg -i input.mp4 -vcodec libx264 output.mp4 is executed (with an appropriate input mp4 file), it runs correctly.
Here's the content of debug.log:
2023-04-06 17:18:19,832 - awesome_chat - INFO - ********************************************************************************
2023-04-06 17:18:19,833 - awesome_chat - INFO - input: based on the /examples/a.jpg, please generate a video and audio
2023-04-06 17:18:19,834 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#1 Task Planning Stage: The AI assistant can parse user input to several tasks: [{"task": task, "id": task_id, "dep": dependency_task_id, "args": {"text": text or <GENERATED>-dep_id, "image": image_url or <GENERATED>-dep_id, "audio": audio_url or <GENERATED>-dep_id}}]. The special tag "<GENERATED>-dep_id" refer to the one genereted text/image/audio in the dependency task (Please consider whether the dependency task generates resources of this type.) and "dep_id" must be in "dep" list. The "dep" field denotes the ids of the previous prerequisite tasks which generate a new resource that the current task relies on. The "args" field must in ["text", "image", "audio"], nothing else. The task MUST be selected from the following options: "token-classification", "text2text-generation", "summarization", "translation", "question-answering", "conversational", "text-generation", "sentence-similarity", "tabular-classification", "object-detection", "image-classification", "image-to-image", "image-to-text", "text-to-image", "text-to-video", "visual-question-answering", "document-question-answering", "image-segmentation", "depth-estimation", "text-to-speech", "automatic-speech-recognition", "audio-to-audio", "audio-classification", "canny-control", "hed-control", "mlsd-control", "normal-control", "openpose-control", "canny-text-to-image", "depth-text-to-image", "hed-text-to-image", "mlsd-text-to-image", "normal-text-to-image", "openpose-text-to-image", "seg-text-to-image". There may be multiple tasks of the same type. Think step by step about all the tasks needed to resolve the user\'s request. Parse out as few tasks as possible while ensuring that the user request can be resolved. Pay attention to the dependencies and order among tasks. 
If the user input can\'t be parsed, you need to reply empty JSON [].'}, {'role': 'user', 'content': 'Give you some pictures e1.jpg, e2.png, e3.jpg, help me count the number of sheep?'}, {'role': 'assistant', 'content': '[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "e1.jpg" }}, {"task": "object-detection", "id": 1, "dep": [-1], "args": {"image": "e1.jpg" }}, {"task": "visual-question-answering", "id": 2, "dep": [1], "args": {"image": "<GENERATED>-1", "text": "How many sheep in the picture"}} }}, {"task": "image-to-text", "id": 3, "dep": [-1], "args": {"image": "e2.png" }}, {"task": "object-detection", "id": 4, "dep": [-1], "args": {"image": "e2.png" }}, {"task": "visual-question-answering", "id": 5, "dep": [4], "args": {"image": "<GENERATED>-4", "text": "How many sheep in the picture"}} }}, {"task": "image-to-text", "id": 6, "dep": [-1], "args": {"image": "e3.jpg" }}, {"task": "object-detection", "id": 7, "dep": [-1], "args": {"image": "e3.jpg" }}, {"task": "visual-question-answering", "id": 8, "dep": [7], "args": {"image": "<GENERATED>-7", "text": "How many sheep in the picture"}}]'}, {'role': 'user', 'content': 'Look at /e.jpg, can you tell me how many objects in the picture? Give me a picture and video similar to this one.'}, {'role': 'assistant', 'content': '[{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/e.jpg" }}, {"task": "object-detection", "id": 1, "dep": [-1], "args": {"image": "/e.jpg" }}, {"task": "visual-question-answering", "id": 2, "dep": [1], "args": {"image": "<GENERATED>-1", "text": "how many objects in the picture?" }}, {"task": "text-to-image", "id": 3, "dep": [0], "args": {"text": "<GENERATED-0>" }}, {"task": "image-to-image", "id": 4, "dep": [-1], "args": {"image": "/e.jpg" }}, {"task": "text-to-video", "id": 5, "dep": [0], "args": {"text": "<GENERATED-0>" }}]'}, {'role': 'user', 'content': 'given a document /images/e.jpeg, answer me what is the student amount? 
And describe the image with your voice'}, {'role': 'assistant', 'content': '{"task": "document-question-answering", "id": 0, "dep": [-1], "args": {"image": "/images/e.jpeg", "text": "what is the student amount?" }}, {"task": "visual-question-answering", "id": 1, "dep": [-1], "args": {"image": "/images/e.jpeg", "text": "what is the student amount?" }}, {"task": "image-to-text", "id": 2, "dep": [-1], "args": {"image": "/images/e.jpg" }}, {"task": "text-to-speech", "id": 3, "dep": [2], "args": {"text": "<GENERATED>-2" }}]'}, {'role': 'user', 'content': 'Given an image /example.jpg, first generate a hed image, then based on the hed image generate a new image where a girl is reading a book'}, {'role': 'assistant', 'content': '[{"task": "openpose-control", "id": 0, "dep": [-1], "args": {"image": "/example.jpg" }}, {"task": "openpose-text-to-image", "id": 1, "dep": [0], "args": {"text": "a girl is reading a book", "image": "<GENERATED>-0" }}]'}, {'role': 'user', 'content': "please show me a video and an image of (based on the text) 'a boy is running' and dub it"}, {'role': 'assistant', 'content': '[{"task": "text-to-video", "id": 0, "dep": [-1], "args": {"text": "a boy is running" }}, {"task": "text-to-speech", "id": 1, "dep": [-1], "args": {"text": "a boy is running" }}, {"task": "text-to-image", "id": 2, "dep": [-1], "args": {"text": "a boy is running" }}]'}, {'role': 'user', 'content': 'please show me a joke and an image of cat'}, {'role': 'assistant', 'content': '[{"task": "conversational", "id": 0, "dep": [-1], "args": {"text": "please show me a joke of cat" }}, {"task": "text-to-image", "id": 1, "dep": [-1], "args": {"text": "a photo of cat" }}]'}, {'role': 'user', 'content': 'The chat log [ [] ] may contain the resources I mentioned. Now I input { based on the /examples/a.jpg, please generate a video and audio }. Pay attention to the input and output types of tasks and the dependencies between tasks.'}]
2023-04-06 17:18:23,834 - awesome_chat - DEBUG - {"id":"cmpl-72CImEORT89oJrJWiLw2acE0X57SX","object":"text_completion","created":1680758300,"model":"text-davinci-003","choices":[{"text":"\n[{\"task\": \"image-to-text\", \"id\": 0, \"dep\": [-1], \"args\": {\"image\": \"/examples/a.jpg\" }}, {\"task\": \"text-to-video\", \"id\": 1, \"dep\": [0], \"args\": {\"text\": \"<GENERATED>-0\" }}, {\"task\": \"text-to-speech\", \"id\": 2, \"dep\": [0], \"args\": {\"text\": \"<GENERATED>-0\" }}]","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":1919,"completion_tokens":113,"total_tokens":2032}}
2023-04-06 17:18:23,834 - awesome_chat - INFO - [{"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/examples/a.jpg" }}, {"task": "text-to-video", "id": 1, "dep": [0], "args": {"text": "<GENERATED>-0" }}, {"task": "text-to-speech", "id": 2, "dep": [0], "args": {"text": "<GENERATED>-0" }}]
2023-04-06 17:18:23,834 - awesome_chat - DEBUG - [{'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': '/examples/a.jpg'}}, {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': '<GENERATED>-0'}}, {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': '<GENERATED>-0'}}]
2023-04-06 17:18:23,853 - awesome_chat - DEBUG - Run task: 0 - image-to-text
2023-04-06 17:18:23,853 - awesome_chat - DEBUG - Deps: []
2023-04-06 17:18:23,853 - awesome_chat - DEBUG - parsed task: {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}
2023-04-06 17:18:25,089 - awesome_chat - DEBUG - avaliable models on image-to-text: {'local': ['nlpconnect/vit-gpt2-image-captioning'], 'huggingface': ['microsoft/trocr-base-printed', 'kha-white/manga-ocr-base', 'nlpconnect/vit-gpt2-image-captioning', 'Salesforce/blip-image-captioning-base']}
2023-04-06 17:18:25,089 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#2 Model Selection Stage: Given the user request and the parsed tasks, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The assistant should focus more on the description of the model and find the model that has the most potential to solve requests and tasks. Also, prefer models with local inference endpoints for speed and stability.'}, {'role': 'user', 'content': 'based on the /examples/a.jpg, please generate a video and audio'}, {'role': 'assistant', 'content': "{'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}"}, {'role': 'user', 'content': 'Please choose the most suitable model from [{\'id\': \'nlpconnect/vit-gpt2-image-captioning\', \'inference endpoint\': [\'nlpconnect/vit-gpt2-image-captioning\'], \'likes\': 219, \'description\': \'\\n\\n# nlpconnect/vit-gpt2-image-captioning\\n\\nThis is an image captioning model trained by @ydshieh in [\', \'language\': None, \'tags\': None}, {\'id\': \'microsoft/trocr-base-printed\', \'inference endpoint\': [\'microsoft/trocr-base-printed\', \'kha-white/manga-ocr-base\', \'nlpconnect/vit-gpt2-image-captioning\', \'Salesforce/blip-image-captioning-base\'], \'likes\': 56, \'description\': \'\\n\\n# TrOCR (base-sized model, fine-tuned on SROIE) \\n\\nTrOCR model fine-tuned on the [SROIE dataset](ht\', \'language\': None, \'tags\': None}, {\'id\': \'Salesforce/blip-image-captioning-base\', \'inference endpoint\': [\'microsoft/trocr-base-printed\', \'kha-white/manga-ocr-base\', \'nlpconnect/vit-gpt2-image-captioning\', \'Salesforce/blip-image-captioning-base\'], \'likes\': 44, \'description\': \'\\n\\n# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Ge\', \'language\': None, \'tags\': None}, {\'id\': \'kha-white/manga-ocr-base\', \'inference endpoint\': [\'microsoft/trocr-base-printed\', 
\'kha-white/manga-ocr-base\', \'nlpconnect/vit-gpt2-image-captioning\', \'Salesforce/blip-image-captioning-base\'], \'likes\': 24, \'description\': \'\\n\\n# Manga OCR\\n\\nOptical character recognition for Japanese text, with the main focus being Japanese m\', \'language\': None, \'tags\': None}] for the task {\'task\': \'image-to-text\', \'id\': 0, \'dep\': [-1], \'args\': {\'image\': \'public//examples/a.jpg\'}}. The output must be in a strict JSON format: {"id": "id", "reason": "your detail reasons for the choice"}.'}]
2023-04-06 17:18:27,771 - awesome_chat - DEBUG - {"id":"cmpl-72CIr730hJuLfN3gOFo3iKItF29Dp","object":"text_completion","created":1680758305,"model":"text-davinci-003","choices":[{"text":"\n{\"id\": \"nlpconnect/vit-gpt2-image-captioning\", \"reason\": \"This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes\"}","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":763,"completion_tokens":59,"total_tokens":822}}
2023-04-06 17:18:27,772 - awesome_chat - DEBUG - chosen model: {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes"}
2023-04-06 17:18:28,151 - awesome_chat - DEBUG - inference result: {'generated text': 'a cat sitting on a window sill looking out '}
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Run task: 1 - text-to-video
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Deps: [{"task": {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "public//examples/a.jpg"}}, "inference result": {"generated text": "a cat sitting on a window sill looking out "}, "choose model result": {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes"}}]
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Detect the generated text of dependency task (from results):a cat sitting on a window sill looking out
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - Detect the image of dependency task (from args): public//examples/a.jpg
2023-04-06 17:18:28,370 - awesome_chat - DEBUG - parsed task: {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}
2023-04-06 17:18:28,371 - awesome_chat - DEBUG - Run task: 2 - text-to-speech
2023-04-06 17:18:28,371 - awesome_chat - DEBUG - Deps: [{"task": {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "public//examples/a.jpg"}}, "inference result": {"generated text": "a cat sitting on a window sill looking out "}, "choose model result": {"id": "nlpconnect/vit-gpt2-image-captioning", "reason": "This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes"}}]
2023-04-06 17:18:28,372 - awesome_chat - DEBUG - Detect the generated text of dependency task (from results):a cat sitting on a window sill looking out
2023-04-06 17:18:28,372 - awesome_chat - DEBUG - Detect the image of dependency task (from args): public//examples/a.jpg
2023-04-06 17:18:28,372 - awesome_chat - DEBUG - parsed task: {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}
2023-04-06 17:18:29,585 - awesome_chat - DEBUG - avaliable models on text-to-video: {'local': ['damo-vilab/text-to-video-ms-1.7b'], 'huggingface': []}
2023-04-06 17:18:29,586 - awesome_chat - DEBUG - chosen model: {'id': 'damo-vilab/text-to-video-ms-1.7b', 'reason': 'Only one model available.'}
2023-04-06 17:18:29,635 - awesome_chat - DEBUG - avaliable models on text-to-speech: {'local': ['espnet/kan-bayashi_ljspeech_vits'], 'huggingface': ['facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur']}
2023-04-06 17:18:29,635 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#2 Model Selection Stage: Given the user request and the parsed tasks, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The assistant should focus more on the description of the model and find the model that has the most potential to solve requests and tasks. Also, prefer models with local inference endpoints for speed and stability.'}, {'role': 'user', 'content': 'based on the /examples/a.jpg, please generate a video and audio'}, {'role': 'assistant', 'content': "{'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}"}, {'role': 'user', 'content': 'Please choose the most suitable model from [{\'id\': \'espnet/kan-bayashi_ljspeech_vits\', \'inference endpoint\': [\'espnet/kan-bayashi_ljspeech_vits\'], \'likes\': 70, \'description\': \'\\n## ESPnet2 TTS pretrained model \\n### `kan-bayashi/ljspeech_vits`\\n♻️ Imported from https://zenodo.or\', \'language\': None, \'tags\': None}, {\'id\': \'facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur\', \'inference endpoint\': [\'facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur\'], \'likes\': 14, \'description\': \'\\n## unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur\\n\\nSpeech-to-speech translation mo\', \'language\': None, \'tags\': None}] for the task {\'task\': \'text-to-speech\', \'id\': 2, \'dep\': [0], \'args\': {\'text\': \'a cat sitting on a window sill looking out \'}}. The output must be in a strict JSON format: {"id": "id", "reason": "your detail reasons for the choice"}.'}]
2023-04-06 17:18:32,377 - awesome_chat - DEBUG - {"id":"cmpl-72CIveauNwMazGxHs0hEKlH44qlE4","object":"text_completion","created":1680758309,"model":"text-davinci-003","choices":[{"text":"\n{\"id\": \"espnet/kan-bayashi_ljspeech_vits\", \"reason\": \"This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability\"}","index":0,"logprobs":null,"finish_reason":null}],"usage":{"prompt_tokens":525,"completion_tokens":48,"total_tokens":573}}
2023-04-06 17:18:32,377 - awesome_chat - DEBUG - chosen model: {"id": "espnet/kan-bayashi_ljspeech_vits", "reason": "This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability"}
2023-04-06 17:18:53,582 - awesome_chat - DEBUG - inference result: {'generated audio': '/audios/7540.wav'}
2023-04-06 17:19:06,242 - awesome_chat - DEBUG - inference result: {'generated video': '/videos/da6b.mp4'}
2023-04-06 17:19:06,258 - awesome_chat - DEBUG - {0: {'task': {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}, 'inference result': {'generated text': 'a cat sitting on a window sill looking out '}, 'choose model result': {'id': 'nlpconnect/vit-gpt2-image-captioning', 'reason': 'This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes'}}, 2: {'task': {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated audio': '/audios/7540.wav'}, 'choose model result': {'id': 'espnet/kan-bayashi_ljspeech_vits', 'reason': 'This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability'}}, 1: {'task': {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated video': '/videos/da6b.mp4'}, 'choose model result': {'id': 'damo-vilab/text-to-video-ms-1.7b', 'reason': 'Only one model available.'}}}
2023-04-06 17:19:06,259 - awesome_chat - DEBUG - [{'role': 'system', 'content': '#4 Response Generation Stage: With the task execution logs, the AI assistant needs to describe the process and inference results.'}, {'role': 'user', 'content': 'based on the /examples/a.jpg, please generate a video and audio'}, {'role': 'assistant', 'content': "Before give you a response, I want to introduce my workflow for your request, which is shown in the following JSON data: [{'task': {'task': 'image-to-text', 'id': 0, 'dep': [-1], 'args': {'image': 'public//examples/a.jpg'}}, 'inference result': {'generated text': 'a cat sitting on a window sill looking out '}, 'choose model result': {'id': 'nlpconnect/vit-gpt2-image-captioning', 'reason': 'This model is the most suitable for the task of image-to-text as it is trained by @ydshieh and has the highest number of likes'}}, {'task': {'task': 'text-to-video', 'id': 1, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated video': '/videos/da6b.mp4'}, 'choose model result': {'id': 'damo-vilab/text-to-video-ms-1.7b', 'reason': 'Only one model available.'}}, {'task': {'task': 'text-to-speech', 'id': 2, 'dep': [0], 'args': {'text': 'a cat sitting on a window sill looking out '}}, 'inference result': {'generated audio': '/audios/7540.wav'}, 'choose model result': {'id': 'espnet/kan-bayashi_ljspeech_vits', 'reason': 'This model is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability'}}]. Do you have any demands regarding my response?"}, {'role': 'user', 'content': "Yes. Please first think carefully and directly answer my request based on the inference results. Then please detail your workflow step by step including the used models and inference results for my request in your friendly tone. Please filter out information that is not relevant to my request. 
If any generated files of images, audios or videos in the inference results, must tell me the complete path. If there is nothing in the results, please tell me you can't make it. }"}]
2023-04-06 17:19:12,879 - awesome_chat - DEBUG - {"id":"cmpl-72CJWrjDQwz22fi9HNmLVbojR8X9j","object":"text_completion","created":1680758346,"model":"text-davinci-003","choices":[{"text":"\nI have carefully considered your request and I can generate a video and audio based on the image you provided. I have used three models in my workflow, which are image-to-text, text-to-video and text-to-speech. \n\nFor the image-to-text task, I used the model nlpconnect/vit-gpt2-image-captioning, which is trained by @ydshieh and has the highest number of likes. The inference result is a text description of the image: \"a cat sitting on a window sill looking out\". \n\nFor the text-to-video task, I used the model damo-vilab/text-to-video-ms-1.7b, which is the only model available. The inference result is a video file located at /videos/da6b.mp4. \n\nFor the text-to-speech task, I used the model espnet/kan-bayashi_ljspeech_vits, which is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability. The inference result is an audio file located at /audios/7540.wav.\n\nI hope this information is helpful. Please let me know if you have any other questions. ","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":579,"completion_tokens":277,"total_tokens":856}}
2023-04-06 17:19:12,879 - awesome_chat - INFO - response: I have carefully considered your request and I can generate a video and audio based on the image you provided. I have used three models in my workflow, which are image-to-text, text-to-video and text-to-speech.
For the image-to-text task, I used the model nlpconnect/vit-gpt2-image-captioning, which is trained by @ydshieh and has the highest number of likes. The inference result is a text description of the image: "a cat sitting on a window sill looking out".
For the text-to-video task, I used the model damo-vilab/text-to-video-ms-1.7b, which is the only model available. The inference result is a video file located at /videos/da6b.mp4.
For the text-to-speech task, I used the model espnet/kan-bayashi_ljspeech_vits, which is a pretrained model from ESPnet2 TTS and has an inference endpoint for local speed and stability. The inference result is an audio file located at /audios/7540.wav.
I hope this information is helpful. Please let me know if you have any other questions.
All right. I'll be back later to address that after a short meeting.
Hi, was the problem solved? I'm having a hard time reproducing the problem in my environment.
No, it's still there. It's a clean install on Ubuntu 22.04. Anything I can do to narrow down the scope of the bug?
FileNotFoundError: [Errno 2] No such file or directory: 'public//videos/293f.mp4'
Is this file actually generated and can it be found in public/videos?
You can add these lines at the beginning of run_gradio_demo.py:
import os
os.makedirs("public/images", exist_ok=True)
os.makedirs("public/audios", exist_ok=True)
os.makedirs("public/videos", exist_ok=True)
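To see whether the video file was actually produced before Gradio tries to serve it, a small check like this can help. This is just a debugging sketch run from the repo root; `check_generated` is a hypothetical helper, not part of the project:

```python
import os

def check_generated(path: str) -> str:
    # Normalize doubled slashes like 'public//videos/293f.mp4'.
    norm = os.path.normpath(path)
    if os.path.isfile(norm):
        return "found: %s (%d bytes)" % (norm, os.path.getsize(norm))
    # Distinguish "output directory missing" from "inference never wrote the file".
    parent = os.path.dirname(norm) or "."
    if not os.path.isdir(parent):
        return "missing directory: " + parent
    return "missing file: " + norm

if __name__ == "__main__":
    # Path taken from the FileNotFoundError above.
    print(check_generated("public//videos/293f.mp4"))
```

"missing directory" points to the makedirs fix above; "missing file" means the text-to-video inference itself never wrote the output.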
Server is up and running. Commit SHA: bc66e5a. Using gradio:
python run_gradio_demo.py --config config.gradio.yaml
Running the built-in example: based on the /examples/a.jpg, please generate a video and audio. Gradio terminal output: