[HELP] Why do my videos look so awful?

When I create videos, they look like the one below. I don't think the prompt is the problem. This is the configuration I use (currently at 360p, but the same thing happens at 720p; it's not a resolution problem, I only generate at 360p for speed).

I’ve tried more than 100 prompts: longer, shorter, with more details, with fewer details, simpler, more complex, and basically all of them look very similar to this video. Am I doing something wrong?

Any advice would be appreciated. Thank you very much.

{'aes': 7.0,
 'align': 5,
 'aspect_ratio': '16:9',
 'batch_size': 1,
 'condition_frame_length': 5,
 'config': 'configs/opensora-v1-2/inference/sample.py',
 'dtype': 'bf16',
 'flow': 5.0,
 'fps': 24,
 'frame_interval': 1,
 'model': {'enable_flash_attn': True,
           'enable_layernorm_kernel': True,
           'force_huggingface': True,
           'from_pretrained': 'hpcai-tech/OpenSora-STDiT-v3',
           'qk_norm': True,
           'type': 'STDiT3-XL/2'},
 'multi_resolution': 'STDiT2',
 'num_frames': '120',
 'prompt': ['A cyclist racing down a forested mountain trail. The cyclist '
            'weaves between trees, dodging roots and rocks, with incredible '
            'speed and agility. The trail is narrow and treacherous, with '
            'dense foliage on either side. The scene is a blur of motion, '
            'capturing the adrenaline and challenge of mountain biking.'],
 'prompt_as_path': False,
 'resolution': '360p',
 'save_dir': './samples/samples/',
 'save_fps': 24,
 'scheduler': {'cfg_scale': 7.0,
               'num_sampling_steps': 80,
               'type': 'rflow',
               'use_timestep_transform': True},
 'seed': 44,
 'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
                  'model_max_length': 300,
                  'type': 't5'},
 'vae': {'force_huggingface': True,
         'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
         'micro_batch_size': 4,
         'micro_frame_size': 17,
         'type': 'OpenSoraVAE_V1_2'},
 'watermark': False}

https://github.com/hpcaitech/Open-Sora/assets/64336798/cc21192b-e65a-470e-a65c-786f79820dd4

EDIT: Other examples:

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 120 --resolution 360p --aspect-ratio 16:9 --watermark False --aes 7 --flow 5 --num-sampling-steps 80 --cfg-scale 7 \
  --prompt "A cozy coffe interior on a rainy day. Large windows show the rain falling outside, creating a soothing backdrop. Inside, the café is warm and inviting, with wooden tables, cushioned chairs, and soft lighting. A barista is seen making coffee behind the counter, and patrons are chatting or reading. The atmosphere is relaxed and comforting."

https://github.com/hpcaitech/Open-Sora/assets/64336798/8302946c-2ef9-431b-8622-0fe50396b2d6

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 120 --resolution 360p --aspect-ratio 16:9 --watermark False --aes 7 --flow 5 --num-sampling-steps 80 --cfg-scale 7 --seed 44 \
  --prompt "A peaceful garden with a koi pond. The pond is surrounded by stones and lush greenery, with koi fish swimming gracefully in the clear water. A small wooden bridge arches over the pond, and a stone lantern adds to the tranquil setting. The garden is quiet, with the sound of water gently flowing and birds singing. The atmosphere is serene and meditative."

https://github.com/hpcaitech/Open-Sora/assets/64336798/ee0b2662-87cd-43a1-b3e0-2c999021ff24

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 120 --resolution 360p --aspect-ratio 16:9 --watermark False --aes 7 --flow 5 --num-sampling-steps 80 --cfg-scale 7 \
  --prompt "A tranquil mountain lake surrounded by pine trees. The water is crystal clear, reflecting the surrounding landscape like a mirror. A small wooden pier extends into the lake, with a lone rowboat tied to it. The mountains in the background are majestic, their peaks dusted with snow. The air appears crisp and the scene is calm and serene."

https://github.com/hpcaitech/Open-Sora/assets/64336798/a20bd0a9-a2a0-42ed-89bc-da89b3a97539
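Since the commands above differ only in the prompt and the seed, a small script may make such comparisons easier to repeat. This is a minimal sketch, assuming the same script path, config path and flags as in the commands above; `build_command` and `BASE_FLAGS` are hypothetical names for illustration and are not part of Open-Sora, and the seeds in the usage loop are arbitrary.

```python
import shlex

# Script and config paths copied from the commands above.
SCRIPT = "scripts/inference.py"
CONFIG = "configs/opensora-v1-2/inference/sample.py"

# Flags shared by the example commands above (hypothetical grouping).
BASE_FLAGS = {
    "--num-frames": "120",
    "--resolution": "360p",
    "--aspect-ratio": "16:9",
    "--watermark": "False",
    "--aes": "7",
    "--flow": "5",
    "--num-sampling-steps": "80",
    "--cfg-scale": "7",
}


def build_command(prompt, seed=None, overrides=None):
    """Return the torchrun argument list for one generation run."""
    flags = dict(BASE_FLAGS)
    flags.update(overrides or {})
    if seed is not None:
        flags["--seed"] = str(seed)
    cmd = ["torchrun", "--standalone", "--nproc_per_node", "1", SCRIPT, CONFIG]
    for name, value in flags.items():
        cmd += [name, value]
    cmd += ["--prompt", prompt]
    return cmd


if __name__ == "__main__":
    prompt = "A tranquil mountain lake surrounded by pine trees."
    for seed in (42, 43, 44):
        # Print a copy-pasteable shell command for each seed.
        print(shlex.join(build_command(prompt, seed=seed)))
```

The function returns a list rather than a string so it can be passed directly to `subprocess.run` without shell quoting issues; `shlex.join` is only used for display.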

=========================================================

TIP: If you first generate an image (text-to-image) and then generate the video using that image as a reference, the quality of the videos improves significantly.

For example:

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 1 --resolution 1080p --aspect-ratio 16:9 --watermark False --aes 7 --seed 44 --cfg-scale 7 --sample-name image-cond \
  --prompt "An underwater city inhabited by bioluminescent sea creatures, glowing in the depths of the ocean."

[Image: image-cond_0000]

torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 360p --aspect-ratio 16:9 --watermark False --aes 7 --flow 5 --num-sampling-steps 90 --seed 44 --cfg-scale 7 \
  --prompt 'Create a video capturing the breathtaking beauty of the sunset over the serene lake, with the mountains silhouetted against the colorful sky.{"reference_path": "samples/samples/image-cond_0000.png","mask_strategy": "0"}'

https://github.com/hpcaitech/Open-Sora/assets/64336798/bf278b48-0bb3-4b4c-a2b2-1ebb9537836a
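In the command above, the reference image is passed by appending a small JSON object to the end of the prompt. Below is a minimal sketch of building that string, assuming the format shown above is what the inference script expects; `with_reference` is a hypothetical helper name, not an Open-Sora function, and the example prompt is made up.

```python
import json


def with_reference(prompt, reference_path, mask_strategy="0"):
    """Append the reference-image JSON suffix, mirroring the command above."""
    suffix = json.dumps(
        {"reference_path": reference_path, "mask_strategy": mask_strategy}
    )
    return prompt + suffix


if __name__ == "__main__":
    # Path taken from the image-to-video command above.
    print(with_reference(
        "A sunset over a serene lake.",
        "samples/samples/image-cond_0000.png",
    ))
```

Using `json.dumps` instead of hand-written braces avoids quoting mistakes when the path contains unusual characters. The resulting string still has to be wrapped in single quotes on the shell command line, as in the command above.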
