hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0
21.54k stars 2.06k forks

ValueError: empty range for randrange() (1, 1, 0) #612

Open JWargrave opened 1 month ago

JWargrave commented 1 month ago

I am training Open-Sora on my own dataset; the bucket_config is as follows:


bucket_config = {
    "1080p":{12:(1.0,1)},
}

49 steps train successfully, then the following error occurs:

Traceback (most recent call last):
  File "/data/apps/yiming_zhao/Open-Sora2/scripts/train.py", line 440, in <module>
    main()
  File "/data/apps/yiming_zhao/Open-Sora2/scripts/train.py", line 315, in main
    mask = mask_generator.get_masks(x)
  File "/data/apps/yiming_zhao/Open-Sora2/opensora/utils/train_utils.py", line 165, in get_masks
    mask = self.get_mask(x)
  File "/data/apps/yiming_zhao/Open-Sora2/opensora/utils/train_utils.py", line 143, in get_mask
    random_size = random.randint(1, condition_frames_max)
  File "/data/apps/miniconda3/envs/zym-os2/lib/python3.9/random.py", line 338, in randint
    return self.randrange(a, b+1)
  File "/data/apps/miniconda3/envs/zym-os2/lib/python3.9/random.py", line 316, in randrange
    raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (1, 1, 0)
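For reference, the error means `random.randint(1, condition_frames_max)` was called with `condition_frames_max == 0`; `randint(1, 0)` expands to `randrange(1, 1)`, which is empty. A minimal reproduction:

```python
import random

# randint(1, 0) -> randrange(1, 1): an empty range, so ValueError is raised.
try:
    random.randint(1, 0)
except ValueError as exc:
    print(exc)
```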
JThh commented 1 month ago

May I know your training configs? Slightly enlarging the batch size or lowering the probability of putting in the 1080p bucket should be able to help. But this is a bug we might want to fix as well. @xyupeng @zhengzangw.
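Assuming the `(keep_prob, batch_size)` tuple convention from the config above, lowering the probability of the 1080p bucket could look like this (0.5 is an arbitrary illustrative value):

```python
# Hypothetical adjustment: keep only ~50% of samples in the 1080p/12-frame
# bucket instead of all of them. Tuple is (keep_prob, batch_size).
bucket_config = {
    "1080p": {12: (0.5, 1)},
}
```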

JWargrave commented 1 month ago

> May I know your training configs? Slightly enlarging the batch size or lowering the probability of putting in the 1080p bucket should be able to help. But this is a bug we might want to fix as well. @xyupeng @zhengzangw.

The training configs are as follows:

# Dataset settings
dataset = dict(
    type="VariableVideoTextDataset",
    transform_name="resize_crop",
)

bucket_config = {
    "1080p":{12:(1.0,1)},
}

grad_checkpoint = True

# Acceleration settings
num_workers = 8
num_bucket_build_workers = 16
dtype = "bf16"
plugin = "zero2"

# Model settings
model = dict(
    type="STDiT3-XL/2",
    from_pretrained='/data/apps/yiming_zhao/Open-Sora2/pretrained_models/OpenSora-STDiT-v3',
    qk_norm=True,
    enable_flash_attn=False,
    enable_layernorm_kernel=False,
    freeze_y_embedder=True,
)
vae = dict(
    type="OpenSoraVAE_V1_2",
    from_pretrained="/data/apps/yiming_zhao/Open-Sora2/pretrained_models/OpenSora-VAE-v1.2",
    micro_frame_size=17,
    micro_batch_size=4,
)
text_encoder = dict(
    type="t5",
    from_pretrained="/data/apps/yiming_zhao/Open-Sora2/pretrained_models/t5-v1_1-xxl",
    model_max_length=300,
    shardformer=False,
)
scheduler = dict(
    type="rflow",
    use_timestep_transform=True,
    sample_method="logit-normal",
)

# Mask settings
mask_ratios = {
    "random": 0.05,
    "intepolate": 0.005,
    "quarter_random": 0.005,
    "quarter_head": 0.005,
    "quarter_tail": 0.005,
    "quarter_head_tail": 0.005,
    "image_random": 0.025,
    "image_head": 0.05,
    "image_tail": 0.025,
    "image_head_tail": 0.025,
}

# Log settings
seed = 42
outputs = "outputs"
wandb = False
epochs = 1000
log_every = 10
ckpt_every = 10

# Optimization settings
load = None
grad_clip = 1.0
lr = 1e-4
ema_decay = 0.99
adam_eps = 1e-15
warmup_steps = 1000

And I also modified the [get_bucket_id](https://github.com/hpcaitech/Open-Sora/blob/bf4d6673af9407650b2ce6250debc2453b82c572/opensora/datasets/bucket.py#L74) function:

def get_bucket_id(self, T, H, W, frame_interval=1, seed=None):  # do not account for image
    item_resolution = H * W
    resolutions_to_select = self.bucket_probs.keys()
    item_hw_id = None

    ############### Plan A ###############
    # for hw_id in resolutions_to_select:
    #     if self.hw_criteria[hw_id] <= item_resolution:
    #         item_hw_id = hw_id
    #         break
    ######################################

    ############### Plan B ###############
    # Walk resolutions from smallest to largest and pick the bucket just
    # below the first criterion that exceeds the item's resolution.
    resolutions_to_select = list(resolutions_to_select)[::-1]
    for r_idx in range(len(resolutions_to_select)):
        hw_id = resolutions_to_select[r_idx]
        if self.hw_criteria[hw_id] > item_resolution:
            if r_idx > 0 or True:  # TODO
                item_hw_id = resolutions_to_select[max(0, r_idx - 1)]
            break

    # Resolution exceeds every criterion: fall back to the largest bucket.
    if item_hw_id is None and r_idx == len(resolutions_to_select) - 1:
        item_hw_id = resolutions_to_select[-1]
    ######################################

    if item_hw_id is None:
        return None

    item_t_id = None
    for t_id in self.t_criteria[item_hw_id]:
        if self.t_criteria[item_hw_id][t_id] <= T * frame_interval:
            item_t_id = t_id
            break
    if item_t_id is None:
        return None

    item_ar_id = get_closest_ratio(H, W, self.ar_criteria[item_hw_id][item_t_id])
    return item_hw_id, item_t_id, item_ar_id

The get_bucket_id function above is much simpler than the original one. It first selects the closest resolution bucket given H and W (every H, W corresponds to 1080p in my setting). Then, if the video has fewer than 12 frames, None is returned. The code above effectively fixes keep_prob at 1, and seed is unused.
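A standalone sketch of the "Plan B" selection above, using hypothetical stand-ins for the criteria tables, shows the intended behavior: anything at or above the smallest criterion that exceeds it falls into the next-larger bucket, and resolutions larger than every criterion fall into the largest one.

```python
# Hypothetical stand-ins for Bucket.hw_criteria (ordered large -> small,
# as in the real bucket config).
hw_criteria = {"1080p": 1920 * 1080, "720p": 1280 * 720}

def pick_hw_id(resolution):
    # Walk resolutions from smallest to largest; choose the bucket just
    # below the first criterion that exceeds the item's resolution.
    ids = list(hw_criteria)[::-1]
    for idx, hw_id in enumerate(ids):
        if hw_criteria[hw_id] > resolution:
            return ids[max(0, idx - 1)]
    return ids[-1]  # larger than every criterion -> largest bucket

print(pick_hw_id(1920 * 1080))  # -> "1080p"
print(pick_hw_id(640 * 480))    # -> "720p"
```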

The batch size cannot be increased any further without causing a CUDA out-of-memory error. And since I want to train the model at a higher resolution, I want to keep keep_prob at 1 all the time.

I ran the training three times, and the ValueError always occurred after the 49th step.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

zhengzangw commented 1 month ago

It seems the problem is with the mask. Perhaps you need a num_frames larger than 17, as our VAE compresses 17 frames to 5.
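If I read this correctly, the arithmetic matches the traceback: with roughly 4x temporal compression (17 -> 5 latent frames), 12 input frames become about 3 latent frames, and a quarter-style mask's upper bound of `3 // 4 = 0` makes `randint(1, 0)` fail. A rough sketch, where the latent-length formula and the quarter-mask bound are my assumptions:

```python
def latent_frames(num_frames, time_compression=4):
    # Assumed causal-VAE temporal formula; it matches 17 -> 5 for this VAE.
    return (num_frames - 1) // time_compression + 1

t_latent = latent_frames(12)     # 12 input frames -> 3 latent frames
quarter_max = t_latent // 4      # assumed quarter-mask upper bound -> 0
# random.randint(1, 0) is then an empty range, hence the ValueError above.
print(t_latent, quarter_max)
```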

github-actions[bot] commented 5 days ago

This issue is stale because it has been open for 7 days with no activity.