sabarim / STEm-Seg

This repository contains the official implementation of the paper "STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos"

Conversion of youtubevis annotations to your dataformat #26

Closed codingS3b closed 2 years ago

codingS3b commented 2 years ago

I would love to try out your tool, but I am struggling to convert my JSON annotation file into the format expected by your code so that I can pass my data in.

Currently my JSON file follows the format of the YouTube-VIS challenge. That is, after applying json.load, the dictionary has the following structure:

'videos': [vid_1, vid_2, ...], # length equal to the number of videos
'annotations': [ann_1, ann_2, ...],  # length equal to the total number of instances in all the videos
'categories': [cat_1, cat_2, ....]  # length equal to the number of categories in the dataset

where the structure for each video looks like

'id': int,
'width': int,
'height': int,
'length': int,
'file_names': [f_1, f_2, ...], # list of strings containing relative paths

and the structure for each annotation looks like

'iscrowd': int (0 or 1),
'id': int,
'video_id': int,
'category_id': int,
'segmentations': [seg_1, seg_2, ....], # the segmentation of this instance in each frame of the video, either None or a dict with 'counts' and 'size' for the runlength encoded segmentation mask
'areas': [area_1, area_2, ...],  # the area of this instance in each frame of the video, either None or an int
'bboxes': [box_1, box_2, ....],  # the coordinates of this instance in each frame of the video, either None or a list with 4 entries

and the structure for each category looks like

'id': int,
'name': str,
'supercategory': str
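Put together, the YouTube-VIS file described above can be loaded and regrouped per video with plain json — a minimal sketch using only the field names listed above (the helper name load_ytvis is my own, not from either codebase):

```python
import json
from collections import defaultdict

def load_ytvis(path):
    """Load a YouTube-VIS style annotation file and group annotations per video."""
    with open(path) as f:
        data = json.load(f)

    # Group the flat 'annotations' list by the video each instance belongs to.
    anns_per_video = defaultdict(list)
    for ann in data["annotations"]:
        anns_per_video[ann["video_id"]].append(ann)

    # Map category id -> human-readable name.
    cat_names = {c["id"]: c["name"] for c in data["categories"]}
    return data["videos"], anns_per_video, cat_names
```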

Do you have a script at hand or maybe some tips how to convert that into the file format you expect?

I would greatly appreciate your help here!

codingS3b commented 2 years ago

After looking a bit deeper, I think I can fill in most of the fields myself. However, I'm puzzled about what kind of data is actually expected for the segmentations here (isn't .encode('utf8') usually called on strings?).

To sum it up, I think the structure expected is as follows:

'meta': 
  'category_labels': dict # maps category ids to category names
'sequences': list of dicts  # length is equal to the number of videos in the dataset

# Now each dict of the 'sequences' list (i.e. the information of a single video) has format
'id': int,
'width': int,
'height': int,
'image_paths': list of str,
'categories': dict, # maps each instance id occurring in the video to its category id
'segmentations': list of dict # length equal to the number of frames in the video

# Now each dict of the 'segmentations' list (i.e. the information of a single frame in the video) 
# maps instance ids to some value that I'm not sure about

@Ali2500, are my assumptions correct? If yes, I would only need a pointer on how to correctly format the 'segmentations' entries (or more the values in the dictionaries).

Ali2500 commented 2 years ago

Hi,

The RLE-encoded mask returned by pycocotools is in binary format, if I recall correctly, so you need to call .decode("utf-8") on it before dumping it to JSON.

So given a numpy array mask of type uint8, the entry in the segmentations field for this frame and instance would be: pycocotools.mask.encode(np.asfortranarray(mask))["counts"].decode("utf-8"). For efficiency, we only store the actual RLE encoding and not the full dict returned by pycocotools since the image dimensions are the same across a video.

Ali2500 commented 2 years ago

I unfortunately don't have a conversion script at hand, but it seems you deciphered the format correctly. If I remember correctly, the bboxes and areas fields aren't used anywhere in the final code (though it doesn't hurt to have them).
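For reference, a minimal conversion sketch along the lines discussed in this thread. The target field names are taken from the comments above, and each per-frame segmentation is assumed to already be an RLE dict whose 'counts' is a UTF-8 string (polygon-style segmentations would first need to be converted, e.g. via pycocotools); the function name is hypothetical:

```python
def ytvis_to_stemseg(ytvis):
    """Restructure a YouTube-VIS style dict into the format sketched above.

    Assumes each entry in an annotation's 'segmentations' list is either None
    or an RLE dict with a 'counts' string, and keeps YouTube-VIS instance ids.
    """
    sequences = []
    for vid in ytvis["videos"]:
        anns = [a for a in ytvis["annotations"] if a["video_id"] == vid["id"]]
        # One dict per frame, mapping instance id -> RLE counts string.
        segmentations = [{} for _ in range(vid["length"])]
        categories = {}
        for ann in anns:
            categories[ann["id"]] = ann["category_id"]
            for t, seg in enumerate(ann["segmentations"]):
                if seg is not None:
                    segmentations[t][ann["id"]] = seg["counts"]
        sequences.append({
            "id": vid["id"],
            "width": vid["width"],
            "height": vid["height"],
            "image_paths": vid["file_names"],
            "categories": categories,
            "segmentations": segmentations,
        })
    return {
        "meta": {"category_labels": {c["id"]: c["name"]
                                     for c in ytvis["categories"]}},
        "sequences": sequences,
    }
```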

codingS3b commented 2 years ago

Thanks for your help @Ali2500, I think I now managed to get your format right. At least using the GenericVideoSequence class seems to work out fine!