pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Custom Coco DataPipe #7147

Open austinmw opened 1 year ago

austinmw commented 1 year ago

πŸ“š The doc issue

Hi, is it possible to subclass/reuse torchvision.prototype.datasets.Coco to create a custom COCO-format DataPipe? For example, if I have the following data:

data/vision/tiny_coco
β”œβ”€β”€ instances_train2017_small.json
└── train_2017_small
    β”œβ”€β”€ 000000005802.jpg
    β”œβ”€β”€ 000000060623.jpg
    β”œβ”€β”€ 000000118113.jpg
    β”œβ”€β”€ 000000184613.jpg
    β”œβ”€β”€ 000000193271.jpg
    β”œβ”€β”€ 000000222564.jpg
    β”œβ”€β”€ 000000224736.jpg
    β”œβ”€β”€ 000000309022.jpg
    β”œβ”€β”€ 000000318219.jpg
    β”œβ”€β”€ 000000374628.jpg
    β”œβ”€β”€ 000000391895.jpg
    β”œβ”€β”€ 000000403013.jpg
    β”œβ”€β”€ 000000483108.jpg
    β”œβ”€β”€ 000000522418.jpg
    β”œβ”€β”€ 000000554625.jpg
    └── 000000574769.jpg

Suggest a potential alternative/fix

I think a lot of people could benefit from understanding how to reuse common-format classes for DataPipes.

cc @pmeier @bjuncek

pmeier commented 1 year ago

Hey @austinmw. If I understand your request correctly, you want to re-use the datapipes that we have built for COCO, but with your own data, i.e. you provide the same structure as the original dataset, just with different files. Right?

If yes, I don't think there is a non-hacky way so far. We have currently frozen all development on the prototype datasets as we are focusing on the transforms revamp #6753. Plus, we also need to figure out some performance issues with the prototype datasets before we continue there. Meaning, we don't know when, or even if, we can implement a feature like that.

In the meantime, if you are comfortable with a hacky solution, I suggest subclassing

https://github.com/pytorch/vision/blob/7cf0f4cc1801ff1892007c7a11f7c35d8dfb7fd0/torchvision/prototype/datasets/_builtin/coco.py#L43

and overriding the

https://github.com/pytorch/vision/blob/7cf0f4cc1801ff1892007c7a11f7c35d8dfb7fd0/torchvision/prototype/datasets/_builtin/coco.py#L90

method. You can find the implementation of OnlineResource here:

https://github.com/pytorch/vision/blob/7cf0f4cc1801ff1892007c7a11f7c35d8dfb7fd0/torchvision/prototype/datasets/utils/_resource.py#L28

Subclass that and point its file_name to instances_train2017_small.json and train_2017_small, respectively. If the file or directory already exists, all the download logic is bypassed and the loading is triggered directly:

https://github.com/pytorch/vision/blob/7cf0f4cc1801ff1892007c7a11f7c35d8dfb7fd0/torchvision/prototype/datasets/utils/_resource.py#L59-L69

If your data structure matches the one from COCO, you should be good to go.
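The load-or-download behavior described above boils down to a local-file check before any network access. Here is a generic sketch of that pattern with illustrative names only, not torchvision's actual API:

```python
import pathlib

def load_resource(root, file_name, download):
    """Return the local path of a resource, downloading only if it is missing.

    Generic sketch of the bypass described above; `download` stands in for
    whatever network logic a concrete resource implements.
    """
    path = pathlib.Path(root) / file_name
    if not path.exists():
        download(path)  # skipped entirely when the file is already on disk
    return path
```

If I read the thread correctly, ManualDownloadResource's download step simply raises with the given instructions, which is why placing the files in root beforehand makes everything work.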

austinmw commented 1 year ago

Hey @pmeier, thanks, yep, you understood correctly; that's exactly what I'd like to do. I hope that a feature to use our own data while reusing datapipes (such as COCO's) will be available in the future!

For anyone who might be interested, I got this working using this dataset: https://github.com/austinmw/tiny-coco

And the following code:

import os
import re
import csv
import pathlib
from collections import OrderedDict
from typing import Any, Dict, List, Sequence, Tuple, Union

import requests
import torch
from torchvision.prototype import datasets
from torchvision.prototype.datasets import register_info, register_dataset, Coco
from torchvision.prototype.datapoints import BoundingBox, Label, Mask
from torchvision.prototype.datapoints._datapoint import Datapoint
from torchvision.prototype.datasets.utils import ManualDownloadResource, OnlineResource
from torchdata.datapipes.iter import (
    Demultiplexer,
    Filter,
    Grouper,
    IterDataPipe,
    IterKeyZipper,
    JsonParser,
    Mapper,
    UnBatcher,
)
from torchvision.prototype.datasets.utils._internal import (
    getitem,
    hint_sharding,
    hint_shuffling,
    INFINITE_BUFFER_SIZE,
    MappingIterator,
    path_accessor,
)

NAME = "tiny_coco"

WEB_DIR = "https://raw.githubusercontent.com/pytorch/vision/main/torchvision/prototype/datasets/_builtin"

def read_categories_file(name: str) -> List[Union[str, Sequence[str]]]:
    path = f"{WEB_DIR}/{name}.categories"  # URL join, not a filesystem path
    decoded_content = requests.get(path).content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    rows = list(cr)
    return rows

@register_info(NAME)
def _info() -> Dict[str, Any]:
    categories, super_categories = zip(*read_categories_file("coco"))
    return dict(categories=categories, super_categories=super_categories)

@register_dataset(NAME)
class TinyCoco(Coco):

    def _resources(self) -> List[OnlineResource]:
        images_resource = ManualDownloadResource(file_name="train_2017_small", instructions='Blah')
        meta_resource = ManualDownloadResource(file_name="instances_train2017_small.json", instructions='blah')
        resource_dp = [images_resource, meta_resource]
        return resource_dp

    def _segmentation_to_mask(
        self, segmentation: Any, *, is_crowd: bool, spatial_size: Tuple[int, int]
    ) -> torch.Tensor:
        from pycocotools import mask

        if is_crowd:
            segmentation = mask.frPyObjects(segmentation, *spatial_size)
        else:
            segmentation = mask.merge(mask.frPyObjects(segmentation, *spatial_size))

        return torch.from_numpy(mask.decode(segmentation)).to(torch.bool)

    def _decode_instances_anns(self, anns: List[Dict[str, Any]], image_meta: Dict[str, Any]) -> Dict[str, Any]:
        spatial_size = (image_meta["height"], image_meta["width"])
        labels = [ann["category_id"] for ann in anns]
        return dict(
            segmentations=Mask(
                torch.stack(
                    [
                        self._segmentation_to_mask(
                            ann["segmentation"], is_crowd=ann["iscrowd"], spatial_size=spatial_size
                        )
                        for ann in anns
                    ]
                )
            ),
            areas=Datapoint([ann["area"] for ann in anns]),
            crowds=Datapoint([ann["iscrowd"] for ann in anns], dtype=torch.bool),
            bounding_boxes=BoundingBox(
                [ann["bbox"] for ann in anns],
                format="xywh",
                spatial_size=spatial_size,
            ),
            labels=Label(labels, categories=self._categories),
            super_categories=[self._category_to_super_category[self._categories[label]] for label in labels],
            ann_ids=[ann["id"] for ann in anns],
        )

    def _decode_captions_ann(self, anns: List[Dict[str, Any]], image_meta: Dict[str, Any]) -> Dict[str, Any]:
        return dict(
            captions=[ann["caption"] for ann in anns],
            ann_ids=[ann["id"] for ann in anns],
        )

    _ANN_DECODERS = OrderedDict(
        [
            ("instances", _decode_instances_anns),
            ("captions", _decode_captions_ann),
        ]
    )

    _META_FILE_PATTERN = re.compile(
        rf"(?P<annotations>({'|'.join(_ANN_DECODERS.keys())}))_(?P<split>[a-zA-Z]+)(?P<year>\d+)_small[.]json"
    )

    def _filter_meta_files(self, data: Tuple[str, Any]) -> bool:
        match = self._META_FILE_PATTERN.match(pathlib.Path(data[0]).name)
        return bool(
            match
            and match["split"] == self._split
            and match["year"] == self._year
            and match["annotations"] == self._annotations
        )

    def _datapipe(self, resource_dps: List[IterDataPipe]) -> IterDataPipe[Dict[str, Any]]:
        images_dp, meta_dp = resource_dps

        if self._annotations is None:
            dp = hint_shuffling(images_dp)
            dp = hint_sharding(dp)
            dp = hint_shuffling(dp)
            return Mapper(dp, self._prepare_image)

        meta_dp = Filter(meta_dp, self._filter_meta_files)
        meta_dp = JsonParser(meta_dp)
        meta_dp = Mapper(meta_dp, getitem(1))
        meta_dp: IterDataPipe[Dict[str, Dict[str, Any]]] = MappingIterator(meta_dp)

        images_meta_dp, anns_meta_dp = Demultiplexer(
            meta_dp,
            2,
            self._classify_meta,
            drop_none=True,
            buffer_size=INFINITE_BUFFER_SIZE,
        )

        images_meta_dp = Mapper(images_meta_dp, getitem(1))
        images_meta_dp = UnBatcher(images_meta_dp)

        anns_meta_dp = Mapper(anns_meta_dp, getitem(1))
        anns_meta_dp = UnBatcher(anns_meta_dp)
        anns_meta_dp = Grouper(anns_meta_dp, group_key_fn=getitem("image_id"), buffer_size=INFINITE_BUFFER_SIZE)
        anns_meta_dp = hint_shuffling(anns_meta_dp)
        anns_meta_dp = hint_sharding(anns_meta_dp)

        anns_dp = IterKeyZipper(
            anns_meta_dp,
            images_meta_dp,
            key_fn=getitem(0, "image_id"),
            ref_key_fn=getitem("id"),
            buffer_size=INFINITE_BUFFER_SIZE,
        )

        dp = IterKeyZipper(
            anns_dp,
            images_dp,
            key_fn=getitem(1, "file_name"),
            ref_key_fn=path_accessor("name"),
            buffer_size=INFINITE_BUFFER_SIZE,
        )
        return Mapper(dp, self._prepare_sample)

print(datasets.list_datasets())
print(datasets.info('tiny_coco'))

dp = datasets.load("tiny_coco", root="data/vision/tiny_coco")

next(iter(dp))

It feels like quite a lot of effort just to get a COCO TorchData pipeline though, and I'm not sure many people would go this route πŸ˜…

pmeier commented 1 year ago

> It feels like quite a lot of effort just to get a COCO TorchData pipeline though, and I'm not sure many people would go this route πŸ˜…

I think you made it more complicated than it has to be:

  1. Unless you have changed something about the internals, I don't see a reason to keep all the other methods besides _resources. Given that you subclass the original dataset, they are already implemented and should work without modification.
  2. You don't need to register your dataset or the accompanying info. datasets.load only provides an interface to load a dataset by name; you can also instantiate the class the "regular" way:

    https://github.com/pytorch/vision/blob/7cf0f4cc1801ff1892007c7a11f7c35d8dfb7fd0/torchvision/prototype/datasets/_api.py#L59-L65

So, unless I'm missing something, the implementation should look like

class TinyCoco(Coco):

    def _resources(self) -> List[OnlineResource]:
        images_resource = ManualDownloadResource(file_name="train_2017_small", instructions='Blah')
        meta_resource = ManualDownloadResource(file_name="instances_train2017_small.json", instructions='blah')
        resource_dp = [images_resource, meta_resource]
        return resource_dp

dataset = TinyCoco("data/vision/tiny_coco")

And that is not too bad. Of course, it doesn't solve the general use case of arbitrary input pipes, but for data that you have locally, it should be sufficient.

austinmw commented 1 year ago

Oh, so I think I ran into problems because my JSON file is named instances_train2017_small.json. Therefore I needed to edit _META_FILE_PATTERN to allow the _small suffix, which in turn required me to copy/paste _ANN_DECODERS, _decode_instances_anns, and _decode_captions_ann.

I didn't need to paste _decoder, _filter_meta_files, or _segmentation_to_mask.
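To see why the file name mattered: here is a self-contained check against a pattern equivalent to the stock _META_FILE_PATTERN, reconstructed for illustration from the modified version shown earlier by dropping the _small suffix:

```python
import re

# Reconstruction of the default meta-file pattern for illustration:
# {annotations}_{split}{year}.json, with no room for extra suffixes.
DEFAULT_META_FILE_PATTERN = re.compile(
    r"(?P<annotations>(instances|captions))_(?P<split>[a-zA-Z]+)(?P<year>\d+)[.]json"
)

# The custom file name does not match, so _filter_meta_files drops it...
assert DEFAULT_META_FILE_PATTERN.match("instances_train2017_small.json") is None
# ...while the stock COCO name is picked up as expected.
assert DEFAULT_META_FILE_PATTERN.match("instances_train2017.json") is not None
```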

I guess the simplest thing to do is just to stick to the default json naming convention. Thanks for your help!

My much shorter version is now:

import os
import csv
import requests
from typing import Any, List, Union, Sequence, Dict

from torchvision.prototype import datasets
from torchvision.prototype.datasets import register_info, register_dataset, Coco
from torchvision.prototype.datasets.utils import ManualDownloadResource

NAME = "tiny_coco"

@register_info(NAME)
def _info() -> Dict[str, Any]:
    return datasets.info('coco')

@register_dataset(NAME)
class TinyCoco(Coco):
    def _resources(self) -> List[ManualDownloadResource]:
        images_resource = ManualDownloadResource(
            file_name="train_2017_small", 
            instructions="Download from https://github.com/austinmw/tiny-coco"
        )
        meta_resource = ManualDownloadResource(
            file_name="instances_train2017.json", 
            instructions="Download from https://github.com/austinmw/tiny-coco"
        )
        resource_dp = [images_resource, meta_resource]
        return resource_dp

assert NAME in datasets.list_datasets()

dp = datasets.load("tiny_coco", root="data/vision/tiny_coco")
next(iter(dp))

Would it make sense at all to add a feature request to create a GitHubResource, or SubversionResource as additional download routes?

pmeier commented 1 year ago

> I guess the simplest thing to do is just to stick to the default json naming convention.

Yeah, I implied this above when I said you should keep the structure the same. Since we need a way to tell the files apart, file or folder names are often the way to go. On the bright side, I think we never rely on the order of files, so we never just take the first file and hope it is the right one. Meaning, you can leave out files that we don't actually use.

Would it make sense at all to add a feature request to create a GitHubResource, or SubversionResource as additional download routes?

GitHub already provides direct HTTP links for all files, so HttpResource should suffice. For example:

meta_resource = HttpResource(
    "https://raw.githubusercontent.com/austinmw/tiny-coco/master/small_coco/instances_train2017.json"
)

For the images it is a little trickier in your current setup. IMO, the best option would be to zip them like the original dataset does. That way you can use the same technique as above, but for the zip of the images.

If you don't want to do that, something like a GitHubFolderResource might be an option, which gets pointed to a folder and then either traverses it (not sure if the GH API allows that) or just clones the repo and moves the files. That being said, GH repos are really bad at hosting large amounts of binary data, and thus regular datasets are hosted elsewhere. Meaning, it makes little sense for us to provide such a utility in our library.

austinmw commented 1 year ago

Thanks again, that makes sense. Btw I'm currently downloading the data like this:

svn checkout https://github.com/austinmw/tiny-coco.git/trunk/small_coco ./tiny_coco

pmeier commented 1 year ago

🀯 I was not aware GH even had an API for SVN. You might want to switch to git, as the SVN API is deprecated and will be removed in roughly a year.

austinmw commented 1 year ago

Ahh, thanks for the heads up. That's a shame... I couldn't find an analogous one-liner for downloading a specific folder with git.

pmeier commented 1 year ago

The blog post I linked contains the following paragraph:

> Why do people still use Subversion on GitHub, anyhow? Besides simple inertia, there were workflows which Git didn’t support until recently. The main thing we heard when speaking with customers and communities was checking out a subset of the repository–a single directory, or only the latest commit. I have good news: with sparse checkout, sparse index, and partial clone, Git can now do a pretty decent job at these workflows.

I'm no expert on this, but it seems one of these commands was built to do just what you need.
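For reference, a rough git equivalent of the svn one-liner above, combining partial clone with sparse checkout (the repo URL and small_coco path are taken from the thread; requires a reasonably recent git):

```shell
# Clone without file contents and check out only the small_coco directory
git clone --filter=blob:none --sparse https://github.com/austinmw/tiny-coco.git tiny-coco
cd tiny-coco
git sparse-checkout set small_coco
```

Unlike the svn command, this keeps the repo's directory prefix (small_coco/ inside tiny-coco/), so the files may need one extra move to match the layout used earlier.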