austinmw opened 1 year ago
Hey @austinmw. If I understand your request correctly, you want to re-use the datapipes that we have built for COCO, but for your custom data, right? Meaning, you provide the same structure as the original dataset, but with different data?
If yes, I don't think there is a non-hacky way so far. We have currently frozen all development on the prototype datasets as we are focusing on the transforms revamp #6753. Plus, we also need to figure out some performance issues with the prototype datasets before we continue there. Meaning, we don't know when or if we can implement a feature like that.
In the meantime, if you are comfortable with a hacky solution, I suggest subclassing `Coco` and overwriting the `_resources` method. You can find the implementations of the `OnlineResource`s in `torchvision.prototype.datasets.utils`. Subclass that and let its `file_name` point to `instances_train2017_small.json` and `train_2017_small` respectively. If the file or directory exists, you bypass all the download logic and directly trigger the loading. If your data structure matches the one from COCO, you should be good to go.
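To illustrate the bypass, here is a minimal standalone sketch of the pattern. The class name `LocalResource` and the `load` signature are simplifications invented for this example; the real `OnlineResource` subclasses in `torchvision.prototype.datasets.utils` have more moving parts.

```python
import pathlib


# Hypothetical sketch of the "skip the download if the data is already there"
# pattern described above; this mimics, but is not, the real OnlineResource API.
class LocalResource:
    def __init__(self, file_name: str) -> None:
        self.file_name = file_name

    def load(self, root: str) -> pathlib.Path:
        path = pathlib.Path(root) / self.file_name
        if path.exists():
            # File or directory already present: bypass all download logic
            # and hand the path straight to the loading step.
            return path
        raise RuntimeError(f"{path} not found; place your data there manually")
```

With `file_name` set to `instances_train2017_small.json` or `train_2017_small`, `load` returns the existing path without ever attempting a download.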
Hey @pmeier Thanks, yep you understood correctly, that's exactly what I'd like to do. I hope that a feature to use our own data while reusing datapipes (such as coco) will be available in the future!
For anyone who might be interested, I got this working using this dataset: https://github.com/austinmw/tiny-coco
And the following code:
```python
import csv
import os
import pathlib
import re
from collections import OrderedDict
from typing import Any, Dict, List, Sequence, Tuple, Union

import requests
import torch
from torchdata.datapipes.iter import (
    Demultiplexer,
    Filter,
    Grouper,
    IterDataPipe,
    IterKeyZipper,
    JsonParser,
    Mapper,
    UnBatcher,
)
from torchvision.prototype import datasets
from torchvision.prototype.datapoints import BoundingBox, Label, Mask
from torchvision.prototype.datapoints._datapoint import Datapoint
from torchvision.prototype.datasets import Coco, register_dataset, register_info
from torchvision.prototype.datasets.utils import ManualDownloadResource, OnlineResource
from torchvision.prototype.datasets.utils._internal import (
    getitem,
    hint_sharding,
    hint_shuffling,
    INFINITE_BUFFER_SIZE,
    MappingIterator,
    path_accessor,
)

NAME = "tiny_coco"
WEB_DIR = "https://raw.githubusercontent.com/pytorch/vision/main/torchvision/prototype/datasets/_builtin"


def read_categories_file(name: str) -> List[Union[str, Sequence[str]]]:
    path = os.path.join(WEB_DIR, f"{name}.categories")
    decoded_content = requests.get(path).content.decode("utf-8")
    return list(csv.reader(decoded_content.splitlines(), delimiter=","))


@register_info(NAME)
def _info() -> Dict[str, Any]:
    categories, super_categories = zip(*read_categories_file("coco"))
    return dict(categories=categories, super_categories=super_categories)


@register_dataset(NAME)
class TinyCoco(Coco):
    def _resources(self) -> List[OnlineResource]:
        images_resource = ManualDownloadResource(file_name="train_2017_small", instructions="Blah")
        meta_resource = ManualDownloadResource(file_name="instances_train2017_small.json", instructions="blah")
        return [images_resource, meta_resource]

    def _segmentation_to_mask(
        self, segmentation: Any, *, is_crowd: bool, spatial_size: Tuple[int, int]
    ) -> torch.Tensor:
        from pycocotools import mask

        if is_crowd:
            segmentation = mask.frPyObjects(segmentation, *spatial_size)
        else:
            segmentation = mask.merge(mask.frPyObjects(segmentation, *spatial_size))
        return torch.from_numpy(mask.decode(segmentation)).to(torch.bool)

    def _decode_instances_anns(self, anns: List[Dict[str, Any]], image_meta: Dict[str, Any]) -> Dict[str, Any]:
        spatial_size = (image_meta["height"], image_meta["width"])
        labels = [ann["category_id"] for ann in anns]
        return dict(
            segmentations=Mask(
                torch.stack(
                    [
                        self._segmentation_to_mask(
                            ann["segmentation"], is_crowd=ann["iscrowd"], spatial_size=spatial_size
                        )
                        for ann in anns
                    ]
                )
            ),
            areas=Datapoint([ann["area"] for ann in anns]),
            crowds=Datapoint([ann["iscrowd"] for ann in anns], dtype=torch.bool),
            bounding_boxes=BoundingBox(
                [ann["bbox"] for ann in anns],
                format="xywh",
                spatial_size=spatial_size,
            ),
            labels=Label(labels, categories=self._categories),
            super_categories=[self._category_to_super_category[self._categories[label]] for label in labels],
            ann_ids=[ann["id"] for ann in anns],
        )

    def _decode_captions_ann(self, anns: List[Dict[str, Any]], image_meta: Dict[str, Any]) -> Dict[str, Any]:
        return dict(
            captions=[ann["caption"] for ann in anns],
            ann_ids=[ann["id"] for ann in anns],
        )

    _ANN_DECODERS = OrderedDict(
        [
            ("instances", _decode_instances_anns),
            ("captions", _decode_captions_ann),
        ]
    )

    # Same as the builtin pattern, but with a "_small" suffix before ".json"
    _META_FILE_PATTERN = re.compile(
        rf"(?P<annotations>({'|'.join(_ANN_DECODERS.keys())}))_(?P<split>[a-zA-Z]+)(?P<year>\d+)_small[.]json"
    )

    def _filter_meta_files(self, data: Tuple[str, Any]) -> bool:
        match = self._META_FILE_PATTERN.match(pathlib.Path(data[0]).name)
        return bool(
            match
            and match["split"] == self._split
            and match["year"] == self._year
            and match["annotations"] == self._annotations
        )

    def _datapipe(self, resource_dps: List[IterDataPipe]) -> IterDataPipe[Dict[str, Any]]:
        images_dp, meta_dp = resource_dps

        if self._annotations is None:
            dp = hint_shuffling(images_dp)
            dp = hint_sharding(dp)
            return Mapper(dp, self._prepare_image)

        meta_dp = Filter(meta_dp, self._filter_meta_files)
        meta_dp = JsonParser(meta_dp)
        meta_dp = Mapper(meta_dp, getitem(1))
        meta_dp: IterDataPipe[Dict[str, Dict[str, Any]]] = MappingIterator(meta_dp)
        images_meta_dp, anns_meta_dp = Demultiplexer(
            meta_dp,
            2,
            self._classify_meta,
            drop_none=True,
            buffer_size=INFINITE_BUFFER_SIZE,
        )

        images_meta_dp = Mapper(images_meta_dp, getitem(1))
        images_meta_dp = UnBatcher(images_meta_dp)

        anns_meta_dp = Mapper(anns_meta_dp, getitem(1))
        anns_meta_dp = UnBatcher(anns_meta_dp)
        anns_meta_dp = Grouper(anns_meta_dp, group_key_fn=getitem("image_id"), buffer_size=INFINITE_BUFFER_SIZE)
        anns_meta_dp = hint_shuffling(anns_meta_dp)
        anns_meta_dp = hint_sharding(anns_meta_dp)

        anns_dp = IterKeyZipper(
            anns_meta_dp,
            images_meta_dp,
            key_fn=getitem(0, "image_id"),
            ref_key_fn=getitem("id"),
            buffer_size=INFINITE_BUFFER_SIZE,
        )
        dp = IterKeyZipper(
            anns_dp,
            images_dp,
            key_fn=getitem(1, "file_name"),
            ref_key_fn=path_accessor("name"),
            buffer_size=INFINITE_BUFFER_SIZE,
        )
        return Mapper(dp, self._prepare_sample)


print(datasets.list_datasets())
print(datasets.info("tiny_coco"))

dp = datasets.load("tiny_coco", root="data/vision/tiny_coco")
next(iter(dp))
```
It feels like quite a lot of effort just to get a COCO TorchData pipeline though, and I'm not sure many people would go this route 😅
I think you made it more complicated than it has to be:

- The only method you need to overwrite is `_resources`. Given that you subclass from the original dataset, the other methods are already implemented and should work without modification.
- You don't need to register your dataset or the accompanying info. `datasets.load` only provides you an interface to load a dataset by name, but you could also instantiate the class the "regular way".
So, unless I'm missing something, the implementation should look like:
```python
class TinyCoco(Coco):
    def _resources(self) -> List[OnlineResource]:
        images_resource = ManualDownloadResource(file_name="train_2017_small", instructions="Blah")
        meta_resource = ManualDownloadResource(file_name="instances_train2017_small.json", instructions="blah")
        return [images_resource, meta_resource]


dataset = TinyCoco("data/vision/tiny_coco")
```
And that is not too bad. Of course it doesn't solve the general use case of arbitrary input pipes, but for data that you want locally, it should be sufficient.
Oh, so I think I ran into problems because my json is named `instances_train2017_small.json`. Therefore I needed to edit `_META_FILE_PATTERN` to append the `_small`, which then required me to copy/paste `_ANN_DECODERS`, `_decode_instances_anns`, and `_decode_captions_ann`. I didn't need to paste `_decoder`, `_filter_meta_files`, or `_segmentation_to_mask`.

I guess the simplest thing to do is just to stick to the default json naming convention. Thanks for your help!
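For reference, the difference comes down to the `_small` suffix in the filename pattern. A quick standalone check, where the first regex hard-codes the custom pattern used above for the two annotation types, and the second assumes the stock builtin pattern is identical minus the suffix:

```python
import re

# Custom pattern from the subclass above (annotation types hard-coded):
small = re.compile(r"(?P<annotations>(instances|captions))_(?P<split>[a-zA-Z]+)(?P<year>\d+)_small[.]json")
# Assumed stock builtin pattern: the same, without the "_small" suffix:
stock = re.compile(r"(?P<annotations>(instances|captions))_(?P<split>[a-zA-Z]+)(?P<year>\d+)[.]json")

print(bool(small.match("instances_train2017_small.json")))  # True: custom name needs the custom pattern
print(bool(stock.match("instances_train2017_small.json")))  # False: why the builtin filter skipped the file
print(bool(stock.match("instances_train2017.json")))        # True: the default naming convention
```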
My much shorter version is now:
```python
from typing import Any, Dict, List

from torchvision.prototype import datasets
from torchvision.prototype.datasets import Coco, register_dataset, register_info
from torchvision.prototype.datasets.utils import ManualDownloadResource

NAME = "tiny_coco"


@register_info(NAME)
def _info() -> Dict[str, Any]:
    return datasets.info("coco")


@register_dataset(NAME)
class TinyCoco(Coco):
    def _resources(self) -> List[ManualDownloadResource]:
        images_resource = ManualDownloadResource(
            file_name="train_2017_small",
            instructions="Download from https://github.com/austinmw/tiny-coco",
        )
        meta_resource = ManualDownloadResource(
            file_name="instances_train2017.json",
            instructions="Download from https://github.com/austinmw/tiny-coco",
        )
        return [images_resource, meta_resource]


assert NAME in datasets.list_datasets()

dp = datasets.load("tiny_coco", root="data/vision/tiny_coco")
next(iter(dp))
```
Would it make sense at all to add a feature request to create a `GitHubResource` or `SubversionResource` as additional download routes?
> I guess the simplest thing to do is just to stick to the default json naming convention.
Yeah, I implied this above when I said you should keep the structure the same. Since we need a way to differentiate the files from each other, file or folder names are often the way to go. On the bright side, I think we never relied on the order of files, so we never just take the first file and hope it is the right one. Meaning, you can leave out files that are not actually used by us.
> Would it make sense at all to add a feature request to create a GitHubResource, or SubversionResource as additional download routes?
GitHub already provides direct HTTP links for all files, so `HttpResource` should suffice. For example:

```python
meta_resource = HttpResource(
    "https://raw.githubusercontent.com/austinmw/tiny-coco/master/small_coco/instances_train2017.json"
)
```
For the images it is a little trickier in your current state. IMO, the best option would be to just zip them like the original dataset does. That way you can use the same technique as above, but for the zip of the images.
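Producing such a zip needs nothing beyond the standard library. A sketch, assuming the images live in a local `train_2017_small/` directory (the paths are illustrative):

```python
import pathlib
import zipfile


def zip_directory(src_dir: str, archive: str) -> None:
    """Zip every file under src_dir, keeping paths relative to the directory's
    parent so the archive unpacks to a single top-level folder, similar to how
    the original COCO zips are laid out."""
    src = pathlib.Path(src_dir)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(src.parent))


# e.g. zip_directory("train_2017_small", "train_2017_small.zip")
```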
If you don't want to do that, something like a `GitHubFolderResource` might be an option, which gets pointed to a folder and then either traverses it (not sure if the GH API allows that) or just clones the repo and moves the files. That being said, GH repos are really bad for hosting large amounts of binary data, and thus regular datasets are hosted elsewhere. Meaning, it makes little sense for us to provide such a utility in our library.
Thanks again, that makes sense. Btw I'm currently downloading the data like this:
```shell
svn checkout https://github.com/austinmw/tiny-coco.git/trunk/small_coco ./tiny_coco
```
🤯 I was not aware GH even has an API for SVN. You might want to switch to `git`, as the SVN API is deprecated and will be removed in roughly a year.
Ahh, thanks for the heads up. That's a shame; I couldn't find an analogous one-liner for downloading a specific folder with `git`.
The blog post I linked contains the following paragraph:

> Why do people still use Subversion on GitHub, anyhow? Besides simple inertia, there were workflows which Git didn't support until recently. The main thing we heard when speaking with customers and communities was checking out a subset of the repository: a single directory, or only the latest commit. I have good news: with sparse checkout, sparse index, and partial clone, Git can now do a pretty decent job at these workflows.
I'm no expert on this, but it seems one of these commands was built to do just what you need.
📚 The doc issue

Hi, is it possible to subclass/reuse `torchvision.prototype.datasets.Coco` to create a custom COCO-format DataPipe? For example, if I have the following data:

Suggest a potential alternative/fix

I think a lot of people could benefit from understanding how to reuse common format classes for DataPipes.
cc @pmeier @bjuncek