Open rom1504 opened 1 year ago
one alternative I'm considering (but maybe that can simply live next to batch and service instead of replacing them): implement things as tasks and then use a distributed queue like celery to run things. I don't like making installing a service like reddis part of user requirements though
maybe ray https://stackoverflow.com/a/54738967 is an alternative to celery
https://github.com/veonua/laion-yt-dlp example for celery
I think this branch is an interesting POC but it's too many changes
I think a PR that would only extract the core as a standalone function / process would be a good start. It may need dropping the keys based on shard first
https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/downloader.py#L29-L51 I'm not sure I like the pydantic part
maybe https://abseil.io/docs/python/quickstart is a good alternative
https://docs.celeryq.dev/en/stable/userguide/tasks.html celery take on how to configure tasks
the flags should be much more separated process by process
https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/resizer.py#L98 gets duplicated compared to https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/resizer.py#L13C7-L13C22 ; how can we improve it
https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/downloader.py#L133 fiddle/omegaconf might automate this line
next step here: try to POC a few different ideas of parameter definition instead of pydantic feels like something simpler is possible
anything can be generated into a fastapi model at the end if necessary https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/service/service.py#L37
maybe check a few projects using ray or celery and see if something helps
configuration for processes
one idea:
input and output calls are generated based on feature names
so have a generic config for processes
then apply that here but also any2dataset/video2dataset
try to write that out as json maybe
so all actual config here would be one json per process
we would also have support top level for the flat args at top level for legacy reasons
write it out and see how that plays out for all the projects
being able to configure these processes properly here should make all the wrapper around it (service, batch, distributors) come along naturally
resizer options:
image_size,
resize_mode,
resize_only_if_bigger,
upscale_interpolation="lanczos",
downscale_interpolation="area",
encode_quality=95,
encode_format="jpg",
skip_reencode=False,
disable_all_reencoding=False,
min_image_size=0,
max_image_area=float("inf"),
max_aspect_ratio=float("inf"),
blurrer=None, <- this should be replaced by blurrer config or moved out ?
img_stream, blurring_bbox_list=None <- input feature img_str, width, height, original_width, original_height, None <- output feature
this is a process that applies to one image
json for this?
maybe a typed thing is better here?
write down some of these and some potential schema and find what sticks
Refactor img2dataset in 3 parts:
The main ideas here are:
The implementation is mostly working but requires refinement before considering merging.
Was implemented some 8 months ago, but I'm thinking to finish it now