rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.71k stars, 338 forks

Streaming refacto #353

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

Refactor img2dataset in 3 parts:

The main ideas here are:

The implementation is mostly working but requires refinement before considering merging.

This was implemented some 8 months ago, but I'm thinking of finishing it now

rom1504 commented 1 year ago

Related issue: #339

rom1504 commented 1 year ago

one alternative I'm considering (though it could simply live next to batch and service instead of replacing them): implement things as tasks, then use a distributed task queue like Celery to run them. I don't like making the installation of a service like Redis part of the user requirements, though
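A rough sketch of what "things as tasks" could look like. This is not Celery or Ray code: a standard-library thread pool stands in for the distributed queue, and `download_task`, `resize_task` and `run_pipeline` are hypothetical names that do no real downloading or resizing. The point is only that tasks exchange plain, serializable values, which is what a queue would need.

```python
from concurrent.futures import ThreadPoolExecutor


def download_task(url: str) -> dict:
    """Hypothetical task: pretend to fetch `url` and return a sample dict."""
    # No network call here; the payload is a placeholder for the image bytes.
    return {"url": url, "img_bytes": url.encode("utf-8")}


def resize_task(sample: dict, image_size: int = 256) -> dict:
    """Hypothetical task: record the target size instead of really resizing."""
    return {**sample, "image_size": image_size}


def run_pipeline(urls, max_workers=4):
    """Run both tasks over all urls, keeping the input order."""
    # The executor plays the role a distributed queue would play:
    # each task takes and returns plain dicts.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        downloaded = list(pool.map(download_task, urls))
        return list(pool.map(resize_task, downloaded))
```

Swapping the executor for a real queue would mostly change how the tasks are submitted, not the task functions themselves.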

rom1504 commented 1 year ago

maybe ray https://stackoverflow.com/a/54738967 is an alternative to celery

rom1504 commented 1 year ago

https://docs.ray.io/en/latest/ray-core/walkthrough.html

rom1504 commented 10 months ago

https://github.com/veonua/laion-yt-dlp example for celery

rom1504 commented 10 months ago

I think this branch is an interesting POC but it's too many changes

I think a PR that would only extract the core as a standalone function / process would be a good start. It may need dropping the keys based on shard first

rom1504 commented 10 months ago

https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/downloader.py#L29-L51 I'm not sure I like the pydantic part

rom1504 commented 10 months ago

maybe https://abseil.io/docs/python/quickstart is a good alternative

rom1504 commented 10 months ago

https://docs.celeryq.dev/en/stable/userguide/tasks.html Celery's take on how to configure tasks

rom1504 commented 10 months ago

ray's take https://docs.ray.io/en/latest/ray-core/tasks.html

rom1504 commented 10 months ago

https://docs.celeryq.dev/en/stable/userguide/calling.html#calling-serializers

rom1504 commented 10 months ago

the flags should be much more separated process by process

rom1504 commented 10 months ago

https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/resizer.py#L98 duplicates https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/resizer.py#L13C7-L13C22 ; how can we improve this?

rom1504 commented 10 months ago

https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/downloader.py#L133 fiddle/omegaconf might automate this line

rom1504 commented 10 months ago

next step here: try to POC a few different ideas for parameter definition instead of pydantic. It feels like something simpler is possible
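One possible "simpler" idea, as a sketch only: treat the function signature itself as the parameter definition and read the defaults with `inspect.signature`. The `resize` function below is a stand-in, its default values are illustrative, and `options_from_signature` and `build_kwargs` are hypothetical helpers, not part of img2dataset.

```python
import inspect


def resize(img_stream, image_size=256, resize_mode="border", encode_quality=95):
    """Stand-in for a process function; the body is irrelevant here."""
    return img_stream


def options_from_signature(func):
    """Collect parameters that have defaults as {name: default}."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


def build_kwargs(func, overrides):
    """Merge user overrides into the defaults, rejecting unknown names."""
    options = options_from_signature(func)
    unknown = set(overrides) - set(options)
    if unknown:
        raise ValueError(f"unknown options for {func.__name__}: {sorted(unknown)}")
    return {**options, **overrides}
```

This keeps a single source of truth (the signature) but gives no type validation, so it is a trade-off rather than a drop-in replacement for pydantic.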

rom1504 commented 10 months ago

anything can be generated into a fastapi model at the end if necessary https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/service/service.py#L37

rom1504 commented 10 months ago

maybe check a few projects using ray or celery and see if something helps

rom1504 commented 10 months ago

configuration for processes

rom1504 commented 10 months ago

https://docs.ray.io/en/latest/ray-core/patterns/pipelining.html

rom1504 commented 10 months ago

one idea:

input and output calls are generated based on feature names

so have a generic config for processes

then apply that here but also any2dataset/video2dataset
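A minimal sketch of that idea, assuming a process is a function plus declared feature names. `ProcessSpec`, `run_process` and `fake_resize` are hypothetical names for illustration; the runner builds the input call by looking features up by name and stores the results under the declared output names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ProcessSpec:
    """Generic description of one process (all names here are hypothetical)."""
    name: str
    func: Callable[..., tuple]
    input_features: List[str]
    output_features: List[str]
    options: Dict[str, Any] = field(default_factory=dict)


def run_process(spec: ProcessSpec, sample: Dict[str, Any]) -> Dict[str, Any]:
    # Input call: pick arguments out of the sample by feature name.
    args = [sample[name] for name in spec.input_features]
    outputs = spec.func(*args, **spec.options)
    # Output call: write results back under the declared feature names.
    return {**sample, **dict(zip(spec.output_features, outputs))}


def fake_resize(img_stream: bytes, image_size: int = 256):
    """Placeholder: returns the bytes unchanged plus the requested size."""
    return img_stream, image_size, image_size


resizer = ProcessSpec(
    name="resizer",
    func=fake_resize,
    input_features=["img_stream"],
    output_features=["img_str", "width", "height"],
    options={"image_size": 128},
)
```

Because the spec only refers to features by name, the same runner could in principle drive processes from any2dataset or video2dataset as well.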

rom1504 commented 10 months ago

try to write that out as json maybe

so all actual config here would be one json per process

we would also support the flat args at top level for legacy reasons
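To see how that could play out, here is a small sketch. The grouping of flags into `downloader`, `resizer` and `writer` is an assumption made for illustration, and `FLAT_ARG_OWNERS` and `build_process_configs` are hypothetical names. Legacy flat args are folded into the per-process configs, and each process then gets its own JSON document.

```python
import json

# Hypothetical mapping from each legacy flat flag to the process that owns it.
FLAT_ARG_OWNERS = {
    "timeout": "downloader",
    "retries": "downloader",
    "image_size": "resizer",
    "resize_mode": "resizer",
    "output_format": "writer",
}


def build_process_configs(nested=None, **flat_args):
    """Return {process_name: options}, folding legacy flat args into it."""
    configs = {name: dict(opts) for name, opts in (nested or {}).items()}
    for arg, value in flat_args.items():
        if arg not in FLAT_ARG_OWNERS:
            raise ValueError(f"unknown flat argument: {arg}")
        # Flat args are applied last, so they win over the nested value.
        configs.setdefault(FLAT_ARG_OWNERS[arg], {})[arg] = value
    return configs


configs = build_process_configs(
    nested={"resizer": {"image_size": 256}},
    timeout=10,
    resize_mode="border",
)
# One JSON document per process.
per_process_json = {
    name: json.dumps(opts, sort_keys=True) for name, opts in configs.items()
}
```

Whether flat args or the nested config should take precedence is an open choice; this sketch lets the flat args win.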

rom1504 commented 10 months ago

write it out and see how that plays out for all the projects

rom1504 commented 10 months ago

being able to configure these processes properly here should make all the wrappers around them (service, batch, distributors) come along naturally

rom1504 commented 10 months ago

https://developers.google.com/mediapipe/api/solutions/python/mp/calculators/core/flow_limiter_calculator_pb2/FlowLimiterCalculatorOptions

rom1504 commented 10 months ago

https://github.com/google/mediapipe/blob/master/mediapipe/calculators/tensorflow/matrix_to_tensor_calculator_options.proto

rom1504 commented 10 months ago

https://github.com/google/mediapipe/blob/93290388178395730c9bed0be0042836cf710465/mediapipe/framework/calculator_context.h#L37-L39

rom1504 commented 10 months ago

https://developers.google.com/mediapipe/framework/framework_concepts/calculators.md

rom1504 commented 10 months ago

https://beam.apache.org/documentation/programming-guide/#pardo

rom1504 commented 10 months ago

https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html

rom1504 commented 10 months ago

resizer options:

        image_size,
        resize_mode,
        resize_only_if_bigger,
        upscale_interpolation="lanczos",
        downscale_interpolation="area",
        encode_quality=95,
        encode_format="jpg",
        skip_reencode=False,
        disable_all_reencoding=False,
        min_image_size=0,
        max_image_area=float("inf"),
        max_aspect_ratio=float("inf"),
        blurrer=None, <- this should be replaced by a blurrer config or moved out?

img_stream, blurring_bbox_list=None <- input features

img_str, width, height, original_width, original_height, None <- output features

this is a process that applies to one image

json for this?
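One way the JSON could look, as a sketch. The defaults are copied from the option list above; `image_size`, `resize_mode` and `resize_only_if_bigger` have no default there, so the values below are placeholders. `blurrer` is left out since it is still open whether it belongs here. One wrinkle: `float("inf")` is not valid standard JSON, so this sketch assumes `null` means "no limit". `to_strict_json` and `from_strict_json` are hypothetical helpers.

```python
import json
import math

resizer_options = {
    "image_size": 256,               # placeholder, no default in the list
    "resize_mode": "border",         # placeholder, no default in the list
    "resize_only_if_bigger": False,  # placeholder, no default in the list
    "upscale_interpolation": "lanczos",
    "downscale_interpolation": "area",
    "encode_quality": 95,
    "encode_format": "jpg",
    "skip_reencode": False,
    "disable_all_reencoding": False,
    "min_image_size": 0,
    "max_image_area": float("inf"),
    "max_aspect_ratio": float("inf"),
}


def to_strict_json(options):
    """Encode options as standard JSON, writing infinite limits as null."""
    cleaned = {
        key: None if isinstance(value, float) and math.isinf(value) else value
        for key, value in options.items()
    }
    # allow_nan=False makes json.dumps raise instead of emitting Infinity/NaN.
    return json.dumps(cleaned, allow_nan=False, indent=2)


def from_strict_json(text, unlimited_keys=("max_image_area", "max_aspect_ratio")):
    """Decode, turning null (or a missing key) back into infinity for limits."""
    options = json.loads(text)
    for key in unlimited_keys:
        if options.get(key) is None:
            options[key] = float("inf")
    return options
```

By default Python's `json.dumps` would write `Infinity`, which many non-Python JSON parsers reject, hence the explicit `null` convention here.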

rom1504 commented 10 months ago

maybe a typed thing is better here?
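For comparison, a typed version could be a plain standard-library dataclass. This is a sketch under assumptions: the field types are inferred from the defaults listed earlier in the thread, the first three fields have no default there so they stay required, `blurrer` is omitted, and the type check in `__post_init__` is a deliberately simple `isinstance` pass rather than full validation.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass(frozen=True)
class ResizerOptions:
    """Hypothetical typed container for the resizer options."""
    image_size: int
    resize_mode: str
    resize_only_if_bigger: bool
    upscale_interpolation: str = "lanczos"
    downscale_interpolation: str = "area"
    encode_quality: int = 95
    encode_format: str = "jpg"
    skip_reencode: bool = False
    disable_all_reencoding: bool = False
    min_image_size: int = 0
    max_image_area: float = float("inf")
    max_aspect_ratio: float = float("inf")

    def __post_init__(self):
        hints = get_type_hints(type(self))
        for f in fields(self):
            value = getattr(self, f.name)
            expected = hints[f.name]
            if expected is bool:
                ok = isinstance(value, bool)
            elif expected is float:
                # Accept ints where a float is declared, but not bools.
                ok = isinstance(value, (int, float)) and not isinstance(value, bool)
            else:
                # bool is a subclass of int, so exclude it explicitly.
                ok = isinstance(value, expected) and not isinstance(value, bool)
            if not ok:
                raise TypeError(
                    f"{f.name} expects {expected.__name__}, got {type(value).__name__}"
                )
```

Usage would be `ResizerOptions(image_size=256, resize_mode="border", resize_only_if_bigger=False)`; a wrong type such as `image_size="256"` raises `TypeError` at construction time.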

rom1504 commented 10 months ago

https://github.com/rom1504/img2dataset/tree/main/img2dataset

rom1504 commented 10 months ago

https://github.com/iejMac/video2dataset/tree/main/video2dataset/subsamplers

rom1504 commented 10 months ago

write down some of these and some potential schema and find what sticks

rom1504 commented 10 months ago

https://scanner-research.github.io/guide/graphs.html#graphs

rom1504 commented 10 months ago

https://github.com/videoflow/videoflow/tree/master

rom1504 commented 10 months ago

https://www.anyscale.com/blog/streaming-distributed-execution-across-cpus-and-gpus