octo-models / octo

Octo is a transformer-based robot policy trained on a diverse mix of 800k robot trajectories.
https://octo-models.github.io/

How to efficiently iterate on new dataset versions? #130

Closed peter-mitrano-bg closed 2 months ago

peter-mitrano-bg commented 2 months ago

I'm struggling to figure out how to quickly process/generate changes to my datasets using tfds and the Octo dataloader. I see that Octo's data processing/loader has the following code to handle loading an image from a file path instead of storing the actual image data. Is this an efficient approach? If so, how do I configure my Features dict so that tfds correctly handles the path to the image?

        if image.dtype == tf.string:
            if tf.strings.length(image) == 0:
                # this is a padding image
                image = tf.zeros((*resize_size.get(name, (1, 1)), 3), dtype=tf.uint8)
            else:
                image = tf.io.decode_image(
                    image, expand_animations=False, dtype=tf.uint8
                )

Otherwise, the tfds build step seems to load every single image in my dataset when I build, which takes a very long time and results in the tfrecord files being huge (because they contain a copy of the data from the original image files).

I would like to iterate quickly on small-ish datasets and I'm finding that the tfds step is a huge time sink. Looking for some suggestions here! Thanks!

dibyaghosh commented 2 months ago

tf.io.decode_image doesn't load images from the filesystem; it just decodes raw JPEG bytes that are already stored in the tfrecord.

Re: tfrecord size: In TFDS, images are stored either as encoded image bytes (when using tfds.features.Image) or as raw arrays (e.g. when using tfds.features.Tensor(shape=(256, 256, 3), dtype=uint8)). When images are stored as encoded image bytes, the tfrecords shouldn't be any larger than the "raw" data you're storing. If your tfrecords are getting larger than your raw data, that's something to investigate.
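
For concreteness, here's a minimal sketch (not Octo's actual builder code) of the two ways the image feature could be declared; this choice is what determines how much space the tfrecords take:

    import numpy as np
    import tensorflow_datasets as tfds

    # Encoded bytes: tfrecord size stays close to the size of the original image files.
    features_encoded = tfds.features.FeaturesDict({
        "image": tfds.features.Image(shape=(256, 256, 3), dtype=np.uint8),
    })

    # Raw arrays: every image is stored as 256*256*3 bytes of uncompressed pixels.
    features_raw = tfds.features.FeaturesDict({
        "image": tfds.features.Tensor(shape=(256, 256, 3), dtype=np.uint8),
    })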

Re: tfds build time: I'm not sure what builder you're using to create your tfds dataset, but one gotcha to be aware of:

def _generate_examples(...):
    for traj in trajs:
        # Pattern 1: decode every image at build time (e.g. with PIL)
        images = [Image.open(fname) for fname in fnames]
        yield {
            "image": images
        }

        # VS

        # Pattern 2: pass the file paths; tfds stores the encoded bytes directly
        yield {
            "image": fnames
        }

In Pattern 1, the build process needs to load and then re-encode all the images at build time, which can take a long time (especially because this isn't parallelized). In Pattern 2, tfds stores the encoded image bytes from the file directly (without decoding the image), which is usually pretty fast in the use cases I've tried. For small datasets, if you're not encoding/decoding during build time, builds should be pretty speedy.

(In theory, your proposed idea of just saving the filename could work, recovering the image at load time with tf.io.read_file(image_fname), but this approach doesn't mesh with a lot of the things that make tfds nice.)
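
For illustration, a minimal sketch of what that filename-based approach might look like at load time (this is not Octo's dataloader; the image_path feature name and the map call are hypothetical):

    import tensorflow as tf

    def load_image(example):
        # Read the raw bytes from the stored path and decode them on the fly.
        raw = tf.io.read_file(example["image_path"])
        example["image"] = tf.io.decode_image(raw, expand_animations=False, dtype=tf.uint8)
        return example

    # dataset = dataset.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)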

peter-mitrano-bg commented 2 months ago

Thanks for the response! I am using tfds.features.Image, and I'm not making the mistake you pointed out here (I learned that myself the hard way already).

BUT it seems I was measuring incorrectly -- the size of the tfrecords is very similar to the size of the original data. If we have to load and resave all the images, then the current runtime is appropriate.

Can you say more about why using tf.io.read_file to load wouldn't mesh well with tfds? It seems silly to load, decode, re-encode, and save thousands of images.

dibyaghosh commented 2 months ago

Can you say more about why using tf.io.read_file to load wouldn't mesh well with tfds? It seems silly to load, decode, re-encode, and save thousands of images.

Obvious disclaimer is that I've never actually tried this workflow (and it might work for you, especially for fast debugging).

Two main places TFDS shines are (1) portability/shareability and (2) efficiency on "slow" filesystems.

So for (1), if you hard-code paths to image files, it's hard to share the tfrecords with a new person or move them to a new machine or to networked storage -- you'd have to separately copy the image folder as well, and then match the location of the image files exactly on every machine, or use an environment variable or something (which seems unclean).

For (2), we often host our data on cloud storage (like GCS) or on networked NFS shares rather than on local SSDs. On these filesystems, it's much faster to load one 100 MB file than to load 1,000 100 KB files -- so we'd see performance regressions if we read each individual image separately rather than one tfrecord at a time.
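
As a rough illustration of that difference, a tfds/tf.data pipeline reads a few large tfrecord shards in bulk rather than issuing one small read per image (the paths below are hypothetical):

    import tensorflow as tf

    # A handful of large shards, read sequentially / interleaved:
    shards = tf.data.Dataset.list_files("gs://my-bucket/my_dataset/*.tfrecord*")
    ds = shards.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE,
    )

    # vs. one filesystem round-trip per image:
    # image_bytes = tf.io.read_file("gs://my-bucket/images/traj_000/frame_000.jpg")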

If these aren't concerns for you, it might work for your personal workflow.

peter-mitrano-bg commented 2 months ago

Thanks -- I consider this question answered!