Redundancy in the naming structure of synthesized data

havanagrawal commented 5 years ago

Problem

The synthesized data directory looks something like: data/synth_data_2019_01_09_22_45_08/image_0_2019_01_09_22_45_08/train_image/image_0_2019_01_09_22_45_08.jpg

This feels incredibly noisy to me. Can we instead favor something more concise, such as: data/synth_data_2019_01_09_22_45_08/image_0/train_image/image_0.jpg

In other words, I don't see the point of embedding the timestamp at three levels of the path.

@vivanvish Was there any particular reason (that perhaps I am completely missing) for requiring the timestamp at each level?

Solution

Change the image path format to: data/synth_data_{timestamp}/image_{k}/train_image/image_{k}.jpg
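A minimal sketch of building a path in this format. The helper name `synth_image_path` and its `root` parameter are hypothetical and not taken from the repository; only the path layout comes from the proposal above.

```python
from pathlib import PurePosixPath


def synth_image_path(timestamp: str, k: int, root: str = "data") -> PurePosixPath:
    """Build the proposed path; the timestamp appears only in the top-level directory."""
    stem = f"image_{k}"
    return (
        PurePosixPath(root)
        / f"synth_data_{timestamp}"
        / stem
        / "train_image"
        / f"{stem}.jpg"
    )


print(synth_image_path("2019_01_09_22_45_08", 0))
# prints: data/synth_data_2019_01_09_22_45_08/image_0/train_image/image_0.jpg
```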

@pshivraj I'm assuming this makes no difference to your training pipeline, since as far as I recall you were using os.walk?
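To illustrate why an os.walk-based loader should not care about the naming scheme, here is a rough sketch that collects images by directory name and extension only. The function `find_train_images` is a hypothetical stand-in, not the actual training pipeline code.

```python
import os


def find_train_images(root):
    """Collect every .jpg that sits inside a train_image directory, whatever it is named."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only the folder name matters here, not the file or parent directory names.
        if os.path.basename(dirpath) != "train_image":
            continue
        for name in filenames:
            if name.lower().endswith(".jpg"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

A loader written like this picks up image_0.jpg and image_0_2019_01_09_22_45_08.jpg alike. A pipeline that parses file names would behave differently.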

vivanvish commented 5 years ago

@havanagrawal The reason I added the timestamps to the image names as well was to ensure that image names are unique across datasets. That avoids name collisions if we ever want to create a single dataset by copying images from several different ones. Does that make sense?
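A small sketch of the merge scenario, assuming the merge is a flat copy of every image into one directory. The helper `merge_flat` is hypothetical and only meant to show where a collision would surface.

```python
import shutil
from pathlib import Path


def merge_flat(dataset_dirs, dest):
    """Copy every .jpg from several datasets into one flat directory, refusing to overwrite."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for dataset in dataset_dirs:
        for src in sorted(Path(dataset).rglob("*.jpg")):
            target = dest / src.name
            if target.exists():
                # Two datasets produced the same file name, e.g. image_0.jpg.
                raise FileExistsError(f"name collision: {src.name}")
            shutil.copy2(src, target)
            copied.append(target)
    return copied
```

With timestamped file names, two datasets generated at different times never produce the same name, so the copy succeeds. With plain image_0.jpg in both, the second copy raises.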

havanagrawal commented 5 years ago

Interesting. In that case, we can remove it at the lowest level (the file name) while retaining the ability to merge datasets, since the per-image directory still carries the timestamp. Going from data/synth_data_2019_01_09_22_45_08/image_0_2019_01_09_22_45_08/train_image/image_0_2019_01_09_22_45_08.jpg to data/synth_data_2019_01_09_22_45_08/image_0_2019_01_09_22_45_08/train_image/image_0.jpg should be possible, right?
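As a rough sketch of how a merge could still get unique names under this layout, the name can be derived from the per-image directory instead of the file itself. The helper `unique_name` is hypothetical and assumes the image always sits two levels below that directory.

```python
from pathlib import PurePosixPath


def unique_name(image_path: str) -> str:
    """Derive a collision-free file name from the timestamped per-image directory."""
    p = PurePosixPath(image_path)
    # parents[0] is train_image, parents[1] is image_{k}_{timestamp}.
    return p.parents[1].name + p.suffix


path = "data/synth_data_2019_01_09_22_45_08/image_0_2019_01_09_22_45_08/train_image/image_0.jpg"
print(unique_name(path))
# prints: image_0_2019_01_09_22_45_08.jpg
```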

pshivraj commented 5 years ago

I would just want the class ID delimiter to be a triple underscore ('___') in the train_mask folder.

havanagrawal commented 5 years ago

As discussed, the timestamps are necessary, especially when we generate data in parallel and then merge the results at the end. Closing.

havanagrawal / clomask

Redundancy in the naming structure of synthesized data #30