tensorflow / similarity

TensorFlow Similarity is a python package focused on making similarity learning quick and easy.
Apache License 2.0

Difficulty using "tfsim.samplers.TFRecordDatasetSampler" #316

Closed tonylincon1 closed 1 year ago

tonylincon1 commented 1 year ago

Hello, how are you?

I really like the TensorFlow Similarity solution for making recommendations. However, I am having a hard time using tfsim.samplers.TFRecordDatasetSampler, which I need because I have too much data to keep in memory.

I tried saving ".tfrecords" files as follows:

import tensorflow as tf

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):  # if value is a tensor
    value = value.numpy()  # get the value of the tensor
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_array(array):
  """Serializes a tensor or array to a scalar string tensor."""
  return tf.io.serialize_tensor(array)

def parse_single_image(image, label):
  # define the dictionary -- the structure -- of our single example
  data = {
      'height': _int64_feature(image.shape[0]),
      'width': _int64_feature(image.shape[1]),
      'depth': _int64_feature(image.shape[2]),
      'raw_image': _bytes_feature(serialize_array(image)),
      'label': _int64_feature(label),
  }
  # create an Example, wrapping the single features
  out = tf.train.Example(features=tf.train.Features(feature=data))

  return out

def write_images_to_tfr_short(images, labels, filename: str = "images"):
  filename = filename + ".tfrecords"
  count = 0

  # create a writer that stores our data on disk; it is closed on leaving the block
  with tf.io.TFRecordWriter(filename) as writer:
    for current_image, current_label in zip(images, labels):
      out = parse_single_image(image=current_image, label=current_label)
      writer.write(out.SerializeToString())
      count += 1

  print(f"Wrote {count} elements to TFRecord")
  return count

count = write_images_to_tfr_short(x_train, y_train, filename=f"{data_path}small_images")

With this I was able to save two files with my images, and then I wrote the decoding function:

def parse_tfr_element(element):
  #use the same structure as above; it's kinda an outline of the structure we now want to create
  data = {
      'height': tf.io.FixedLenFeature([], tf.int64),
      'width':tf.io.FixedLenFeature([], tf.int64),
      'label':tf.io.FixedLenFeature([], tf.int64),
      'raw_image' : tf.io.FixedLenFeature([], tf.string),
      'depth':tf.io.FixedLenFeature([], tf.int64),
    }

  content = tf.io.parse_single_example(element, data)

  height = content['height']
  width = content['width']
  depth = content['depth']
  label = content['label']
  raw_image = content['raw_image']

  #get our 'feature'-- our image -- and reshape it appropriately
  feature = tf.io.parse_tensor(raw_image, out_type=tf.uint8)
  feature = tf.reshape(feature, shape=[height,width,depth])
  return (feature, label)

def get_dataset_small(filename):
  #create the dataset
  dataset = tf.data.TFRecordDataset(filename)

  #pass every single feature through our mapping function
  dataset = dataset.map(
      parse_tfr_element
  )

  return dataset

When I try to use tfsim.samplers.TFRecordDatasetSampler, the following error occurs:

sampler = tfsim.samplers.TFRecordDatasetSampler(
    shard_path=data_path,
    deserialization_fn=get_dataset_small,
)

InvalidArgumentError: buffer_size must be greater than zero. [Op:ShuffleDatasetV3]

tonylincon1 commented 1 year ago

Any response?

owenvallis commented 1 year ago

Hi Tony,

Thanks for using TF Sim. I think the issue might be with how you are writing your TFRecord files. The TFRecordDatasetSampler uses the interleave function to randomly sample examples from K different TFRecord files, where K is equal to or greater than the number of classes you have in your dataset. Additionally, the length of each TFRecord file must be an integer multiple of the number of examples per class per batch.
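
One way to satisfy both constraints is to write one shard per class and trim each shard to a multiple of the per-class batch count. The sketch below only plans which example indices go into which file; it is an illustration, not the library's own tooling. The helper name plan_shards, the file naming scheme, and the ".tfrec" suffix are assumptions. Each planned shard could then be written with the same tf.io.TFRecordWriter loop used in the original post.

```python
from collections import defaultdict


def plan_shards(labels, examples_per_class):
    """Group example indices by label, one planned shard per class.

    Each shard is trimmed so its length is an integer multiple of
    examples_per_class; classes with fewer examples than that are skipped.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[int(label)].append(index)

    shards = {}
    for label, indices in sorted(by_class.items()):
        usable = len(indices) - len(indices) % examples_per_class
        if usable == 0:
            continue  # not enough examples to fill a single group
        # Hypothetical naming scheme; the suffix is an assumption.
        shards[f"class_{label:05d}.tfrec"] = indices[:usable]
    return shards


# Toy labels: class 0 has three examples, class 1 has two, class 2 has one.
plan = plan_shards([0, 1, 0, 1, 0, 2], examples_per_class=2)
```

Here class 0 loses its third example and class 2 is dropped entirely, so every planned shard holds a whole number of two-example groups of a single class.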

Let me know if that unblocks you, and see here for more details: https://github.com/tensorflow/similarity/issues/171 and https://github.com/tensorflow/similarity/issues/213
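
Another thing that may be worth ruling out for the "buffer_size must be greater than zero" error is that no shard files are being found at all. This assumes the sampler discovers shards by globbing shard_path with a suffix pattern and that the default pattern is "*.tfrec"; please check the shard_suffix argument in your installed version. Under that assumption, files saved with a ".tfrecords" extension would not match. The helper find_shards below is hypothetical and uses only the standard library to show the mismatch. Separately, the sampler's deserialization_fn is documented as deserializing a single tf.Example, so parse_tfr_element looks like a closer fit than get_dataset_small.

```python
import glob
import os
import tempfile


def find_shards(shard_path, shard_suffix="*.tfrec"):
    """Return the sorted files in shard_path whose names match shard_suffix."""
    return sorted(glob.glob(os.path.join(shard_path, shard_suffix)))


# Reproduce the layout from the original post in a temporary directory.
with tempfile.TemporaryDirectory() as data_path:
    open(os.path.join(data_path, "small_images.tfrecords"), "wb").close()

    # Assumed default pattern: does not match the ".tfrecords" extension.
    default_matches = find_shards(data_path)
    # Explicit pattern that matches the extension actually used.
    explicit_matches = find_shards(data_path, "*.tfrecords")

print(len(default_matches), len(explicit_matches))
```

If the first count is zero on your real directory, either rename the files or pass a matching suffix pattern to the sampler.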

owenvallis commented 1 year ago

An alternative approach is to use the in-memory sampler (or the new tf.data.Dataset sampler I'm working on in this branch). You can pass the URIs of the images as the X values and then use the load_fn to read them per batch. See here for a working version using the MultiShotMemorySampler for images.
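
The idea behind this approach can be sketched without the library: only the file paths and labels stay in memory, and the images are loaded when a batch is assembled. The code below is a simplified illustration, not the MultiShotMemorySampler implementation; make_batch and fake_load are hypothetical names, and in the library the loading hook is an argument of the sampler (check the documentation for its exact name).

```python
import random
from collections import defaultdict


def make_batch(paths, labels, load_fn, classes_per_batch, examples_per_class,
               rng=random):
    """Build one class-balanced batch, loading examples only when selected."""
    by_class = defaultdict(list)
    for path, label in zip(paths, labels):
        by_class[label].append(path)

    batch_x, batch_y = [], []
    # Pick the classes for this batch, then a few paths from each class.
    for label in rng.sample(sorted(by_class), classes_per_batch):
        for path in rng.sample(by_class[label], examples_per_class):
            batch_x.append(load_fn(path))  # the only place data is read
            batch_y.append(label)
    return batch_x, batch_y


# Stand-in loader; a real one would read and decode the image at `path`.
def fake_load(path):
    return f"pixels of {path}"


paths = [f"img_{i}.png" for i in range(6)]
labels = [0, 0, 1, 1, 2, 2]
x, y = make_batch(paths, labels, fake_load, classes_per_batch=2,
                  examples_per_class=2, rng=random.Random(0))
```

Memory use then scales with the batch size rather than the dataset size, at the cost of reading from disk on every batch.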

tonylincon1 commented 1 year ago

I will try the MultiShotMemorySampler. Thank you for the response xD

owenvallis commented 1 year ago

Thanks. Closing this for now but let us know if you run into any other issues.