what's the meaning of the naming convention for TFRecord files

liuyx599 commented 1 year ago

Hi, long time no see. I encountered some confusion while studying your cool work.

When I was running the programTrain a SegCLR embedding model in
SegCLR wiki， I ran the following code:

tfrecord_pathspec = 'gs://h01-release/data/20230118/training_data/c3_positive_pairs/goog14c3_max200000_skip50.tfrecord@1000'
tfrecord_files = data_input_util.expand_pathspec(tfrecord_pathspec)

The above statement indicates that the program will load the training samples from Google Cloud h01-release&prefix=&forceOnObjectsSortingFiltering=false).

Perhaps as a beginner in TensorFlow2, I don't quite understand the meaning of the sample file name goog14c3_max200000_skip50.tfrecord-00000-of-01000 in Google Cloud Storage. For example, what does max200000 mean and what does skip50 signify? It seems that 0000-of-01000 indicates that this is the 0th file out of 1000 samples, because there are a total of 1000 TFRecord files in that directory. I was surprised to find that they all seem to be around 1.7G in size. Does that mean that each TFRecord file represents randomly sampled pair information from a segment? So in SegCLR, a total of 1000 segments were collected from H01, and the number of pairs sampled from each segment was the same, resulting in the basic consistency of each TFRecord's size.

There may be some misunderstandings in my understanding. Please help me identify them.

chinasaur commented 1 year ago

These are sharded files, so '@1000' means the dataset is split into 1000 different files. You then also see this in the individual file names '00000-of-01000' etc. The training examples are distributed to the 1000 different shards using a fingerprint hashing function designed to distribute them roughly evenly. There are about 9.1 billion total training examples across the 1000 shards, sampled from skeletons all over the h01 volume.

You can see a roughly similar sharding scheme (but not using the same hashing function) in the CSV ZIP archives of the embedding outputs that are demoed here using the simple reader library here.

chinasaur commented 1 year ago

The max200000 and skip50 in the name refer to hyperparameters chosen in the extraction of the example pairs. I'm pretty sure max200k means that the maximum pair distance sampled is 200 um. I think skip50 means that in the sampling of nodes to form example pairs skeletons were first subsampled to 1/50 of the total skeleton nodes. @sdorkenw could say more or correct if I have misinterpreted the meanings.

liuyx599 commented 1 year ago

These are sharded files, so '@1000' means the dataset is split into 1000 different files. You then also see this in the individual file names '00000-of-01000' etc. The training examples are distributed to the 1000 different shards using a fingerprint hashing function designed to distribute them roughly evenly. There are about 9.1 billion total training examples across the 1000 shards, sampled from skeletons all over the h01 volume.

You can see a roughly similar sharding scheme (but not using the same hashing function) in the CSV ZIP archives of the embedding outputs that are demoed here using the simple reader library here.

Thank you very much for your response. I think I understand a bit now. So it also means that these 1000 files do not correspond to 1000 segments, but rather all the sampled pairs are combined and averaged into 1000 files. You mentioned that there are approximately 9.1 billion training files in total, does this already include all the h01 embeddings?

As mentioned in the code link you quoted, I have studied the sharding scheme and learned that each compressed file is named using a hash value. The embeddings of different segment IDs are hashed and stored in the corresponding hash value zip file.(Thank you very, very much agin)

By the way, how can I create a TFRECORD file like yours, that is, how can I sample my skeleton and encode the data as a TFRECORD?

liuyx599 commented 1 year ago

The max200000 and skip50 in the name refer to hyperparameters chosen in the extraction of the example pairs. I'm pretty sure max200k means that the maximum pair distance sampled is 200 um. I think skip50 means that in the sampling of nodes to form example pairs skeletons were first subsampled to 1/50 of the total skeleton nodes. @sdorkenw could say more or correct if I have misinterpreted the meanings.

Thank you very much. As mentioned in the paper, the distances between pairs are roughly evenly distributed in the intervals [0, 10000, 30000, 100000, 150000], so max200000 means that the maximum distance between pairs is 20000, which suddenly makes sense! And skip50 may be because the skeleton nodes are too dense, so down-sampling was performed. Once again, thank you for your answer!

chinasaur commented 1 year ago

So it also means that these 1000 files do not correspond to 1000 segments, but rather all the sampled pairs are combined and averaged into 1000 files. You mentioned that there are approximately 9.1 billion training files in total, does this already include all the h01 embeddings?

This TFRecord table is positive pair examples used for training SegCLR de novo. The precomputed output embeddings (~4 B total for h01) are available separately in the CSV ZIP sharded archives from the other notebook.

The embeddings of different segment IDs are hashed and stored in the corresponding hash value zip file.

Right, the precomputed embedding CSV ZIPs are sharded according to segment IDs with a known sharding function to allow you to look up the right shard for a given ID (all handled by the simple reader module). For the training examples TFRecord table I would guess they are sharded by example pair, so the pairs sampled from a given skeleton would be scattered all throughout the shards.

By the way, how can I create a TFRECORD file like yours, that is, how can I sample my skeleton and encode the data as a TFRECORD?

You can refer to the TensorFlow documentation here and use the format of the TFRecords in our demo table as a guide for how to structure your examples.

liuyx599 commented 1 year ago

So it also means that these 1000 files do not correspond to 1000 segments, but rather all the sampled pairs are combined and averaged into 1000 files. You mentioned that there are approximately 9.1 billion training files in total, does this already include all the h01 embeddings?

This TFRecord table is positive pair examples used for training SegCLR de novo. The precomputed output embeddings (~4 B total for h01) are available separately in the CSV ZIP sharded archives from the other notebook.

The embeddings of different segment IDs are hashed and stored in the corresponding hash value zip file.

Right, the precomputed embedding CSV ZIPs are sharded according to segment IDs with a known sharding function to allow you to look up the right shard for a given ID (all handled by the simple reader module). For the training examples TFRecord table I would guess they are sharded by example pair, so the pairs sampled from a given skeleton would be scattered all throughout the shards.

By the way, how can I create a TFRECORD file like yours, that is, how can I sample my skeleton and encode the data as a TFRECORD?

You can refer to the TensorFlow documentation here and use the format of the TFRecords in our demo table as a guide for how to structure your examples.

Okay, great. Through step-by-step debugging and consulting relevant documentation, I have roughly understood the data structure inside the TFRecord. Because the TFRecord records the coordinates of two nodes center-0 and center-1, as well as the segment ID skeleton_id, coupled with other data loading programs, we can obtain the corresponding EM, mask, and masked EM data. This is very cool.

However, I still have a question about how to determine center-0 and center-1 based on the existing EM volume. In the "Training SegCLR embedding networks" section of the Methods section in the paper, it is mentioned that "We also leveraged the segmentation and corresponding skeletonization to generate example pairs for contrastive training. For an arbitrary segment, we picked a 3d view to be centered on an arbitrary skeleton node, and then picked a positive pair location centered on a second node within 150 μm path length away on the same skeleton." This seems to reveal how to obtain the corresponding center-0 and center-1 in the TFRecord, but it is still a bit abstract for me. If there is corresponding code, perhaps it will be easier to understand. In other words, is there any code available on how the sampling was done to record the attributes of each pair, such as center-0, center-1, skeleton_id, etc.? As mentioned in the paper, it is also important to ensure that "we sorted these positive pairs into four distance buckets from which we drew uniformly. The bucket boundaries were (0, 2500, 10000, 30000, 150000) nanometers."

chinasaur commented 1 year ago

We used the existing skeletonization of the h01 c3 segmentation. This was done via the improved TEASAR method implemented in the kimimaro package. Once you have skeletons, you can sample the nodes as appropriate for your dataset and then compute all the neighbors within 150 um and record their distances for bucketing purposes.

Technically it's not necessary to start from skeletons; you could also sample directly from the segmentation masks, but we found the skeletons convenient and it may be that biasing to sampling from the centerline of the objects helps.

google-research / connectomics

what's the meaning of the naming convention for TFRecord files #86