understanding-search / maze-dataset

maze datasets for investigating OOD behavior of ML systems
faster saving & loading of maze datasets #26

mivanit closed this 8 months ago

aaron-sandoval commented 9 months ago

@mivanit, you included the comment in serialize_minimal: "ensure that we run the metadata collection filter, since we throw it out per maze". How do you run this filter? Does it just mean that we can get rid of the generation_metadata_collected item in the dict? Then in load_minimal we would just pass generation_metadata_collected=None to the constructor?

mivanit commented 9 months ago

> @mivanit, you included the comment in serialize_minimal: "ensure that we run the metadata collection filter, since we throw it out per maze". How do you run this filter? Does it just mean that we can get rid of the generation_metadata_collected item in the dict? Then in load_minimal we would just pass generation_metadata_collected=None to the constructor?

You should just need to run:

dataset_with_meta = dataset.filter_by.collect_generation_meta()

By default, clear_in_mazes: bool = True, so all metadata will be stripped from the individual mazes. You can see this function used in the "metadata" section of demo_dataset.ipynb.
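
To make the effect of clear_in_mazes concrete, here is a minimal sketch of the idea only. It is not the maze-dataset implementation: ToyMaze, ToyDataset, and this standalone collect_generation_meta are hypothetical stand-ins, and the real filter may aggregate the metadata differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyMaze:
    # Hypothetical stand-in for a maze; only its generation metadata matters here.
    generation_meta: Optional[dict[str, Any]] = None


@dataclass
class ToyDataset:
    # Hypothetical stand-in for a dataset holding mazes plus a collected summary.
    mazes: list[ToyMaze]
    generation_metadata_collected: Optional[dict[str, Counter]] = None


def collect_generation_meta(
    dataset: ToyDataset, clear_in_mazes: bool = True
) -> ToyDataset:
    """Aggregate per-maze metadata into one dataset-level summary.

    Assumes metadata values are hashable so they can be counted.
    """
    collected: dict[str, Counter] = {}
    for maze in dataset.mazes:
        for key, value in (maze.generation_meta or {}).items():
            collected.setdefault(key, Counter())[value] += 1

    # Build new mazes so the input dataset is left unmodified.
    new_mazes = [
        ToyMaze(generation_meta=None if clear_in_mazes else maze.generation_meta)
        for maze in dataset.mazes
    ]
    return ToyDataset(mazes=new_mazes, generation_metadata_collected=collected)
```

Under these assumptions, a serializer that drops per-maze metadata loses nothing as long as the collection step has run first, because the summary lives on the dataset rather than on each maze.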

aaron-sandoval commented 8 months ago

@mivanit Should be ready to merge. A couple of details I want to highlight for your review:

  1. You requested I add a check in MazeDataset.collect_generation_meta to make the function idempotent. All I added was a check for whether dataset.generation_metadata_collected is not None. Is this sufficient?
  2. Earlier you said you'd like to include profiling of serialize_minimal excluding the time to collect generation metadata. In light of the encouraging profiling results including metadata collection, I didn't implement this because it didn't seem to be very valuable. But if you'd still like to see that happen, I can do so.
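
The guard described in item 1 can be sketched as follows. This is a toy model, not the code in the PR: ToyDataset and collect_once are hypothetical names, and the sketch assumes that a dataset which has already been collected always holds a dict (possibly empty) rather than None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDataset:
    # Per-maze metadata dicts; entries become None once they are cleared.
    mazes_meta: list
    generation_metadata_collected: Optional[dict] = None


def collect_once(dataset: ToyDataset) -> ToyDataset:
    # Idempotency guard: if metadata was already collected, return unchanged.
    if dataset.generation_metadata_collected is not None:
        return dataset

    collected: dict = {}
    for meta in dataset.mazes_meta:
        for key, value in (meta or {}).items():
            collected.setdefault(key, []).append(value)

    # Per-maze metadata is cleared after collection.
    return ToyDataset([None] * len(dataset.mazes_meta), collected)
```

Without the guard, a second call in this toy model would iterate over the cleared per-maze entries and replace the summary with an empty dict, which is the failure the check is meant to prevent.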

afspies commented 8 months ago

Awesome work! <3

aaron-sandoval commented 8 months ago

Interesting timing results in profile_dataset_save_read.ipynb! They look quite different than those in my last commit 3d9500c, and not only because you added way more detail and breadth. I'm guessing I missed some factors when doing the timing that were making the speedups with the new methods look bigger than they really were. I'll have to pick your brain on this at our next meeting.