raulcf / SEEPng


Added caching capabilities to Datasets. #65

Closed WJCIV closed 8 years ago

WJCIV commented 8 years ago

Added the ability to write a dataset to disk (freeing up the associated memory used to store the content, but not the metadata), return a Dataset from disk to memory, and check if a Dataset is currently in memory (or conversely, on disk). When a Dataset is on disk subsequent additions are appended to the file containing the data rather than storing them in memory. All three functions are wrapped in the DataReferenceManager as well so changes can be made using the Dataset identifier instead of requiring a pointer directly to the Dataset.
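Roughly, the shape of the behavior described above could look like the following self-contained sketch (class and method names such as `sendToDisk`, `retrieveFromDisk`, and `inMemory` are illustrative, not the actual SEEPng signatures):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Toy Dataset: content may live in memory or in a backing file.
class Dataset {
    private final Path backingFile;
    private byte[] content;       // null while the data lives on disk
    private boolean inMemory = true;

    Dataset(byte[] content, Path backingFile) {
        this.content = content;
        this.backingFile = backingFile;
    }

    boolean inMemory() { return inMemory; }

    // Write the content to disk and free the in-memory copy (metadata stays).
    void sendToDisk() throws IOException {
        if (!inMemory) return;
        Files.write(backingFile, content);
        content = null;           // free the memory used by the data
        inMemory = false;
    }

    // Read the content back from disk into memory.
    void retrieveFromDisk() throws IOException {
        if (inMemory) return;
        content = Files.readAllBytes(backingFile);
        inMemory = true;
    }

    // While on disk, new records are appended to the file, not kept in memory.
    void write(byte[] record) throws IOException {
        if (inMemory) {
            byte[] merged = new byte[content.length + record.length];
            System.arraycopy(content, 0, merged, 0, content.length);
            System.arraycopy(record, 0, merged, content.length, record.length);
            content = merged;
        } else {
            Files.write(backingFile, record, StandardOpenOption.APPEND);
        }
    }

    byte[] content() { return content; }
}

public class Main {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("dataset", ".cache");
        Dataset d = new Dataset(new byte[]{1, 2}, file);
        d.sendToDisk();
        System.out.println(d.inMemory());        // false
        d.write(new byte[]{3});                  // appended to the file
        d.retrieveFromDisk();
        System.out.println(d.content().length);  // 3: appended data survives
        Files.deleteIfExists(file);
    }
}
```

A manager keyed by dataset identifier can then wrap these three calls, as the pull request does with DataReferenceManager.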

WJCIV commented 8 years ago

I'm not completely convinced this is ready for prime time just yet. This pull request is currently just a focal point for some conversation.

I've tried the following (by injecting code into the DataReferenceManager, which has since been pulled out):

- Send a Dataset to disk and never retrieve it. Make sure no data is processed (since it is never in memory) and the output file contains the data that was created.
- Send a Dataset to disk and retrieve it. Both commands were executed before data was written to disk. I will run one after data is written to disk tomorrow.
- Have a dummy file with cached data from a previous run. "Retrieve" this from disk and process it.

All tests above run as expected. Are there any other tests that make sense?

WJCIV commented 8 years ago

Successfully tested a write to disk after the Dataset was populated in memory.

raulcf commented 8 years ago

Great.

I think I like that the functions that move Datasets back and forth between disk and memory are in DataReferenceManager.

However, I think that the implementation of those methods should not be part of Dataset. Dataset should ideally be just a bunch of Buffers (keep in mind Dataset is at a very early stage, and it will change). One quick fix is to create a class that implements those methods; each method would receive a dataset, a target file, etc., and then your implementation applies as it is.

Now, this will help us to change how we move Datasets from memory to disk and the other way around. For example, what if we decide to use MemoryMappedFiles instead? This should not change anything in Dataset.

So, we would have a new class X. DataReferenceManager would hold a reference to this class, and when someone calls DataReferenceManager.sendDataSetToDisk(id), the implementation would do something like X.sendToDisk(dataset, file). Rough explanation, but I hope the idea comes across.

Also, what if we want to read data directly from disk and not from memory? By using X, we can keep the dataset on disk and prefetch data into the buffer as needed. Note: I am not suggesting we implement this now, only that with X it would be easy to do if we need it in the future.
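To make the delegation concrete, here is a minimal sketch of what "class X" and the DataReferenceManager wiring could look like. The name DiskCacher and the method signatures are assumptions for illustration; the point is that swapping in, say, memory-mapped files would only touch the cacher class, never Dataset itself:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a Dataset: just a byte[] payload here.
class Dataset {
    byte[] data;
    Dataset(byte[] data) { this.data = data; }
}

// "Class X": owns the move-to/from-disk implementation, so Dataset stays a
// plain container of buffers.
class DiskCacher {
    void sendToDisk(Dataset d, Path file) throws IOException {
        Files.write(file, d.data);
        d.data = null; // free the in-memory copy
    }
    void retrieveFromDisk(Dataset d, Path file) throws IOException {
        d.data = Files.readAllBytes(file);
    }
}

// DataReferenceManager keeps a reference to the cacher and exposes the
// operations by dataset identifier (hypothetical signatures).
class DataReferenceManager {
    private final Map<Integer, Dataset> datasets = new HashMap<>();
    private final Map<Integer, Path> files = new HashMap<>();
    private final DiskCacher cacher = new DiskCacher();

    void register(int id, Dataset d, Path file) {
        datasets.put(id, d);
        files.put(id, file);
    }
    void sendDatasetToDisk(int id) throws IOException {
        cacher.sendToDisk(datasets.get(id), files.get(id));
    }
    void retrieveDatasetFromDisk(int id) throws IOException {
        cacher.retrieveFromDisk(datasets.get(id), files.get(id));
    }
}
```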

WJCIV commented 8 years ago

A little bit of the functionality had to stay in the Dataset. Otherwise there is no way to append new data to something which was cached to disk (new data would stay in memory and jump the queue to the front). Depending on how thread-safe we want it to be, there may still be problems with data being reordered. There shouldn't be any problems with losing or duplicating data, though.
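The "jump the queue" concern can be shown with a toy example (record names are hypothetical): if a record arriving after the cache operation were buffered in memory, a retrieve that puts memory-resident data first would reorder the stream; appending to the file preserves arrival order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrderingDemo {
    public static void main(String[] args) {
        // Records r1, r2 were already cached to disk when r3 arrives.
        List<String> onDisk = new ArrayList<>(Arrays.asList("r1", "r2"));
        List<String> inMemory = new ArrayList<>();

        // Policy A (what this PR does): append r3 to the on-disk data.
        onDisk.add("r3");
        System.out.println(onDisk); // [r1, r2, r3] -- arrival order preserved

        // Policy B (rejected): keep r3 in memory. A retrieve that places
        // memory-resident data first lets r3 "jump the queue".
        inMemory.add("r3");
        List<String> retrieved = new ArrayList<>(inMemory);
        retrieved.addAll(Arrays.asList("r1", "r2"));
        System.out.println(retrieved); // [r3, r1, r2] -- reordered
    }
}
```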

raulcf commented 8 years ago

> Otherwise there is no way to append new data to something which was cached to disk (new data would stay in memory and jump the queue to the front).

OK, I understand.

I think it is very useful to think carefully about what exactly we need. For example, you made the assumption that we want to append to disk. There is a good chance we do want that when a dataset is meant to stay on disk, but what if we want finer granularity? In that case, which mechanism could we use so that the data in memory does not "jump the queue"? It's easy to fix, but only worth doing if we want to go that way.

On the one hand, I think we'll have a better picture once things are working and we can see the performance issues. On the other hand, I think it's worth thinking about this now so that we write the right abstractions to make later changes easier.

Let's use the next meeting to brainstorm all the possibilities here. Thanks!

WJCIV commented 8 years ago

My thinking is that when a Dataset is returned to memory you want it to retain the same properties as it would have had if it had never been sent to disk in the first place. So it should have the same records in the same order as an alternative version which never got cached. Currently there is a slight race condition with DiskCacher.sendToDisk and Dataset.write which can violate the ordering property, but given that we don't seem to worry about multithreading within Dataset accesses elsewhere I didn't worry about it here.

I also thought that if a Dataset was cached to disk then the entirety of the known data should reside there, regardless of whether the data arrived before or after the caching. Especially if we are using this to control memory usage it doesn't make a lot of sense to leave stuff hanging in memory and require another call to free up more memory.
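One way to close the sendToDisk/write race mentioned above would be to serialize both operations on the dataset's own lock, so a record cannot be buffered in memory while the cacher is concurrently flushing. A minimal sketch, assuming illustrative names (this is not the PR's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Both write() and sendToDisk() are synchronized on the same monitor, so the
// cache flush and an incoming record can never interleave and reorder data.
class CachedDataset {
    private final Path file;
    private byte[] data = new byte[0];
    private boolean onDisk = false;

    CachedDataset(Path file) { this.file = file; }

    synchronized void write(byte[] record) throws IOException {
        if (onDisk) {
            // Append to the file so arrival order is preserved.
            Files.write(file, record, StandardOpenOption.APPEND);
        } else {
            byte[] merged = new byte[data.length + record.length];
            System.arraycopy(data, 0, merged, 0, data.length);
            System.arraycopy(record, 0, merged, data.length, record.length);
            data = merged;
        }
    }

    synchronized void sendToDisk() throws IOException {
        if (onDisk) return;
        Files.write(file, data);
        data = new byte[0];
        onDisk = true;
    }
}
```

Since Dataset accesses elsewhere are not synchronized either, whether this locking is worth its cost is exactly the kind of question to settle once performance numbers exist.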

WJCIV commented 8 years ago

I've written the tests, but I haven't run them. Where is the file that actually runs all of the tests, so I can add this one?

raulcf commented 8 years ago

Running tests is not automated. I simply run them individually as needed. I have all of them as JUnit tests (except for the integration tests), and that works pretty well. The only important bit is that they should live in the tests directory (being JUnit tests would also help) so that we can automate this with Gradle once we have a stable version.
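A round-trip check like the one discussed in this thread could live in the tests directory; sketched here with plain asserts instead of JUnit so it runs standalone with `java -ea` (the class name and helpers are hypothetical):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Standalone round-trip check: a payload cached to disk must come back
// byte-for-byte identical. Run with `java -ea` to enable assertions.
public class CacheRoundTripTest {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("dataset", ".cache");
        byte[] payload = {1, 2, 3, 4};

        Files.write(file, payload);                 // stands in for sendToDisk
        byte[] restored = Files.readAllBytes(file); // stands in for retrieveFromDisk

        assert Arrays.equals(payload, restored) : "round trip lost data";
        Files.deleteIfExists(file);
        System.out.println("round trip ok");
    }
}
```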

WJCIV commented 8 years ago

I think this is now ready to merge.

raulcf commented 8 years ago

This has now been merged (after solving minor conflicts), thanks!