Strided dataset feature branch

Diego-Llanes commented 4 months ago

This pull request adds the StridedDataset feature. The only file with additions or changes is src/neuromancer/dataset.py. All class parameters are documented in docstrings.

EDIT: There is now also an example documenting the use case for this new feature.

Contributors:

@Seth1Briney
@HarryLTS
@Diego-Llanes

madelynshapiro commented 4 months ago

@Diego-Llanes would you please upload a small example (can be a modified snippet from one of your full notebooks) demonstrating the usage to accompany the PR?

Diego-Llanes commented 4 months ago

Just added!

RBirmiwal commented 4 months ago

strided_dataset_rahul_test.ipynb.zip

I agree with Jan. In addition, I have played around with the strided dataset to check that it works, and I have attached a zipped notebook to this comment. I am still confused on a few points:

In DictDataset, say D := a DictDataset, I can do D['X'] --> 3D tensor
In StridedDataset, say S := a StridedDataset, I cannot do S['X']; I have to do S[idx]['X'] --> 2D tensors

While the user can figure out the proper reshaping and data creation, this is an inconsistency in our API.

What if instead we had S['X'] --> list of subsequence tensors for X?
Also, the tensor X is 2D, not 3D.
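
To make the proposal above concrete, here is a minimal sketch of a dataset that accepts both an integer index and a string key. It is not the StridedDataset from this PR: the class name `StridedWindows`, its constructor arguments, and the use of nested Python lists in place of tensors are all assumptions made so the snippet runs without dependencies.

```python
class StridedWindows:
    """Hypothetical stand-in for a strided dataset (not neuromancer's class)."""

    def __init__(self, data, L, stride=1):
        # data: dict mapping key -> 2D sequence indexed as [time][feature]
        lengths = {len(v) for v in data.values()}
        if len(lengths) != 1:
            raise ValueError("all keys must share the same number of time steps")
        T = lengths.pop()
        if L > T:
            raise ValueError(f"window length L={L} exceeds sequence length T={T}")
        self.data = data
        self.L = L
        # Start index of every window of length L taken every `stride` steps.
        self.starts = list(range(0, T - L + 1, stride))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        if isinstance(idx, str):
            # Proposed behaviour: S['X'] -> list of all subsequences for that key.
            seq = self.data[idx]
            return [seq[s:s + self.L] for s in self.starts]
        # Behaviour described above: S[i] -> dict of 2D windows.
        s = self.starts[idx]
        return {k: v[s:s + self.L] for k, v in self.data.items()}


X = [[float(t), float(t) * 10] for t in range(6)]  # T = 6 steps, 2 features
S = StridedWindows({"X": X}, L=3, stride=2)
```

With this, `S[1]['X']` and `S['X'][1]` return the same window, so code written against the DictDataset-style key lookup would only need to handle a list of windows instead of a single 3D tensor.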

I attempted the Neural ODE example with StridedDataset. For L >= nsteps it appears to train, but the performance is not as good as with the standard DictDataset.

For L < nsteps, it breaks.

I like this functionality, but we need to ensure it works and performs on par for all the neuromancer use cases, not just Farama DPC. If there are situations where the StridedDataset will fail (as seen when L < nsteps), then those need to be handled appropriately.
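
One way the L < nsteps case could be handled is to fail early with a clear message instead of breaking during training. The sketch below assumes a helper that computes window start indices. The function name `window_starts` and its parameters are hypothetical and do not come from the PR.

```python
def window_starts(T, L, stride, nsteps):
    """Return window start indices, failing early if windows are too short.

    T: number of time steps in the sequence
    L: window (subsequence) length
    stride: step between consecutive window starts
    nsteps: rollout horizon the model expects from each window
    """
    if stride < 1:
        raise ValueError(f"stride must be >= 1, got {stride}")
    if L < nsteps:
        raise ValueError(
            f"window length L={L} is shorter than nsteps={nsteps}; "
            "each window must cover at least one full rollout"
        )
    if L > T:
        raise ValueError(f"window length L={L} exceeds sequence length T={T}")
    return list(range(0, T - L + 1, stride))


starts = window_starts(T=10, L=4, stride=3, nsteps=4)
```

Here `starts` is `[0, 3, 6]`, while calling the function with `L=2, nsteps=4` raises a `ValueError` at construction time.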

RBirmiwal commented 4 months ago

I would also like an example of a non-trivial update_fn. Right now it is:

def update_initial_condition(d):
    # Copy the first time step of "xn" (kept 2D by the 0:1 slice) into a new key.
    d["xn_2"] = d["xn"][0:1, :]
    return d

for the case of the neural ODE, where our keys are "X" and "xn". I don't understand what role "xn_2" would play or how it would be used. In what cases would this be necessary?
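
For comparison, here is one guess at an update_fn that does something the model could actually consume: it derives the initial condition "xn" from the first step of each "X" window. The dict-in, dict-out signature follows the snippet above. Whether StridedDataset calls update_fn once per sample, and the function name itself, are assumptions. Nested lists stand in for tensors.

```python
def set_initial_condition_from_window(d):
    """Hypothetical update_fn: take the initial state from the window itself."""
    out = dict(d)            # shallow copy so the input sample is not mutated
    out["xn"] = d["X"][0:1]  # first time step, kept 2D: shape [1, nx]
    return out


sample = {"X": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}
updated = set_initial_condition_from_window(sample)
```

After the call, `updated["xn"]` holds the first row of the window and `sample` is left unchanged.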

RBirmiwal commented 3 months ago

@Diego-Llanes, I am closing this for now. If you have the bandwidth to fix the above issues and design choices, please re-create the PR. Thank you.

pnnl / neuromancer

Strided dataset feature branch #162