msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0
25 stars 28 forks source link

Use Pandas for Assignments? #205

Open kyleabeauchamp opened 11 years ago

kyleabeauchamp commented 11 years ago

The current code for building counts looks like this:

    for A in assignments:
        FirstEntry = np.where(A != -1)[0]
        # New Code by KAB to skip pre-padded negative ones.
        # This should solve issues with Tarjan trimming results.
        if len(FirstEntry) >= 1:
            FirstEntry = FirstEntry[0]
            A = A[FirstEntry:]
            C = C + get_counts_from_traj(A, n_states, lag_time=lag_time, sliding_window=sliding_window)  # .tolil()

get_counts_from_traj:

    try:
        C = scipy.sparse.coo_matrix((counts, transitions),
                                    shape=(n_states, n_states))
    except ValueError:
        # Lutz: if we arrive here, there was probably a state with index -1
        # we try to fix it by ignoring transitions in and out of those states
        # (we set both the count and the indices for those transitions to 0)
        mask = transitions < 0
        counts[mask[0, :] | mask[1, :]] = 0
        transitions[mask] = 0
        C = scipy.sparse.coo_matrix((counts, transitions),
                                    shape=(n_states, n_states))
rmcgibbo commented 11 years ago

Can you point to your favorite tutorials or documentation on the pandas Series object?

kyleabeauchamp commented 11 years ago

http://pandas.pydata.org/pandas-docs/dev/dsintro.html

kyleabeauchamp commented 11 years ago

To do this with a 1D pandas Series of assignments, it might look like this:

from_states = ass
to_states = ass.shift(-1)

to_states looks like this:

In [82]: to_states
Out[82]: 
0     2518
1     2997
2     2188
...
22235     67
22236     13
22237     12
22238    NaN
Length: 22239, dtype: float64
rmcgibbo commented 11 years ago

Can you outline the operation to get counts from from_states and to_states? Since it's an ndarray subclass, you can still do np.where on it...

kyleabeauchamp commented 11 years ago

So here is the equivalent of transitions = np.row_stack((from_states, to_states))

transitions = pd.DataFrame([from_states, to_states]).dropna(axis=1)
In [102]: transitions.values
Out[102]: 
array([[ 2195.,  2518.,  2997., ...,   139.,    67.,    13.],
       [ 2518.,  2997.,  2188., ...,    67.,    13.,    12.]])
rmcgibbo commented 11 years ago

On my machine, it appears that you can exactly do transitions = np.row_stack((from_states, to_states)) with from_states and to_states as Series.

kyleabeauchamp commented 11 years ago

One key change is that dropna(axis=1) deals with the NA values. Using the row_stack doesn't deal with the NA.

kyleabeauchamp commented 11 years ago
In [104]: row_stack((from_states, to_states))
Out[104]: 
array([[ 2195.,  2518.,  2997., ...,    67.,    13.,    12.],
       [ 2518.,  2997.,  2188., ...,    13.,    12.,    nan]])
rmcgibbo commented 11 years ago

Oh, I see.

rmcgibbo commented 11 years ago

Okay. I'm sold. I think assignments should be a list of Series. @schwancr ?

rmcgibbo commented 11 years ago

I realize that this does mean introducing a new dependency, but fuckit.

kyleabeauchamp commented 11 years ago

Comes with EPD and Anaconda, so it's not really new...

rmcgibbo commented 11 years ago

Still. Not to be taken lightly. Not everyone is using those, even if we recommend it.

schwancr commented 11 years ago

What do we gain from using pandas.Series? Is it just the timestamp?

kyleabeauchamp commented 11 years ago

Just the timestamp and methods associated with time shifts etc. We should just play with it and see if it's useful to us. I'm still not 100% sold.

schwancr commented 11 years ago

Could we instead do a single pandas array with: traj, frame, time, assignment?

Then we don't need a list of them

kyleabeauchamp commented 11 years ago

We could create an object like that and use DataFrame.pivot_table() to reindex things as needed.

rmcgibbo commented 11 years ago

What you gain is the index, so you're adding some semantic meaning to the difference between assignment[i] and assignments[i+1]. Like, how far is "one index unit". And then when you timeshift them and stuff, this data can be used.

Could we instead do a single pandas array with: traj, frame, time, assignment?

huh?

schwancr commented 11 years ago

You can have a data frame with arbitrary columns. So could we just store a single data frame where each entry has time stamp trajectory and assignment

kyleabeauchamp commented 11 years ago

The idea of pivot_table is to take some "data" columns and convert them into the "index" for the object. I've used it in the context of dealing with NMR data.

For example, my NMR data is a CSV file with a lists of residue ID, atom name, and value. I use pivot_table to convert this object into a single 1D series that is indexed by the tuples (residue_ID, atom_name).

We can then combine this with a group_by operation that will allow us to iterate over the trajectories one by one--even though the actual data_frame is not indexed by trajectory.

kyleabeauchamp commented 11 years ago

I def think this is worth trying.

kyleabeauchamp commented 11 years ago

group_by is the key here.

rmcgibbo commented 11 years ago

Someone should just try to implement something. Then we can see how it works, for realz.

rmcgibbo commented 11 years ago

I am warming to @schwancr's idea that it's an n_snapshots x 3 table where the columns are 'time', 'trajectory_index', 'state_assignment'. That way, if you imagine situations where you are like comparing two MSMs or something and you have dual state decomposition, you could just add a fourth column. I think it's more flexible.

rmcgibbo commented 11 years ago

And having the pd.DataFrame, so that these columns are labeled with titles -- that's critical, since each column is semantically so different.

kyleabeauchamp commented 11 years ago

Here's a half-working prototype. I still need to pivot the ass_i to use the timestamps as index, but it's otherwise working:

data = []
num_traj, traj_length = a.shape
for i in range(num_traj):
    for j, state in enumerate(a[i]):
        if state == -1:
            state = np.nan
        data.append([i, j, state])

X = pd.DataFrame(data, columns=["traj","time","state"])
X["state"]
for (k, ass_i) in X.groupby("traj"):
    print(ass_i["state"])
0      NaN
1     2195
2     2518
3     2997
4     2188
...
22237     67
22238     13
22239     12
Name: state, Length: 22240, dtype: float64
22240     NaN
22241     NaN
22242     NaN
22243     NaN
22244     715
22245    2250
...
44477   NaN
44478   NaN
44479   NaN
Name: state, Length: 22240, dtype: float64
kyleabeauchamp commented 11 years ago

OK, so one problem with my current approach is that integer series are incompatible with NANs (same as Numpy). Thus, we can't just throw NANs in.

kyleabeauchamp commented 11 years ago

I don't think this should be a problem, though.

kyleabeauchamp commented 11 years ago

Here's the updated version to index the individual trajectory assignments by their timestamps:

X = pd.DataFrame(data, columns=["traj","time","state"])
X["state"]
for (k, ass_i) in X.groupby("traj"):
    ass_i = ass_i.pivot_table(rows=["time"])["state"]
    print(ass_i)
kyleabeauchamp commented 11 years ago

So this is how we do the iteration over trajectories. It's slightly more complicated, but we've eliminated all the -1 logic, and we now have a way to automatically handle end cases when timeshifting.

kyleabeauchamp commented 11 years ago

Sigh, there's still an issue with the NAN. When we do the time-shifting, there's a automatic cast to float:

ass_i.shift(-1)
[...]
Name: state, Length: 21436, dtype: float64
rmcgibbo commented 11 years ago

@kyleabeauchamp: do you want to take this through to a pull request?

kyleabeauchamp commented 11 years ago

I'm trying to stay PR-free for the next two weeks.

rmcgibbo commented 11 years ago

You already did most of the work :)

kyleabeauchamp commented 11 years ago

OK, maybe.

schwancr commented 11 years ago

Do you have to store the NaN's at all? Or can you skip them in the dataframe? Maybe this causes issues, but I thought that was the power behind storing the timestep.

kyleabeauchamp commented 11 years ago

So we skip them in the dataframe, but they get created when we do the timestamp slicing.

For example, when we apply the timeshift operation (to move the lagtime forward), the values at the end of the Series get set to NAN (because there is no future value for the final assignment entry).

Hopefully we can just go back and forth between int and float without difficulty.

rmcgibbo commented 11 years ago

Generally, no NaNs are stored. But when you do the forward or backwards index shift to line up two series at an offset, you get a few nans at the end to describe the "hanging tails" where there's no corresponding data between the two series. But there are no NaNs corresponding to the gratuituous extra -1s.