kyleabeauchamp opened this issue 11 years ago
Can you point to your favorite tutorials or documentation on the pandas Series object?
To do this with a 1D pandas Series of assignments, it might look like this:
from_states = ass
to_states = ass.shift(-1)
to_states looks like this:
In [82]: to_states
Out[82]:
0 2518
1 2997
2 2188
...
22235 67
22236 13
22237 12
22238 NaN
Length: 22239, dtype: float64
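For reference, shift(-1) reproduces this behavior in miniature on a toy Series (the values below are just the first few assignments from the output above):

```python
import pandas as pd

# Toy Series using the first few assignment values shown above.
ass = pd.Series([2195, 2518, 2997, 2188])
from_states = ass
to_states = ass.shift(-1)

# shift(-1) moves each value up one slot; the final slot has no successor,
# so it becomes NaN, and that NaN forces a cast from int64 to float64.
print(to_states.tolist())  # [2518.0, 2997.0, 2188.0, nan]
print(to_states.dtype)     # float64
```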
Can you outline the operation to get counts from from_states and to_states? Since it's an ndarray subclass, you can still do np.where on it...
So here is the equivalent of transitions = np.row_stack((from_states, to_states))
transitions = pd.DataFrame([from_states, to_states]).dropna(axis=1)
In [102]: transitions.values
Out[102]:
array([[ 2195., 2518., 2997., ..., 139., 67., 13.],
[ 2518., 2997., 2188., ..., 67., 13., 12.]])
On my machine, it appears that you can do exactly transitions = np.row_stack((from_states, to_states)) with from_states and to_states as Series.
One key change is that dropna(axis=1) deals with the NA values; using row_stack doesn't deal with the NA.
In [104]: row_stack((from_states, to_states))
Out[104]:
array([[ 2195., 2518., 2997., ..., 67., 13., 12.],
[ 2518., 2997., 2188., ..., 13., 12., nan]])
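The difference is easy to see on a toy example (values invented for illustration; np.vstack is used here, which row_stack aliases):

```python
import numpy as np
import pandas as pd

ass = pd.Series([2195, 2518, 2997])
from_states = ass
to_states = ass.shift(-1)

# Stacking via DataFrame and dropping NaN columns removes the hanging tail:
transitions = pd.DataFrame([from_states, to_states]).dropna(axis=1)

# Plain stacking keeps the trailing NaN from the shift:
raw = np.vstack((from_states, to_states))

print(transitions.values)  # 2 x 2, no NaN
print(raw)                 # 2 x 3, NaN in the last column
```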
Oh, I see.
Okay. I'm sold. I think assignments should be a list of Series. @schwancr ?
I realize that this does mean introducing a new dependency, but fuckit.
Comes with EPD and Anaconda, so it's not really new...
Still. Not to be taken lightly. Not everyone is using those, even if we recommend it.
What do we gain from using pandas.Series? Is it just the timestamp?
Just the timestamp and methods associated with time shifts etc. We should just play with it and see if it's useful to us. I'm still not 100% sold.
Could we instead do a single pandas array with: traj, frame, time, assignment?
Then we don't need a list of them
We could create an object like that and use DataFrame.pivot_table() to reindex things as needed.
What you gain is the index, so you're adding some semantic meaning to the difference between assignments[i] and assignments[i+1] -- like, how far is "one index unit". And then when you timeshift them and so on, this data can be used.
Could we instead do a single pandas array with: traj, frame, time, assignment?
huh?
You can have a DataFrame with arbitrary columns. So could we just store a single DataFrame where each entry has a timestamp, trajectory, and assignment?
The idea of pivot_table is to take some "data" columns and convert them into the "index" for the object. I've used it in the context of dealing with NMR data.
For example, my NMR data is a CSV file with a lists of residue ID, atom name, and value. I use pivot_table to convert this object into a single 1D series that is indexed by the tuples (residue_ID, atom_name).
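A minimal sketch of that pivot (the data and column names here are invented; note that modern pandas spells the row-index argument index= where older versions used rows=):

```python
import pandas as pd

# Hypothetical NMR-style table: residue ID, atom name, and a measured value.
nmr = pd.DataFrame({
    "residue_ID": [1, 1, 2],
    "atom_name": ["CA", "CB", "CA"],
    "value": [56.2, 30.1, 58.9],
})

# pivot_table moves the "data" columns into the index, giving a 1D Series
# indexed by (residue_ID, atom_name) tuples.
shifts = nmr.pivot_table(index=["residue_ID", "atom_name"])["value"]
print(shifts.loc[(1, "CA")])  # 56.2
```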
We can then combine this with a groupby operation that will allow us to iterate over the trajectories one by one -- even though the actual DataFrame is not indexed by trajectory.
I def think this is worth trying.
groupby is the key here.
Someone should just try to implement something. Then we can see how it works, for realz.
I am warming to @schwancr's idea that it's an n_snapshots x 3 table where the columns are 'time', 'trajectory_index', 'state_assignment'. That way, if you imagine a situation where you are comparing two MSMs or something and have a dual state decomposition, you could just add a fourth column. I think it's more flexible.
And having the pd.DataFrame, so that these columns are labeled with titles -- that's critical, since each column is semantically so different.
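A sketch of that layout (column names as proposed above; the numbers are invented):

```python
import pandas as pd

# n_snapshots x 3 table: one row per snapshot, with labeled columns.
assignments = pd.DataFrame({
    "time": [0, 1, 2, 0, 1, 2],
    "trajectory_index": [0, 0, 0, 1, 1, 1],
    "state_assignment": [2195, 2518, 2997, 715, 2250, 2250],
})

# Comparing two MSMs just means adding a fourth labeled column:
assignments["state_assignment_msm2"] = [0, 1, 1, 0, 2, 2]
print(assignments.shape)  # (6, 4)
```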
Here's a half-working prototype. I still need to pivot the ass_i to use the timestamps as index, but it's otherwise working:
data = []
num_traj, traj_length = a.shape
for i in range(num_traj):
    for j, state in enumerate(a[i]):
        if state == -1:
            state = np.nan
        data.append([i, j, state])

X = pd.DataFrame(data, columns=["traj", "time", "state"])

for (k, ass_i) in X.groupby("traj"):
    print(ass_i["state"])
0 NaN
1 2195
2 2518
3 2997
4 2188
...
22237 67
22238 13
22239 12
Name: state, Length: 22240, dtype: float64
22240 NaN
22241 NaN
22242 NaN
22243 NaN
22244 715
22245 2250
...
44477 NaN
44478 NaN
44479 NaN
Name: state, Length: 22240, dtype: float64
OK, so one problem with my current approach is that integer Series are incompatible with NaNs (same as NumPy). Thus, we can't just throw NaNs in.
I don't think this should be a problem, though.
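The cast is easy to demonstrate (this is standard pandas/NumPy behavior, not anything specific to this code):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# Introducing a single NaN forces the whole Series to float:
s2 = pd.Series([1, 2, np.nan])
print(s2.dtype)  # float64
```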
Here's the updated version to index the individual trajectory assignments by their timestamps:
X = pd.DataFrame(data, columns=["traj", "time", "state"])

for (k, ass_i) in X.groupby("traj"):
    ass_i = ass_i.pivot_table(rows=["time"])["state"]
    print(ass_i)
So this is how we do the iteration over trajectories. It's slightly more complicated, but we've eliminated all the -1 logic, and we now have a way to automatically handle end cases when timeshifting.
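As a sketch of where this is heading, the per-trajectory timeshift plus dropna gives transition counts directly (toy data; groupby(...).shift keeps the shift from bleeding across trajectory boundaries):

```python
import pandas as pd

# Toy assignments table in the traj/time/state layout used above.
X = pd.DataFrame({
    "traj":  [0, 0, 0, 1, 1, 1],
    "time":  [0, 1, 2, 0, 1, 2],
    "state": [0, 1, 1, 1, 0, 0],
})

# Shift within each trajectory so the last frame of one trajectory is
# never paired with the first frame of the next; the trailing NaN per
# trajectory is then dropped instead of needing any -1 logic.
pairs = pd.DataFrame({
    "from_state": X["state"],
    "to_state": X.groupby("traj")["state"].shift(-1),
}).dropna()

counts = pairs.groupby(["from_state", "to_state"]).size()
print(counts)
```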
Sigh, there's still an issue with the NaN. When we do the time-shifting, there's an automatic cast to float:
ass_i.shift(-1)
[...]
Name: state, Length: 21436, dtype: float64
@kyleabeauchamp: do you want to take this through to a pull request?
I'm trying to stay PR-free for the next two weeks.
You already did most of the work :)
OK, maybe.
Do you have to store the NaNs at all? Or can you skip them in the DataFrame? Maybe this causes issues, but I thought that was the power behind storing the timestep.
So we skip them in the dataframe, but they get created when we do the timestamp slicing.
For example, when we apply the timeshift operation (to move the lagtime forward), the values at the end of the Series get set to NAN (because there is no future value for the final assignment entry).
Hopefully we can just go back and forth between int and float without difficulty.
Generally, no NaNs are stored. But when you do the forward or backwards index shift to line up two Series at an offset, you get a few NaNs at the end to describe the "hanging tails" where there's no corresponding data between the two Series. But there are no NaNs corresponding to the gratuitous extra -1s.
The current code for building counts looks like this:
get_counts_from_traj: