jcheong0428 / py-pat

Pose Analysis Toolbox

A lightweight data structure? #1

Open · meng-du opened this issue 5 years ago

meng-du commented 5 years ago

I'm working on identifying/matching the same person across different frames, so I'd need to load many frames at a time, which makes the pandas df seem a bit too heavyweight to work with. It carries a lot of overhead that takes up most of the memory but isn't that useful, because of the duplicated data (filename, frame #, etc. are duplicated 75 times for each frame). When analyzing across frames I feel like the most useful information could live in just a 3D numpy array, where the 3 dimensions are frames, people in each frame, and keypoints for each person. The ID info (person_id, pose_keypoints_2d and frame) could just be translated to array indices, and wouldn't need to take up any space. I think removing the duplicated columns would cut roughly 80% of the memory overhead of the pandas df...

I was thinking perhaps we could just load the json into a pandas df initially, and then, after matching people across frames, store the data in a numpy array instead of a pandas df?

I'm not sure if using a 3D array is the best way though, because the shape is fixed. We'd then need a slot for every possible person in every frame, even if that person didn't appear in the frame (maybe it would work if we just put NaN in those cells?).
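Just to make the NaN idea concrete, here's a rough sketch of what I mean (to_3d_array and frames_data are made-up names for illustration; frames_data would be whatever comes out of the json/matching step, one dict per frame mapping person id to its 75 keypoint values):

import numpy as np

def to_3d_array(frames_data, num_keypoints=75):
    # made-up helper: frames_data is a list with one dict per frame, mapping person id -> 75 values
    num_frames = len(frames_data)
    num_people = max((max(frame.keys()) for frame in frames_data if frame), default=-1) + 1
    arr = np.full((num_frames, num_people, num_keypoints), np.nan)  # NaN = person absent in that frame
    for f, frame in enumerate(frames_data):
        for person_id, keypoints in frame.items():
            arr[f, person_id, :] = keypoints
    return arr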

I was also thinking that if this is to be developed into a package (not sure about that, but just pretending it's a real package instead of a hackathon project for now), it would be convenient to let users save intermediate results (e.g. the results after loading the json and matching people across frames), so that they don't have to redo the same steps every time before starting to analyze the data (I'm assuming matching people across frames could take a while). So it would also be nice to have small-ish file sizes to read and write.
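For the intermediate results, something as simple as a compressed .npz might already do (just a sketch; matched and the filename are placeholders for whatever the matching step produces):

import numpy as np

# cache the matched (frames x people x keypoints) array once...
np.savez_compressed('output/sherlock_matched.npz', pose=matched)
# ...and reload it later without redoing the matching
matched = np.load('output/sherlock_matched.npz')['pose']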

Just feel like it's nice to decide how the data is structured around the beginning of the project because all functions in the package would then need to manipulate the same data structure.. (and using pandas to work across frames seems a bit hard..? 😛)

jcheong0428 commented 5 years ago

These are great points! I totally agree that we should think about this more carefully, perhaps this afternoon.

For clarity, currently there are two ways of representing the data. 1) Keypoints format: This is the super inefficient representation you pointed out, in which the data is in a long format where filename, frame, and key are duplicated. Initially I had it load the data this way so users can load the entire json, which includes not just the "pose_keypoints_2d" data but also the "face keypoints", "3d keypoints", and "hand keypoints", which could be of later use. However, this is kinda moot b/c the current data we have from Sherlock only has "pose_keypoints_2d".

2) pose_2d format: This represents the data in (Frame x 75 pose keypoints) format with [Frame, personID] in the index. This is closer to what you are suggesting, and more efficient when you are just representing the pose_2d data.
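(For reference, the pivot that produces this layout is roughly the following; this is just a sketch assuming the long-format columns are ['fname', 'frame', 'key', 'keyID', 'personID', 'value'] as in the snippet below, not the actual grab_person_pose implementation.)

# rough sketch of the pivot behind the pose_2d format (not the actual .pat accessor code);
# df_pose_long is the long format, already filtered down to the pose_keypoints_2d rows
df_pose_2d = df_pose_long.pivot_table(index=['frame', 'personID'],
                                      columns='keyID',
                                      values='value')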

Here is a comparison of how the in-memory sizes change:

import pandas as pd, numpy as np
from sys import getsizeof

def getsizeof_mb(var):
    # in-memory size in MB, rounded to 2 decimals
    return np.round(getsizeof(var) / 1e+6, 2)

new_df_fname = 'output/Sherlock.csv'
col_names = ['fname', 'frame', 'key', 'keyID', 'personID', 'value']

# 1) long format: everything from the json, one row per value
df_long = pd.read_csv(new_df_fname, header=None, names=col_names)
print('df_long', df_long.shape, getsizeof_mb(df_long))

# still long format, but only the pose rows
df_pose_long = df_long.pat.grab_pose()
print('df_pose_long', df_pose_long.shape, getsizeof_mb(df_pose_long))

# 2) pose_2d format: (frame x 75 keypoints), with [frame, personID] in the index
df_pose_2d = df_pose_long.pat.grab_person_pose()
print('df_pose_2d', df_pose_2d.shape, getsizeof_mb(df_pose_2d))

# 3D implementation: (max frames any one person appears in) x 75 keypoints x number of people
max_frame = int(df_pose_2d.groupby('personID').count().max()[0])
num_people = int(len(np.unique([ix[1] for ix in df_pose_2d.index])))
np3d = np.empty((max_frame, 75, num_people))
print('3D numpy', np3d.shape, getsizeof_mb(np3d))
Format type     Shape            Size in MB
df_long         (202616, 6)      56.89
df_pose_long    (199950, 6)      57.79
df_pose_2d      (2666, 75)       1.62
3D numpy        (1579, 75, 7)    6.63

df_long is when the json is blindly loaded in the long format, df_pose_long is when you just grab the pose_2d data (but still in long format), df_pose_2d is when you pivot and represent it as (frame x 75 keypoints), and lastly 3D numpy is when you represent the full range of frames for all people. Note the 3D array ends up larger than df_pose_2d because it allocates a cell for every (frame, person, keypoint) combination, whether or not that person actually appears in that frame.

Overall, I wonder if the solution would be to add another loading function for each variable (e.g. load_pose_keypoints_2d, load_hand_left_keypoints_2d), in which you can simply extract the keypoints of interest straight from the jsons. We could also parallelize loading to speed things up.
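Something along these lines, maybe (just a sketch, assuming OpenPose-style jsons where each file has a "people" list and each person dict carries the requested key; load_keypoints and _load_one are made-up names):

import json, glob
from multiprocessing import Pool

def _load_one(args):
    fname, key = args
    with open(fname) as f:
        data = json.load(f)
    # one entry per person detected in this frame, keeping only the requested keypoints
    return [person[key] for person in data.get('people', [])]

def load_keypoints(json_dir, key='pose_keypoints_2d', n_jobs=4):
    fnames = sorted(glob.glob(json_dir + '/*.json'))
    with Pool(n_jobs) as pool:  # parallelize across frames
        frames = pool.map(_load_one, [(f, key) for f in fnames])
    return frames  # list of frames, each a list of per-person keypoint lists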

Let's definitely chat more about this and other suggestions.

catisf commented 5 years ago

Super relevant points indeed! I had similar questions about the best way to store the output of the functions I wrote, and I think that also largely depends on what format the information is in to start with, so definitely happy to discuss it this afternoon!

meng-du commented 5 years ago

Another alternative is using a sparse matrix (not really sure how it works, just remembered seeing neurosynth using this)... Also found an introduction here
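For what it's worth, a toy sketch of the sparse idea with scipy (assuming missing people are stored as zeros, since sparse formats only save space on zeros, and scipy sparse matrices are 2D, so people x keypoints get flattened into the columns):

import numpy as np
from scipy import sparse

# toy example: 1000 frames, up to 7 people, 75 keypoints, only one person actually present
arr = np.zeros((1000, 7, 75))
arr[:, 0, :] = 1.0
sp = sparse.csr_matrix(arr.reshape(arr.shape[0], -1))  # 2D: (frames, people * 75)
print(arr.nbytes, sp.data.nbytes)  # dense bytes vs bytes actually stored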