Open sberryman opened 6 years ago

I came across your article on arXiv today and the timing is incredible. I've been working on capturing a large dataset for analyzing human activity based on skeletons estimated from video frames. My goal is to perform activity recognition and re-identification. This is a project for fun and a way for me to learn more about DNNs. Below is a summary of the dataset I've gathered so far.

So far I have 14 days of security camera video from 2 cameras, recorded at 3264×1836 and 30 fps, with every frame undistorted during daylight hours (and the ability to record any number of additional days). I'm in the process of running those frames through OpenPose using its maximum-accuracy configuration. Currently I'm only going to process 3 days of video from each camera, which equates to 7,052,703 frames and ~10 TB of frames stored as JPEG images. I have about 7-8 days left of OpenPose inference.

In the past I've sampled the same video at 2 fps and reduced resolution, and manually labeled the data by grouping poses into identities/tracks that identify the same agent across temporal spans of 5 minutes while also spanning both cameras. Presently I have 10,256 identities (I estimate that at least 25-35% of those are duplicates, as people loiter or enter/exit multiple times per day).

I'm not sure what the breakdown of activities is, but they consist of skateboarding, rollerblading, running, walking & biking. The videos cover two spans of 7 consecutive days (2017-10-30 through 2017-11-05 and 2017-11-13 through 2017-11-19), which included sunny, cloudy and rainy weather conditions. There are no physical occlusions in the scene, but people are frequently occluded by other people throughout their tracks. The vast majority of identities consist of 150-600 poses; at the high end, about 50 loitering identities have 3,000-9,000 poses each.

I am interested in running your network on the resulting pose data and would be happy to share the results. Any thoughts or suggestions you've learned analyzing pose data?
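For anyone curious about the undistortion step: it's standard OpenCV once the cameras are calibrated. A minimal sketch, with placeholder intrinsics standing in for a real chessboard calibration:

```python
import cv2
import numpy as np

# Intrinsics and distortion coefficients would come from a one-time
# chessboard calibration (cv2.calibrateCamera); these values are placeholders.
camera_matrix = np.array([[2600.0, 0.0, 1632.0],
                          [0.0, 2600.0, 918.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.30, 0.12, 0.0, 0.0, -0.02])

frame = cv2.imread("frame_000001.jpg")  # hypothetical frame path
undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)
cv2.imwrite("frame_000001_undistorted.jpg", undistorted)
```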
Generally, you can try running the pre-trained models we release as a starting point.
My concern would be the annotation of activities. I assume the identities with 3,000-9,000 poses that you mention would serve as the segments for activities. If so, I would suggest first building a set of activity categories and then labeling each segment with one or several of them. For the taxonomy, you can refer to action recognition datasets such as UCF101 and Kinetics.
If you have any questions, please feel free to discuss them here. Overall, I highly encourage you to release the annotated dataset to the community, either through a conference paper or a workshop report, so that others can benefit from your effort and help you improve the quality of the data.
Thanks for the quick response. For 3 consecutive days across the two cameras there are 7,052,703 frames, and roughly 405,000 poses were detected at a 2 fps sampling rate. That leads me to estimate roughly 5-6M poses for the 3 days at 30 fps. Those counts are for filtered poses/persons with at least 4 detected joints, each at a confidence of 0.71 or greater.
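For concreteness, that filter over OpenPose's per-frame JSON might look like this (a minimal sketch; recent OpenPose versions name the keypoint array pose_keypoints_2d, older releases used pose_keypoints):

```python
import json

MIN_JOINTS = 4         # keep poses with at least this many confident joints
MIN_CONFIDENCE = 0.71  # per-joint confidence threshold

def filter_poses(frame_json_path):
    """Return the poses in one OpenPose frame JSON that pass the joint filter."""
    with open(frame_json_path) as f:
        frame = json.load(f)
    kept = []
    for person in frame["people"]:
        # pose_keypoints_2d is a flat [x, y, conf, x, y, conf, ...] list
        kp = person["pose_keypoints_2d"]
        confident = sum(1 for c in kp[2::3] if c >= MIN_CONFIDENCE)
        if confident >= MIN_JOINTS:
            kept.append(person)
    return kept
```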
I would be more than happy to release the dataset in the future. I've never given a talk at an industry conference, written an industry paper, or done a workshop report, so getting the data out there will require some assistance.
The raw x265 video files total roughly 1.2 TB. The annotated data contains some mistakes and hasn't been reviewed by anyone else.
I'll write a quick script to sum the length of the raw "people" array in the JSON output from OpenPose to give you an idea of the number of poses for a single camera on a single day. I'll update this post once I have that number for you. Inference should be complete in ~8 days.
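For what it's worth, that script can be tiny (a sketch, assuming one OpenPose JSON file per frame under a hypothetical directory):

```python
import json
from pathlib import Path

# Per-frame OpenPose JSON output for one camera on one day (hypothetical path)
json_dir = Path("camera1/2017-10-30/json")

total_poses = sum(
    len(json.loads(p.read_text())["people"])
    for p in json_dir.glob("*.json")
)
print(f"{total_poses:,} poses detected")
```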
Here is an example of crops of a single unique individual across two cameras. These crops and detected poses come from a 2 fps sampling of 2017-10-30 through 2017-11-01. An identity is simply an array of (camera, timestamp, pose index) tuples.
Green border: camera 1. Blue border: camera 2.
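Concretely, the identity representation needs nothing more than a list of tuples (a sketch; the field names are mine):

```python
from typing import List, NamedTuple

class Observation(NamedTuple):
    camera: int      # which camera captured the frame
    timestamp: str   # frame timestamp, e.g. "2017-10-30T14:05:30"
    pose_index: int  # index into that frame's "people" array from OpenPose

# An identity is simply every observation of one person
Identity = List[Observation]
```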
Here is an example showing the track the individual below took on camera 1 (green borders), overlaid on a background image. The location point is simply the midpoint between the ankles, which gives a rough position.
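Computing that point from an OpenPose COCO-format pose is a one-liner; in the 18-keypoint COCO layout the right and left ankles are indices 10 and 13 (sketch):

```python
R_ANKLE, L_ANKLE = 10, 13  # OpenPose COCO 18-keypoint indices

def ankle_midpoint(kp):
    """Midpoint between the ankles for one pose, where kp is OpenPose's
    flat [x, y, conf, x, y, conf, ...] keypoint list."""
    rx, ry, rc = kp[R_ANKLE * 3 : R_ANKLE * 3 + 3]
    lx, ly, lc = kp[L_ANKLE * 3 : L_ANKLE * 3 + 3]
    if rc == 0 or lc == 0:
        return None  # undetected joints come back as (0, 0, 0)
    return ((rx + lx) / 2.0, (ry + ly) / 2.0)
```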
In each successive step of the identification process I use the best pose (the one with the highest sum of per-joint confidences) from the previous step as the photo for the identity. For example, in step 2 there are 12 columns (one for each span from step 1), and each column has N rows, one per unique identity. I then align the same person horizontally, like a slot machine.
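Picking that best pose is just an argmax over summed joint confidences (sketch):

```python
def pose_score(person):
    """Sum of per-joint confidences for one OpenPose detection."""
    return sum(person["pose_keypoints_2d"][2::3])

def best_pose(poses):
    """Highest-scoring pose from a list of OpenPose detections."""
    return max(poses, key=pose_score)
```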
Below is an example of the app used to group poses into unique identities. The red number in the upper-left corner is the poseIndex for that frame. Each column is a frame and each row is a pose detected in that frame. The blue panel on the right shows the identity groups and the number of poses in each group.
Thanks, this is a nice collection of data. Please continue this nice work. One idea that comes to mind is that you could turn this into a collective effort, although given the privacy issues around human faces in your data, it is your call whether to do so.
To do this, you could set up a project page and recruit volunteer annotators. We can provide help with running baseline experiments. In the end, the well-annotated dataset can be contributed to the research community in the form of a paper or tech report.
@yjxiong thanks for the feedback. I've been reaching out to authors of papers similar to my work, and all have asked me to make the data public. I reached out to New Media Rights today and hope to hear back soon. I believe the data falls under fair use, and I will know more soon.
Once I (hopefully) get clearance to release the video, I will need to find a sponsor or supporting organization to host the data. The raw video for 14 days is ~1.2 TB, and I personally have no interest in paying AWS S3 transfer fees.
Thanks again for the feedback and I'll keep you updated if you are interested.
Forgot to update this: OpenPose and Mask_RCNN inference have both finished.
OpenPose detected 6,627,274 poses, of which 1,311,205 have 10 or more joints at >=0.72 confidence.
Here are the counts from Mask_RCNN: https://gist.github.com/sberryman/eef3a873e0e9976162226162d9a7c713
Cool. Have you had a chance to run ST-GCN on the extracted joints?
Getting that project set up right now; I hope to have inference started today, or tomorrow at the latest. New Media Rights has also agreed to take on my project, and I will be working with them to make the dataset public. It isn't moving very quickly, though.
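For reference, ST-GCN's data feeder consumes skeleton sequences shaped (N, C, T, V, M): C = 3 channels (x, y, confidence), T frames, V = 18 OpenPose COCO joints, and up to M = 2 bodies in the Kinetics-skeleton configuration. A minimal sketch of packing per-frame OpenPose detections into a single sample (the helper name is mine):

```python
import numpy as np

C, V, M = 3, 18, 2  # channels (x, y, conf), OpenPose COCO joints, max bodies

def pack_sequence(frames, T=300):
    """frames: list of per-frame OpenPose 'people' lists.
    Returns an array shaped (C, T, V, M), ready to batch for ST-GCN."""
    data = np.zeros((C, T, V, M), dtype=np.float32)
    for t, people in enumerate(frames[:T]):
        for m, person in enumerate(people[:M]):
            kp = np.asarray(person["pose_keypoints_2d"]).reshape(V, 3)
            data[0, t, :, m] = kp[:, 0]  # x coordinates
            data[1, t, :, m] = kp[:, 1]  # y coordinates
            data[2, t, :, m] = kp[:, 2]  # confidences
    return data
```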
Update: I have to go back, re-evaluate the poses, and assemble them into tracks at 30 fps. Right now I only have tracks at 2 fps resolution, so I need to fill in the gaps. I was hoping the AlphaPose team would have released their tracker by now, but they have not.
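In the meantime, one crude way to fill those gaps is to linearly interpolate each joint between consecutive 2 fps samples; going from 2 fps to 30 fps means synthesizing 14 frames per gap. A rough sketch (it ignores occlusions and identity switches):

```python
import numpy as np

def fill_gap(kp_a, kp_b, n_missing=14):
    """Linearly interpolate joints between two consecutive 2 fps samples.
    kp_a, kp_b: (V, 3) arrays of [x, y, conf] for the same identity.
    Returns n_missing synthesized (V, 3) arrays for the in-between frames."""
    return [
        (1 - w) * kp_a + w * kp_b
        for w in (i / (n_missing + 1) for i in range(1, n_missing + 1))
    ]
```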
Just came across this discussion and was wondering if there is an update on releasing the dataset, @sberryman?