Open sberryman opened 6 years ago

I came across your article on arXiv today and the timing is incredible. I've been working on capturing a large dataset for analyzing human activity based on skeletons estimated from video frames. My goal is to perform activity recognition and re-identification. This is a project for fun and a way for me to learn more about DNNs. Below is a summary of the dataset I've gathered so far.

So far I have 14 days of security camera video from 2 cameras, recorded at 3264×1836 and 30 fps, with every frame undistorted during daylight hours (and the ability to record any number of additional days). I'm in the process of running those frames through OpenPose using its maximum-accuracy configuration. Currently I'm only going to process 3 days of video from each camera, which equates to 7,052,703 frames and ~10 TB of frames stored as JPEG images. I have about 7-8 days left of OpenPose inference.

In the past I've sampled the same video at 2 fps and reduced resolution, and manually labeled the data by grouping poses into identities/tracks that identify the same agent across temporal spans of 5 minutes while also spanning both cameras. Presently I have 10,256 identities (I estimate that at least 25-35% of those are duplicates, as people loiter or enter/exit multiple times per day).

I'm not sure what the breakdown of activities is, but they consist of skateboarding, rollerblading, running, walking & biking. The videos cover two spans of 7 consecutive days (2017-10-30 through 2017-11-05 and 2017-11-13 through 2017-11-19), which included sunny, cloudy and rainy weather conditions. There are no physical occlusions in the scene, but people are frequently occluded by other people throughout their tracks. The vast majority of identities consist of 150-600 poses; at the high end, about 50 loitering identities have 3,000-9,000 poses each.

I am interested in running your network on the resulting pose data and would be happy to share the results. Any thoughts or suggestions you've learned analyzing pose data?
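For anyone curious about the undistortion step: it's standard OpenCV once the cameras are calibrated. A minimal sketch, with placeholder intrinsics standing in for a real chessboard calibration:

```python
import cv2
import numpy as np

# Intrinsics and distortion coefficients would come from a one-time
# chessboard calibration (cv2.calibrateCamera); these values are placeholders.
camera_matrix = np.array([[2600.0, 0.0, 1632.0],
                          [0.0, 2600.0, 918.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.30, 0.12, 0.0, 0.0, -0.02])

frame = cv2.imread("frame_000001.jpg")  # hypothetical frame path
undistorted = cv2.undistort(frame, camera_matrix, dist_coeffs)
cv2.imwrite("frame_000001_undistorted.jpg", undistorted)
```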
Generally, you can try running the pre-trained models we release as a starting point.
My concern would be the annotation of activities. I assume the identities with 3,000-9,000 poses that you mention would serve as the segments for activities. If so, I would suggest first building a set of activity categories and then labeling each segment with one or several of them. For the taxonomy, you can refer to action recognition datasets such as UCF101 and Kinetics.
If you have any questions, please feel free to discuss them here. Overall, I highly encourage you to release the annotated dataset to the community, either through a conference paper or a workshop report, so that others can benefit from your effort and help you improve the quality of the data.
Thanks for the quick response. For 3 consecutive days across the two cameras there are 7,052,703 frames, and roughly 405,000 poses were detected at a 2 fps sampling rate. That leads me to estimate roughly 5-6M poses for the 3 days at 30 fps. Those counts are for filtered poses/persons with at least 4 detected joints, each at a confidence of 0.71 or greater.
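For concreteness, that filter over OpenPose's per-frame JSON might look like this (a minimal sketch; recent OpenPose versions name the keypoint array pose_keypoints_2d, older releases used pose_keypoints):

```python
import json

MIN_JOINTS = 4         # keep poses with at least this many confident joints
MIN_CONFIDENCE = 0.71  # per-joint confidence threshold

def filter_poses(frame_json_path):
    """Return the poses in one OpenPose frame JSON that pass the joint filter."""
    with open(frame_json_path) as f:
        frame = json.load(f)
    kept = []
    for person in frame["people"]:
        # pose_keypoints_2d is a flat [x, y, conf, x, y, conf, ...] list
        kp = person["pose_keypoints_2d"]
        confident = sum(1 for c in kp[2::3] if c >= MIN_CONFIDENCE)
        if confident >= MIN_JOINTS:
            kept.append(person)
    return kept
```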
I would be more than happy to release the dataset in the future. I've never given a talk at an industry conference, written an industry paper, or done a workshop report, so getting the data out there will require some assistance.
The raw x265 video files total roughly 1.2 TB. The annotated data contains some mistakes and hasn't been reviewed by anyone else.
I'll write a quick script to sum the length of the raw "people" array in the JSON output from OpenPose to give you an idea of the number of poses for a single camera on a single day. I'll update this post once I have that number for you. Inference should be complete in ~8 days.
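For what it's worth, that script can be tiny (a sketch, assuming one OpenPose JSON file per frame under a hypothetical directory):

```python
import json
from pathlib import Path

# Per-frame OpenPose JSON output for one camera on one day (hypothetical path)
json_dir = Path("camera1/2017-10-30/json")

total_poses = sum(
    len(json.loads(p.read_text())["people"])
    for p in json_dir.glob("*.json")
)
print(f"{total_poses:,} poses detected")
```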
Here is an example of crops of a single unique individual across two cameras. These crops and detected poses come from a 2 fps sampling of 2017-10-30 through 2017-11-01. An identity is simply an array of (camera, timestamp, pose index) tuples.
Green border: camera 1. Blue border: camera 2.
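Concretely, the identity representation needs nothing more than a list of tuples (a sketch; the field names are mine):

```python
from typing import List, NamedTuple

class Observation(NamedTuple):
    camera: int      # which camera captured the frame
    timestamp: str   # frame timestamp, e.g. "2017-10-30T14:05:30"
    pose_index: int  # index into that frame's "people" array from OpenPose

# An identity is simply every observation of one person
Identity = List[Observation]
```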
Here is an example showing the track the individual below took on camera 1 (green borders), overlaid on a background image. The location point is simply the midpoint between the ankles, which gives a rough position.
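Computing that point from an OpenPose COCO-format pose is a one-liner; in the 18-keypoint COCO layout the right and left ankles are indices 10 and 13 (sketch):

```python
R_ANKLE, L_ANKLE = 10, 13  # OpenPose COCO 18-keypoint indices

def ankle_midpoint(kp):
    """Midpoint between the ankles for one pose, where kp is OpenPose's
    flat [x, y, conf, x, y, conf, ...] keypoint list."""
    rx, ry, rc = kp[R_ANKLE * 3 : R_ANKLE * 3 + 3]
    lx, ly, lc = kp[L_ANKLE * 3 : L_ANKLE * 3 + 3]
    if rc == 0 or lc == 0:
        return None  # undetected joints come back as (0, 0, 0)
    return ((rx + lx) / 2.0, (ry + ly) / 2.0)
```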
In each successive step of the identification process I use the best pose (the one with the highest sum of per-joint confidences) from the previous step as the photo for the identity. For example, in step 2 there are 12 columns (one for each span from step 1), and each column has N rows, one per unique identity. I then align the same person horizontally, like a slot machine.
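Picking that best pose is just an argmax over summed joint confidences (sketch):

```python
def pose_score(person):
    """Sum of per-joint confidences for one OpenPose detection."""
    return sum(person["pose_keypoints_2d"][2::3])

def best_pose(poses):
    """Highest-scoring pose from a list of OpenPose detections."""
    return max(poses, key=pose_score)
```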
Below is an example of the app used to group poses into unique identities. The red number in the upper-left corner is the poseIndex for that frame. Each column is a frame and each row is a pose detected in that frame. The blue panel on the right shows the identity groups and the number of poses in each group.
Thanks, this is a nice collection of data. Please continue this nice work. One idea that comes to mind is that you could turn this into a collective effort, although given the privacy issues around human faces in your data, it is your call whether to do so.
To do this, you could set up a project page and recruit volunteer annotators. We can provide help with running baseline experiments. In the end, the well-annotated dataset can be contributed to the research community in the form of a paper or tech report.
@yjxiong thanks for the feedback. I've been reaching out to authors of papers similar to my work, and all have asked me to make the data public. I reached out to New Media Rights today and hope to hear back soon. I believe the data falls under fair use, and I will know more soon.
Once I (hopefully) get clearance to release the video, I will need to find a sponsor or supporting organization to host the data. The raw video for 14 days is ~1.2 TB, and I personally have no interest in paying AWS S3 transfer fees.
Thanks again for the feedback and I'll keep you updated if you are interested.
Forgot to update this: OpenPose and Mask_RCNN inference have both finished.
OpenPose detected 6,627,274 poses, of which 1,311,205 have 10 or more joints at >=0.72 confidence.
Here are the counts from Mask_RCNN: https://gist.github.com/sberryman/eef3a873e0e9976162226162d9a7c713
Cool. Have you had a chance to run ST-GCN on the extracted joints?
Getting that project set up right now; I hope to have inference started today, or tomorrow at the latest. New Media Rights has also agreed to take on my project, and I will be working with them to make the dataset public. It isn't moving very quickly, though.
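For reference, ST-GCN's data feeder consumes skeleton sequences shaped (N, C, T, V, M): C = 3 channels (x, y, confidence), T frames, V = 18 OpenPose COCO joints, and up to M = 2 bodies in the Kinetics-skeleton configuration. A minimal sketch of packing per-frame OpenPose detections into a single sample (the helper name is mine):

```python
import numpy as np

C, V, M = 3, 18, 2  # channels (x, y, conf), OpenPose COCO joints, max bodies

def pack_sequence(frames, T=300):
    """frames: list of per-frame OpenPose 'people' lists.
    Returns an array shaped (C, T, V, M), ready to batch for ST-GCN."""
    data = np.zeros((C, T, V, M), dtype=np.float32)
    for t, people in enumerate(frames[:T]):
        for m, person in enumerate(people[:M]):
            kp = np.asarray(person["pose_keypoints_2d"]).reshape(V, 3)
            data[0, t, :, m] = kp[:, 0]  # x coordinates
            data[1, t, :, m] = kp[:, 1]  # y coordinates
            data[2, t, :, m] = kp[:, 2]  # confidences
    return data
```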
Update: I have to go back, re-evaluate the poses, and assemble them into tracks at 30 fps. Right now I only have tracks at 2 fps resolution, so I need to fill in the gaps. I was hoping the AlphaPose team would have released their tracker by now, but they have not.
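In the meantime, one crude way to fill those gaps is to linearly interpolate each joint between consecutive 2 fps samples; going from 2 fps to 30 fps means synthesizing 14 frames per gap. A rough sketch (it ignores occlusions and identity switches):

```python
import numpy as np

def fill_gap(kp_a, kp_b, n_missing=14):
    """Linearly interpolate joints between two consecutive 2 fps samples.
    kp_a, kp_b: (V, 3) arrays of [x, y, conf] for the same identity.
    Returns n_missing synthesized (V, 3) arrays for the in-between frames."""
    return [
        (1 - w) * kp_a + w * kp_b
        for w in (i / (n_missing + 1) for i in range(1, n_missing + 1))
    ]
```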
Just came across this discussion and was wondering if there is an update on releasing the dataset, @sberryman?