harvitronix / five-video-classification-methods

Code that accompanies my blog post outlining five video classification methods in Keras and TensorFlow
https://medium.com/@harvitronix/five-video-classification-methods-implemented-in-keras-and-tensorflow-99cad29cc0b5
MIT License

Useful? #68


lesreaper commented 6 years ago

Ok, I'm going to try to be productive here with my questions, though I am a little frustrated, to say the least.

Trying to get it to do a simple prediction on the playground-add-synthethic branch. Keeping it simple: I just want it to make a prediction and spit out the class as a number or string. I'm feeding model.predict a numpy array of shape (1, 1, 50, 224, 224, 3). The 50 frames come from adjustments I made to my training set.

I pass the numpy array into evaluate.py:

    import numpy as np

    # Flatten the frames out of the nested array.
    x = []
    y = []
    for row in images_array:
        for sample in row[0][0]:
            x.append(sample)
        # for sample in row[1]:
        #     y.append(sample)
    eval_x = np.array(x)
    # eval_y = np.array(y)

    predictions = model.predict(eval_x, verbose=1, batch_size=8)
    predictions_index = predictions.argmax(axis=-1)

I don't even know how it's going to give me a class, but I'm passing the numpy array right now anyway. This is the error I'm getting:

ValueError: Error when checking : expected lstm_1_input to have shape (40, 2048) but got array with shape (224, 3)

I'm using lstm as the model name. I don't know where to change the number of frames for the shape, or why it's expecting a flattened 2048-dimensional input. How would I even get my 224 x 224 images down that far? I'm confused and almost ready to throw in the towel on this whole thing, but I feel like I'm really close. Argh!

harvitronix commented 6 years ago

Hi @lesreaper. I'm sorry my code is causing you issues.

Out of curiosity, why are you using the playground branch? That's an experimental/future version that isn't really ready for production and isn't compatible with the master branch. Is there functionality there that you need and aren't finding on the master branch?

If you want to use the LSTM in the master branch, you first need to extract features from your images using a convnet. That's what the (40, 2048) is all about: 40 frames, each represented by a 2048-d feature vector taken from the output of one of the CNN's layers.
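For reference, here's roughly what that extraction step looks like (a minimal sketch using an ImageNet-pretrained InceptionV3; `frame_paths` is a stand-in for your own list of frame image paths, not a variable from the repo):

    import numpy as np
    from keras.preprocessing import image
    from keras.applications.inception_v3 import InceptionV3, preprocess_input
    from keras.models import Model

    # Take the 2048-d output of the final pooling layer as the feature vector.
    base = InceptionV3(weights='imagenet', include_top=True)
    extractor = Model(inputs=base.input, outputs=base.get_layer('avg_pool').output)

    def extract(frame_path):
        img = image.load_img(frame_path, target_size=(299, 299))
        arr = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return extractor.predict(arr)[0]  # shape: (2048,)

    # One sequence of features for the LSTM: (n_frames, 2048), e.g. (40, 2048).
    sequence = np.array([extract(p) for p in frame_paths])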

Apologies again that my code is confusing and/or frustrating. Hope I can help get you back on track with it.

lesreaper commented 6 years ago

Hey @harvitronix, thanks for getting back to me so fast! I apologize if I came off as ungrateful; I know you're putting this out for free, and I appreciate it.

I'm using the playground branch because it's the only one that makes an attempt at a demo file to classify a video; the master branch seems to be for training only. I did train my models using that branch, which, after some minor changes and adding two of my own classes, worked great, I believe.

I figured out last night that the 2048 comes from the feature extraction. How do I get that into the input array for the prediction?

For example, for prediction, I grabbed the sequence of images, converted them to a numpy array, and then stacked them along the axis, so it has a shape of (40). How do I get the (,2048) in there to feed to model.predict()? I thought those features would have been saved in the model and then loaded up with model = load_model(path_to_model).

harvitronix commented 6 years ago

Assuming you've already extracted the features from your prediction sequences (you would have needed to for training), the problem to solve is this: the new branch with the demo-type script has no concept of separate feature extraction. The models in that branch are intended to be end-to-end. Apologies for the confusion there.

So the best solution for your problem, I think, is to create a demo-type script for the master branch. This demo would look for extracted features for the files you're feeding it and use those features to build an input of shape (n_frames, 2048) for the model to predict on. Unfortunately, converting your sequences to arrays and stacking them will not produce the processed features you need to make predictions.
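In outline, such a script might look something like this (a sketch only; the checkpoint and feature-file paths are placeholders, not actual files from the repo):

    import numpy as np
    from keras.models import load_model

    # Placeholder paths -- point these at your own trained checkpoint
    # and a sequence of features saved during extraction.
    model = load_model('data/checkpoints/lstm.hdf5')
    sequence = np.load('data/sequences/myvideo-40-features.npy')  # (40, 2048)

    # Add a batch dimension -> (1, 40, 2048), then predict.
    prediction = model.predict(np.expand_dims(sequence, axis=0))
    print(prediction.argmax(axis=-1)[0])  # index of the predicted class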

Let me see if I can whip one up tonight. I know many people would benefit from such a tool.

lesreaper commented 6 years ago

Great, thanks, I'll take a look at this tonight as well. My goal is to be able to chunk a video stream (option #4) and classify it in real time.

So, you're saying to take each image in the sequence, feed it into the extractor, get the 2048 features for each frame, build a new array from that sequence of feature vectors, and then feed that into model.predict? Something like the pseudocode below?
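    # Pseudocode just to check my understanding -- `extract` stands in
    # for whatever produces the 2048-d vector per frame.
    features = np.array([extract(frame) for frame in frames])     # (n_frames, 2048)
    prediction = model.predict(np.expand_dims(features, axis=0))  # (1, n_classes)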

opaserin commented 6 years ago

Hi harvitronix and lesreaper, any progress with that demo script? I've found this work really helpful so far and would like to be able to run a test video as well.

lesreaper commented 6 years ago

I haven't gotten much closer yet. Still going.

harvitronix commented 6 years ago

See #70

lesreaper commented 6 years ago

Thanks @harvitronix, I'll try it out next! I had moved on to a PyTorch library that does the same thing, but it has inference problems of its own. It seems this is an area of deep learning that could definitely use some attention long term!

harvitronix commented 6 years ago

For sure. I'd be curious how you're approaching this in PyTorch. Been planning to spend some time learning it but haven't yet.

lesreaper commented 6 years ago

This is the PyTorch library; you can see their whole implementation here: https://github.com/kenshohara/3D-ResNets-PyTorch. I ran into issues because I'm adding my own class to the UCF101 dataset.

JasonHe78 commented 6 years ago

My English is not very good, but generally speaking, this project is very valuable to understand. Has anyone translated it into Chinese?

Azier777 commented 6 years ago

I liked it. This whole thing is awesome.