google-deepmind / tapnet

Tracking Any Point (TAP)
https://deepmind-tapir.github.io/blogpost.html
Apache License 2.0

ValueError: All `hk.Module`s must be initialized inside an `hk.transform` #80

Closed: mysticalfirellama closed this issue 2 months ago

mysticalfirellama commented 9 months ago

I followed all of the instructions in the "Live Demo" section of README.md, including installing the dependencies and updating the PYTHONPATH.

However, I get the following error when I run `python3 ./tapnet/live_demo.py`:

  File "[...]/tapnet/live_demo.py", line 38, in <module>
    tapir = tapir_model.TAPIR(
  File "[...]/.local/lib/python3.10/site-packages/haiku/_src/module.py", line 139, in __call__
    init(module, *args, **kwargs)
  File "{[...]/.local}/lib/python3.10/site-packages/haiku/_src/module.py", line 433, in wrapped
    raise ValueError(
ValueError: All `hk.Module`s must be initialized inside an `hk.transform`.
Air1000thsummer commented 9 months ago

I'm facing the same issue. Have you fixed it?

cdoersch commented 9 months ago

Yup, I forgot that even the TAPIR constructor needs to be called inside a Haiku transform. I've attached a working version of this script, but it's a bit ugly; we may need to work on simplifying this further. live_demo.zip
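For reference, the shape of the fix is to construct the model inside the function passed to `hk.transform_with_state`. Below is a minimal sketch (not the attached script); the constructor and call arguments are assumptions based on tapir_model.py and may need adjusting (e.g. `use_causal_conv=True` for the live demo):

```python
import haiku as hk
import jax

from tapnet import tapir_model


def build_model(frames, query_points):
  # Constructing TAPIR here, inside the transformed function, is what avoids
  # "All `hk.Module`s must be initialized inside an `hk.transform`".
  model = tapir_model.TAPIR()
  return model(
      video=frames,
      is_training=False,
      query_points=query_points,
      query_chunk_size=64,
  )


model = hk.transform_with_state(build_model)
model_apply = jax.jit(model.apply)
# With `params`/`state` restored from a checkpoint and an `rng` key:
# outputs, _ = model_apply(params, state, rng, frames, query_points)
```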

Air1000thsummer commented 9 months ago

And you, my friend, are a true hero.

nutsintheshell commented 8 months ago

> Yup, I forgot that even the TAPIR constructor needs to be called inside a Haiku transform. I've attached a working version of this script, but it's a bit ugly; we may need to work on simplifying this further. live_demo.zip

Hi, your work is pretty good, but when I tried to modify your code I found the performance decreased a lot. First, I loaded a video and used a query frame from another video as the first frame of the tracked video (i.e., I combined the query frame and the video to be tracked into a single synthetic video). Then I fed this synthetic video, with the query frame as its first frame, into the online model. However, the performance was not good, especially on the first few frames. My guess is that the query frame, coming from another video, is too different from the video I want to track (it is also officially recommended to use video of more than 12 fps, so performance probably drops when two neighboring frames differ a lot). Could you tell me why the performance decreased, or is it just some other detail (is my code structure right)?

I also tried another way: I first load the query frame, adapt your compilation part into a step that extracts the query features, then reinitialize the causal state, and then track these points on the video. Is this right? I'm new to Haiku and the other packages. I wonder if it's necessary to learn Haiku, TensorFlow, JAX and the other packages. Very grateful.

nutsintheshell commented 8 months ago

> Yup, I forgot that even the TAPIR constructor needs to be called inside a Haiku transform. I've attached a working version of this script, but it's a bit ugly; we may need to work on simplifying this further. live_demo.zip

Here is the video I am trying to track. The first frame is the query frame; the other frames are the tracked frames. The performance is not good on the first few frames and the last few frames. (I draw all points onto the frames, both visible and invisible, without removing the invisible ones.)

https://github.com/google-deepmind/tapnet/assets/110964890/ed676d6f-c9b3-4e2a-a34a-9c9253260dad

cdoersch commented 8 months ago

Yes, if you want to track across videos, we recommend that you have separate forward passes to extract features and to perform tracking; the model hasn't seen cuts during training, so it tends to get confused by them.

That said, these results look extremely jittery. What points are you trying to track? The model will always struggle to track textureless regions, but it should do OK for points on the objects. I suspect there's something wrong with the way you're using the causal state, but it's hard to say without seeing the code.

(FYI, we're hoping to release a better version of this interface in the next few days, as soon as I get around to updating our colabs)
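To make the separate-passes idea above concrete, here is a rough sketch of one way to structure it. The method names (`get_feature_grids`, `get_query_features`, `estimate_trajectories`) and their arguments are assumptions based on tapir_model.py, and the details may differ from the attached script:

```python
import haiku as hk

from tapnet import tapir_model


def build_online_model_init(frames, query_points):
  """Pass 1: extract query features from the query frame(s) only."""
  model = tapir_model.TAPIR(use_causal_conv=True)
  feature_grids = model.get_feature_grids(frames, is_training=False)
  return model.get_query_features(
      frames,
      is_training=False,
      query_points=query_points,
      feature_grids=feature_grids,
  )


def build_online_model_predict(frames, query_features, causal_context):
  """Pass 2: track previously extracted query features on new frames."""
  model = tapir_model.TAPIR(use_causal_conv=True)
  feature_grids = model.get_feature_grids(frames, is_training=False)
  trajectories = model.estimate_trajectories(
      frames.shape[-3:-1],  # (height, width)
      is_training=False,
      feature_grids=feature_grids,
      query_features=query_features,
      query_points_in_video=None,
      query_chunk_size=64,
      causal_context=causal_context,
      get_causal_context=True,
  )
  causal_context = trajectories['causal_context']
  del trajectories['causal_context']
  # Keep only the final refinement iteration of each output.
  return {k: v[-1] for k, v in trajectories.items()}, causal_context


online_init = hk.transform_with_state(build_online_model_init)
online_predict = hk.transform_with_state(build_online_model_predict)
# Run online_init once on the query video to get query_features, build a fresh
# causal state, then call online_predict frame by frame on the target video.
```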

nutsintheshell commented 8 months ago

> Yes, if you want to track across videos, we recommend that you have separate forward passes to extract features and to perform tracking; the model hasn't seen cuts during training, so it tends to get confused by them.
>
> That said, these results look extremely jittery. What points are you trying to track? The model will always struggle to track textureless regions, but it should do OK for points on the objects. I suspect there's something wrong with the way you're using the causal state, but it's hard to say without seeing the code.
>
> (FYI, we're hoping to release a better version of this interface in the next few days, as soon as I get around to updating our colabs)

The points I am trying to track are seven active points taken from two demos (six of them are on the red cube that is going to be grasped). As you say, the problem in my video is probably the causal state, so I tried to separate the two parts. Below are the details; do you think my approach is right? (By the way, the video shown above had a bug in the pixel indexing: OpenCV uses a different index order than I expected. The video I show below fixes this.) I initialize the causal state right after extracting the query features from the query frame, using the call shown in the block below, and then track the query points on my camera feed (which, I think, separates the query frame from the video to be tracked). The result gets better, but there is still a gap between my result and the one shown at https://robotap.github.io/
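(Here is that causal-state initialization written out as a block for readability; `tapir()`, `NUM_POINTS`, `params`, `state`, `rng` and `query_features` come from my script, based on the attached live_demo, not necessarily the current repo API.)

```python
# Re-initialize the causal state after extracting the query features.
causal_state = hk.transform_with_state(
    lambda: tapir().construct_initial_causal_state(
        NUM_POINTS, len(query_features.resolutions) - 1
    )
).apply(params=params, state=state, rng=rng)[0]
```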

I look forward to seeing your update of the online version.

nutsintheshell commented 8 months ago

> Yes, if you want to track across videos, we recommend that you have separate forward passes to extract features and to perform tracking; the model hasn't seen cuts during training, so it tends to get confused by them.
>
> That said, these results look extremely jittery. What points are you trying to track? The model will always struggle to track textureless regions, but it should do OK for points on the objects. I suspect there's something wrong with the way you're using the causal state, but it's hard to say without seeing the code.
>
> (FYI, we're hoping to release a better version of this interface in the next few days, as soon as I get around to updating our colabs)

This is a demo video with the seven active points (tracked by TAPIR).

https://github.com/google-deepmind/tapnet/assets/110964890/3f207f3d-4188-44ff-b24f-b7cf6e04bc5b

Here is my video with the two parts separated (a get-query-feature part and an online tracking part). The blue points are the active demo points and the red points are the active points being tracked online. You can ignore the green and blue arrows (I use them for debugging).

https://github.com/google-deepmind/tapnet/assets/110964890/364a378b-d9d0-4628-963a-036f0fcb0795

cdoersch commented 8 months ago

The tracks on your demo look pretty reasonable to me, which suggests you're using the code correctly.

It looks like the objects are oriented differently between the demo video and the test-time video. This is a known weakness of TAPIR--there's relatively little in-plane rotation in Kubric, so the model doesn't have very good invariance to it. Also, are you re-using textures across different objects? This may cause problems as well (TAPIR may have spurious matches on the wrong object). In real videos, stochastic textures like wood grain are unlikely to repeat exactly. I expect BootsTAP will improve on both of these; we hope to release a causal BootsTAPIR model sometime in the next few weeks. However, it may not completely solve these problems.

Also, are you plotting occluded points in the test-time video? I'd like to see a version where you don't do this; TAPIR shouldn't be marking those points as visible since they're obviously wrong.

nutsintheshell commented 8 months ago

> The tracks on your demo look pretty reasonable to me, which suggests you're using the code correctly.
>
> It looks like the objects are oriented differently between the demo video and the test-time video. This is a known weakness of TAPIR--there's relatively little in-plane rotation in Kubric, so the model doesn't have very good invariance to it. Also, are you re-using textures across different objects? This may cause problems as well (TAPIR may have spurious matches on the wrong object). In real videos, stochastic textures like wood grain are unlikely to repeat exactly. I expect BootsTAP will improve on both of these; we hope to release a causal BootsTAPIR model sometime in the next few weeks. However, it may not completely solve these problems.
>
> Also, are you plotting occluded points in the test-time video? I'd like to see a version where you don't do this; TAPIR shouldn't be marking those points as visible since they're obviously wrong.

Thanks for your reply. First, yes, I re-use the wood texture; I would be grateful if a better model were released. Second, I was plotting all the points without considering visibility. The video below only shows points whose visibility is over 0.5. The red points are the online tracked points and the blue points are the demo points. I find that only one of the seven points can be seen.

https://github.com/google-deepmind/tapnet/assets/110964890/4b256770-5064-4588-b3fc-e1d8dd27969a
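(The filtering I describe amounts to something like the sketch below; `tracks` and `visibles` are placeholder names from my plotting code, and I assume `visibles` holds visibility probabilities in [0, 1].)

```python
import cv2  # OpenCV, used here only for drawing


def draw_visible_points(frame, tracks, visibles, threshold=0.5):
  """Draw only the points whose predicted visibility exceeds the threshold."""
  for (x, y), vis in zip(tracks, visibles):
    if vis > threshold:
      cv2.circle(frame, (int(x), int(y)), 3, (0, 0, 255), -1)  # red dot
  return frame
```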

nutsintheshell commented 8 months ago

> The tracks on your demo look pretty reasonable to me, which suggests you're using the code correctly.
>
> It looks like the objects are oriented differently between the demo video and the test-time video. This is a known weakness of TAPIR--there's relatively little in-plane rotation in Kubric, so the model doesn't have very good invariance to it. Also, are you re-using textures across different objects? This may cause problems as well (TAPIR may have spurious matches on the wrong object). In real videos, stochastic textures like wood grain are unlikely to repeat exactly. I expect BootsTAP will improve on both of these; we hope to release a causal BootsTAPIR model sometime in the next few weeks. However, it may not completely solve these problems.
>
> Also, are you plotting occluded points in the test-time video? I'd like to see a version where you don't do this; TAPIR shouldn't be marking those points as visible since they're obviously wrong.

What's more, I find that TAPIR takes more than half an hour to run inference on a video of only 500 frames (the online version is much faster). I don't know the reason (maybe because I can't use parallel compilation, but I'm not sure). Could you tell me how to solve this problem? Thanks.

cdoersch commented 7 months ago

The original bug about "All hk.Modules must be initialized inside an hk.transform" in live_demo.py should be fixed with the latest push.

Unfortunately, this push also includes the update that replaces the deprecated `jax.tree_map` with the very recently introduced `jax.tree.map`, so the codebase now requires a very recent version of jax in order to run. It should be safe to do a find/replace back to `jax.tree_map` to be compatible with older versions; we aren't using any other new jax features.
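Alternatively, a small compatibility shim along these lines should work with both old and new jax versions (a sketch, not something that is in the repo):

```python
import jax

# Use the new jax.tree.map namespace when available, otherwise fall back to
# the older (now deprecated) jax.tree_map alias.
if hasattr(jax, "tree") and hasattr(jax.tree, "map"):
  tree_map = jax.tree.map
else:
  tree_map = jax.tree_map
```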