mooch443 / trex

TRex, a fast multi-animal tracking system with markerless identification, and 2D estimation of posture and visual fields.
https://trex.run
GNU General Public License v3.0

Cannot get auto_train working #41

Closed oiluigio closed 3 years ago

oiluigio commented 3 years ago

log_trex.txt

I am trying to use the training network in TRex, for example by adding the option -auto_train, but I always get the same error:

[EXCEPTION 16:15:55 gui.cpp:4889] The training process failed. Please check whether you are in the right python environment and check previous error messages.

Attached is the full log. TRex was able to find the GPU, so I'd say that can't be the problem, and from looking at forums it seems the messages raised by TensorFlow are only warnings.

I am using Python 3.7.10 and Miniconda 4.9.2 on an x86_64 Linux machine.

thanks in advance

giulio

mooch443 commented 3 years ago

Hey,

thank you for the report. This message means that your tracking parameters likely do not work for the problem at hand. It seems you are setting the number of individuals - which is important - but did you check whether it actually found any viable segments? You can see this if you start TRex and there are differently colored bars within the top-bar (after it finishes tracking). If not, then likely there is never a moment in time where all individuals are separated from each other.

There is not much more that I can tell from just looking at the log file, but maybe you can also attach a screenshot of what the tracking looks like in TRex?

-Tristan

oiluigio commented 3 years ago

Hi,

Many thanks for your reply; attached is a screenshot of the log + TRex GUI. I am not sure what a viable segment should look like, but judging by the colored contours in the screenshot I'd say I am on the right track? The top bar is full of yellowish lines, but I don't see "different" colors. I was able to reproduce this issue on both a GPU machine (trex_debug_server) and a non-GPU machine (trex_debug).

P.S. I managed to start auto-train (still working to get good results) on both machines for a longer portion of the same video; could the video length be the reason?

-Giulio

mooch443 commented 3 years ago

Hey, sorry for the delay. I think you might have forgotten the log? In any case: there seems to be a problem here with deciding what is an individual and what isn't. I am not entirely sure what the individuals look like - they seem to be the slightly bigger blobs? Make sure that you are only tracking the things that are actually individuals. You can specify sizes in blob_size_ranges (press D while you're in TRex to reveal the size metrics, etc.). As an example, if individuals usually show up in the D view-mode as 0.05, I would go for a value of blob_size_ranges = [[0.01,0.1]].

Before you do that, however, you may try increasing track_threshold to get rid of some of the noise here. I think reducing noise should be your first priority, if at all possible. This may also be better solved by a different background mode during conversion, or by different filters. Would you be able to share a screenshot of your original video, or a short original video clip?

Edit: I forgot to mention - I meant colors different than yellow. Yellow indicates problems with the tracking. Have a look here: https://trex.run/docs/gui.html#timeline
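As a hedged illustration, the two suggestions above could be combined in a TRex settings file roughly like this (both values are placeholders and should be tuned against what the D view reports for your video):

```
blob_size_ranges = [[0.01,0.1]]
track_threshold = 35
```

blob_size_ranges keeps only blobs whose size falls inside the given range(s), while raising track_threshold discards more of the low-contrast background noise before tracking.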

oiluigio commented 3 years ago

Hi, many thanks for your answer. I may have made some progress, but I haven't managed to get good training results and I am stuck with the following two scenarios:

(1) I managed to get a timeline showing three coloured rectangles (red, green, yellow), but only at the price of setting the following parameters:

- a low threshold (15)
- a high max blob size (20, while a typical object is 0.3)
- a small track_max_reassign_time, e.g. lower than the time between two consecutive timestamps, which I guess pushes this parameter to the extreme, as the tracker does not wait at all.

In this case the training stops with an error saying there are not enough consecutive segments and suggesting I modify the three parameters I just mentioned. out.log and err.log are attached: test_1.out.log test_1.err.log

(2) If I instead adjust the three parameters above, taking:

- a higher threshold (60)
- a lower max blob size
- a longer track_max_reassign_time

the segmented video and plain tracking (no training) look better and less noisy, yet the timeline is full of thin yellow bars. Indeed I never get the three rectangles as in (1), except in what looks like a narrow range around the set of parameters in (1). In this case the training almost completed, but stops at epoch 34/37 (after the first 150/150 epoch cycle was completed), saying memory is not enough. I am using a machine with 32 GB RAM + 32 GB swap and tracked the memory usage with syrupy.py. If I understand correctly, VSIZE (RAM + swap) is around 54 GB before TensorFlow raises an error. out.log, err.log and ps.log are attached: test_2.ps.log test_2.out.log test_2.err.log

mooch443 commented 3 years ago

Hey!

1) This is likely not a good solution. The parameters are there to act as proper filters, so that the machine learning does not have to look at all the noise blobs they sort out. If you change your parameter values like this, you are likely to end up with segments that 1. are consecutive, as you described, but 2. are not actually consecutive (you merely tell the program that they are). Giving TRex a bit more freedom may be good, but too much freedom is not good (lest it take over the world).

2) This sounds more promising, since you are actually removing noise here. Could you provide the output of nvidia-smi at the time of training? The log says "device:GPU:0 with 58 MB memory" and it seems to run out of memory almost immediately (after about 20 s in .err.log). Your GPU has less memory than mine, but I assumed that 5 GB should be enough for this. If it helps, you could try to further limit the number of samples that TRex can push to the GPU by setting gpu_max_sample_gb to something lower.
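For instance, a conservative cap could look like this in the settings file (0.5 is just a placeholder; lower it further if the out-of-memory error persists):

```
gpu_max_sample_gb = 0.5
```

The same parameter can also be passed on the command line, e.g. as -gpu_max_sample_gb 0.5.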

Let me know if this helps! -Tristan

oiluigio commented 3 years ago

Hey,

Sorry for the late reply! I managed to get the training process to complete using -gpu_max_sample_gb 0.2 (anything higher produces the same error on my machine). Still, if I interpret the log correctly, the result of the training is quite bad? A visual check confirms that identity is rapidly (within a few minutes) lost whenever the insects are too crowded. I tried with a 7-minute and a 1-hour video of the same experiment and got similar results. Any suggestions? Attached are the two log files,

many thanks

Giulio

trex_train_7m_log.txt trex_train_1h_log.txt

mooch443 commented 3 years ago

Hey,

also late :(

so the first log says this at the beginning: [14:04:18] Fewest samples for an individual is 1 samples - same goes for the second log. This is probably not enough to get good training results. It also says that Some (83%) termite images are too big. Range: [23x23, 173x199] median 83x78 which you can fix by lowering recognition_image_scale or increasing recognition_image_size. This might also mean that there is a lot (83%) noise in your data, so you might want to set recognition_save_training_images to true and have a look at the .npz file that it outputs (containing all the training images). It will be in the output folder called filename+"_validation_data.npz".
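To peek inside that archive, a small NumPy sketch could look like the following. Note the file name and the "images" key here are assumptions for illustration (check data.files for the real array names in the file TRex writes):

```python
import numpy as np

# Stand-in archive so the snippet is self-contained; in practice you would
# open the file TRex writes, e.g. "<videoname>_validation_data.npz".
path = "video_validation_data.npz"
np.savez(path, images=np.zeros((12, 80, 80, 1), dtype=np.uint8))

with np.load(path) as data:
    print(data.files)          # which arrays the archive contains
    images = data["images"]    # key name is an assumption
    print(images.shape)        # sample count and image dimensions
```

From there the individual training images can be displayed with any image viewer or matplotlib to spot noise blobs among them.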

If you cannot get more than one sample per individual, you may have to do some manual work after all, or try to get longer consecutive segments.

-Tristan

oiluigio commented 3 years ago

Hi, thanks again for your reply. I managed to drop the 'too big images' to 1% using recognition_image_scale 0.4 and recognition_image_size [160,160]. The training score also increased remarkably, but I am afraid it is still far from the hoped-for standard, and a visual check shows identity mismatches appear after a few minutes of video. trex_160160_04_log.txt

I tried to have a look to validation images that are put together here:

https://user-images.githubusercontent.com/38915394/119728071-54e8ee00-be73-11eb-9c46-b4c340791467.mp4

They are not all excellent, but most of them clearly correspond to one individual. In a few you see two individuals, or their reflection on the Petri dish wall.

What would you suggest to get longer consecutive segments?

many thanks

Giulio

mooch443 commented 3 years ago

Hi,

thanks again for your reply.

I managed to drop the 'too big images' to 1% using recognition_image_scale 0.4 and recognition_image_size [160,160].

This still seems quite large. But at least it does not seem to be dropping too many images. For my work, I usually use 64px or 48px sizes (or size them down so they fit). You can also have a look at the exported images if you set this parameter: https://trex.run/docs/parameters_trex.html#recognition_save_training_images but you know about this one already, I presume. :)
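For comparison, a settings fragment along those lines might read as follows (these values come from the sizes mentioned above and are a starting point, not a prescription):

```
recognition_image_size = [64,64]
recognition_save_training_images = true
```

Smaller normalized images generally mean less GPU memory per sample, which may also ease the earlier out-of-memory problems.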

The training score also increased remarkably, but I am afraid it is still far from the hoped-for standard, and a visual check shows identity mismatches appear after a few minutes of video.

trex_160160_04_log.txt

I tried to have a look to validation images that are put together here:

https://user-images.githubusercontent.com/38915394/119728071-54e8ee00-be73-11eb-9c46-b4c340791467.mp4

They are not all excellent, but most of them clearly correspond to one individual. In a few you see two individuals, or their reflection on the Petri dish wall.

What would you suggest to get longer consecutive segments?

Yes, this should likely be the goal. It is kind of hard for me to tell whether the sample you sent is from one individual - but if all individuals have that many uninterrupted sequences, then one should expect recognition to work. We have had some bad experiences with clonal individuals in the past - if these individuals are just too similar, then it simply can't work without additional info. For these kinds of animals you could of course still mark them in some way (size, markings, etc.), but that is not the point of the software here. In the new version you can try to differentiate between groups of individuals manually - e.g. soldiers and workers. This would halve the problem for the visual recognition: track_only_categories will help here (after categorize and Apply). Docs will follow.

many thanks

Giulio

Sorry for the late reply :( hope I still helped! -Tristan

oiluigio commented 3 years ago

Hi,

many thanks for your reply

I realise I was not clear in describing my video: those are all the training pics together (for all individuals). There are actually fewer than 10 images per individual on average. Here is a new version of the video

https://user-images.githubusercontent.com/38915394/122586496-0b985280-d05d-11eb-90f7-270084968535.mp4

with an additional label specifying the ID for each image, just in case that changes your advice.

thanks again

regards

Giulio

mooch443 commented 3 years ago

Hey,

How long are your videos generally, and is the composition of individuals the same - or does the group change? 10 images per individual is very very likely not enough. If your video is long, then there should be a higher chance of getting longer segments (more samples), but this is the goal in any case.

Sorry that I am not being more helpful here! -Tristan

oiluigio commented 3 years ago

Hey,

Videos are actually quite long (hours); the problem seems to be that the consecutive segments produced by plain tracking are not long enough. And yes, the composition is always the same, even though the termites spend a lot of time close to the Petri dish wall, which causes several of them to "disappear" from the point of view of TGrabs and may explain why the consecutive segments are that short.

By the way, is there a way to manually correct the plain tracking of individuals, say for a few minutes, and feed that data to the training network? I guess you can using the "match" menu, but I could not understand how! Also, for some reason, when I open the "match" menu I have access only to the first 34 individuals, as the menu extends beyond the bottom edge of the screen.

Many thanks again for your support!

Giulio

mooch443 commented 3 years ago

Hey!

Ah okay, it is a good thing that they're long. Having individuals climb the walls is a common problem, since that makes them invisible - or at least half invisible - to TGrabs. We have the same issue here with locusts that just love to sit there. You can experiment and see if track_max_speed can be increased or track_trusted_probability can be lowered. This way you can sometimes get longer segments (in v1.1.3, when you have an individual selected, you can hover over the segments at the top left to see why they stopped).
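As a hedged starting point for that experimentation (both values are placeholders; track_max_speed is in real-world units per second, so the right number depends on your cm_per_pixel calibration):

```
track_max_speed = 10
track_trusted_probability = 0.25
```

Loosening these lets the tracker keep a segment alive through brief fast or low-confidence stretches, at the cost of a higher risk of wrong assignments.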

The way to go here (with this many individuals) is to go to RAW mode (press D) and click on one of the white dots in the center of individuals. You can then search for an identity you want to assign and click it.

-Tristan