kristinbranson / APT

Animal Part Tracker
GNU General Public License v3.0

training error? odd DIST plot #408

Open happyqiu opened 1 year ago

happyqiu commented 1 year ago

Hi, I'm using the APT-develop branch. During training, the DIST panel didn't look right, although the Loss panel looked okay. After training, I tried to track another video but didn't get any predicted labels. Do you have any idea what's happening here?

Thanks!

[Attached images: DIST, track_img]

allenleetc commented 1 year ago

Hi @happyqiu,

Can you please share your project (.lbl) file and the movie you are trying to track? You will probably need to share a link to a cloud service (eg Google Drive) because these files will be too large to directly attach here.

When you track the video, have you looked at the log in the Tracking Monitor? There might be warnings or other messages printed there. If this log is available and you can upload it here that might also be useful.

Thanks!

happyqiu commented 1 year ago

Thanks for your reply. Please find the link below. (Video 20 is for tracking.)

https://drive.google.com/drive/folders/1gu0R8yqWSzz7APH54vOyIpgPIAOqHa_C?usp=sharing

Sorry that I don't have the log messages from the Tracking Monitor.


allenleetc commented 1 year ago

@happyqiu

Strange, if I try tracking with your trained tracker, I get predictions but they are 'garbage' (all in the upper-left corner). Maybe that is why you don't see them?

However, if I retrain, my loss/dist plots look normal and the tracking looks good.

Nothing jumps out yet -- maybe if it's not difficult, try doing a fresh retrain to see if anything changes? (Please save the training log just in case.) So far your training data looks normal so I wonder if it could be something in your environment/platform.
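One quick way to confirm the "all in the upper-left corner" symptom is to measure how many predicted points fall inside a small box at the image origin. The sketch below is only an illustration: the function name `fraction_near_origin`, the 5% corner size, and the plain list of `(x, y)` pixel coordinates are assumptions, not APT's tracking output format or API.

```python
def fraction_near_origin(points, image_width, image_height, corner_fraction=0.05):
    """Return the fraction of (x, y) points inside the top-left corner box.

    The box spans corner_fraction of the image width and height,
    measured from pixel (0, 0). All names here are hypothetical.
    """
    if not points:
        raise ValueError("points must be non-empty")
    x_limit = image_width * corner_fraction
    y_limit = image_height * corner_fraction
    near = sum(1 for x, y in points if x <= x_limit and y <= y_limit)
    return near / len(points)


# Made-up coordinates for a 640x480 frame, for illustration only.
plausible = [(120.0, 80.0), (300.5, 210.0), (415.0, 330.0)]
collapsed = [(0.0, 0.0), (1.5, 0.7), (0.2, 2.1)]

print(fraction_near_origin(plausible, 640, 480))
print(fraction_near_origin(collapsed, 640, 480))
```

A value close to 1.0 across many frames would suggest the tracker output has collapsed rather than that predictions are missing.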

happyqiu commented 1 year ago

That happens... When I tried this .lbl file on a computer with a TITAN X GPU, it worked fine. But on my own computer (NVIDIA RTX A5000), the problem occurred. Does the GPU matter that much? Everything else is installed the same way on both computers.


allenleetc commented 1 year ago

It looks like this could be a compatibility issue with the A5000. In develop we are on tf1.15; see eg

https://discuss.tensorflow.org/t/tensorflow-and-cuda-support-for-latest-nvida-a5000-ampere-gpu/3886

https://embea.de/blog/?p=114

@mkabra could @happyqiu have Ampere compatibility issues even if they switch to the multianimal branch? One of these links seems to suggest that tf2.4 is required.

In general the specific GPU can potentially matter as in eg https://github.com/kristinbranson/APT/issues/365.
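The reasoning behind the suspected incompatibility can be sketched as a small check. It assumes that the official TensorFlow 1.15 binaries are built against CUDA 10.0 and the 2.4 binaries against CUDA 11.0, and that Ampere GPUs (compute capability 8.x, which includes the RTX A5000 at 8.6) are only targeted by CUDA 11.0 and later. The names `TF_BUILD_CUDA` and `likely_compatible` are hypothetical, and the check says nothing about APT's own branches or custom TensorFlow builds.

```python
# Assumed CUDA toolkit versions of the official TensorFlow pip builds.
TF_BUILD_CUDA = {"1.15": (10, 0), "2.4": (11, 0)}

# Assumption: first CUDA toolkit that ships kernels for compute capability 8.x.
AMPERE_MIN_CUDA = (11, 0)


def likely_compatible(tf_version, compute_capability):
    """Rough guess whether an official TF build targets the given GPU.

    compute_capability is a (major, minor) tuple, e.g. (8, 6) for the A5000.
    Returns None when this sketch does not cover the case.
    """
    cuda = TF_BUILD_CUDA.get(tf_version)
    major, _minor = compute_capability
    if cuda is None or major > 8:
        return None  # unknown build or newer architecture: not covered here
    if major < 8:
        return True  # pre-Ampere cards are assumed fine for both builds
    return cuda >= AMPERE_MIN_CUDA


print(likely_compatible("1.15", (8, 6)))  # A5000 with a tf1.15 build
print(likely_compatible("2.4", (8, 6)))   # A5000 with a tf2.4 build
print(likely_compatible("1.15", (6, 1)))  # an illustrative pre-Ampere card
```

Under these assumptions the A5000 falls outside what a tf1.15 build targets, which would be consistent with training that runs but produces unusable results.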

mkabra commented 1 year ago

Yes, even on the multi-animal branch there will be compatibility issues with Ampere architectures.

@happyqiu, we are looking at updating the software to work with the newer architectures, but that could take a few months.

Mayank

