kristinbranson / APT

Animal Part Tracker
GNU General Public License v3.0

Training stuck at preprocessing #414

Open pmussp opened 6 days ago

pmussp commented 6 days ago

Hi,

I'm trying to train APT using the dlc network, but the training window gets stuck on "Preprocessing" without throwing any errors.

Screenshot

A Docker job is created, but the GPU does not appear to be in use: nvidia-smi shows only about 200 MB of GPU memory, all of it held by MATLAB. However, the Docker backend passes the APT test for GPU access.
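One way to make this check repeatable is to ask nvidia-smi for its list of compute processes in CSV form and parse the result. The following is a minimal sketch, not part of APT: the helper names `parse_compute_apps` and `query_compute_apps` are hypothetical, and it assumes nvidia-smi is on the PATH of the host. Note that this query only lists compute processes, so a process that merely holds a graphics context may not appear in it.

```python
import subprocess

# nvidia-smi query for compute processes; memory is reported in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_memory' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or malformed lines
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        rows.append({
            "pid": int(pid.strip()),
            "name": name.strip(),
            # nvidia-smi may print "[N/A]" when usage is not available
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows


def query_compute_apps():
    """Run nvidia-smi on this machine and return the parsed process list."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(out.stdout)
```

Calling `query_compute_apps()` while the training job is running and finding no Python process in the result would be consistent with the job not having reached the GPU yet, although it does not say why.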

These are the specs I'm using:

- Ubuntu 20.04.6 LTS
- MATLAB 2023b
- NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER]
- Driver Version: 550.54.15
- CUDA 11.8
- APT main branch
- Docker backend (latest)

I've attached the log file and the .lbl file. Any help debugging this issue would be greatly appreciated!

Best, Peter

20240625T155325view0_20240625T155325_tdptrx_new.log lbl_file.zip

mkabra commented 4 days ago

Hi Peter,

Thanks for attaching the screenshot and the log file. I assume it has been stuck on preprocessing for more than 5-10 minutes. Can you reopen the project, start the training, let it run for an hour, then use "Bundle working directory" from the File menu and send me the tarball (or the zip file)? Looking at the log files, the training is running, but the monitor is probably not getting updated.

Mayank



pmussp commented 4 days ago

Hi Mayank,

Here is a zip file of the working directory after running for ~1 hour.

working_dir.zip

mkabra commented 3 days ago

Ok yeah, it is not training at all. Can you send the movie used in the project (/mnt/data/peter_data/2024_05_24/experiment_01/experiment_01_20240524_084400_behavior_camera_video_undist.avi) and the git commit of APT that you are using, so that I can recreate the issue?

Mayank
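For reporting the exact APT version, the commit hash of the local checkout is usually the most precise answer. A minimal sketch, assuming git is installed and APT was obtained with `git clone`; the function name `git_commit` and the example path are hypothetical:

```python
import subprocess


def git_commit(repo_dir, run=subprocess.run):
    """Return the full hash of the commit checked out in repo_dir.

    `run` is injectable so the function can be exercised without git.
    """
    result = run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, `git_commit("/path/to/APT")` would return the hash to paste into the issue, where `/path/to/APT` stands in for wherever the repository was cloned.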

