microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability
https://microsoft.github.io/farmvibes-ai/
MIT License

Timeout error #110

Closed click2cloud-Nagaraj closed 12 months ago

click2cloud-Nagaraj commented 1 year ago

Hello @rafaspadilha. As discussed earlier, I was able to extract the orchestrator and worker node logs for the Timeout error, and I'm attaching them here. The first occurrence of the error can appear at any point during a workflow run on a given cluster. After that, the Timeout error occurs immediately on that cluster, whether we rerun with the same configuration or run the workflow for any other farm. Restarting the cluster works as a temporary workaround.

I'm attaching the following information:

  1. terravibes-worker logs after the Timeout error.
  2. terravibes-orchestrator logs after the Timeout error.
  3. Output logs for the initial instance of Timeout error.
  4. Output logs for a later run after the Timeout error.

Timeout-error-logs.zip

Edit: While running the same workflow, we encountered 6 failures before it completed successfully. I'm attaching the log files for all of these failures in Failures_log_files.zip below.

Failures_log_files.zip
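
For anyone reproducing the pattern described in the edit above (several timeouts before a successful run), here is a minimal retry sketch. It does not use the FarmVibes.AI client API: `submit` is a hypothetical zero-argument callable that starts the workflow and raises `TimeoutError` on a timeout, and `on_failure` is a hypothetical hook where logs could be saved or the cluster restarted between attempts. How a timeout actually surfaces in a given setup is an assumption here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(
    submit: Callable[[], T],
    max_attempts: int = 7,
    delay_s: float = 0.0,
    on_failure: Optional[Callable[[int, Exception], None]] = None,
) -> T:
    """Call `submit` until it succeeds or `max_attempts` is reached.

    `submit` (hypothetical) starts the workflow and raises TimeoutError
    when the run times out. `on_failure` (hypothetical) is called after
    every failed attempt with the attempt number and the exception.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except TimeoutError as exc:
            if on_failure is not None:
                on_failure(attempt, exc)
            if attempt == max_attempts:
                # Out of attempts: surface the last timeout to the caller.
                raise
            time.sleep(delay_s)
    raise AssertionError("unreachable")
```

Since the error recurs immediately on the same cluster once it has appeared, retrying alone may not help; the `on_failure` hook is where a cluster restart (the workaround mentioned above) would go.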

rafaspadilha commented 1 year ago

Hi, @click2cloud-Nagaraj. Thank you for sharing the logs. We have seen an increase in timeouts since the last release's migration to asyncio. We have improved the stability of the cluster and have not experienced similar issues internally. The fix will be available in the next release.

rafaspadilha commented 1 year ago

Hi, @click2cloud-Nagaraj. Are you still experiencing this issue?

rafaspadilha commented 12 months ago

Closing this issue for now. @click2cloud-Nagaraj, feel free to reopen it if the problem persists.