microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability
https://microsoft.github.io/farmvibes-ai/
MIT License
680 stars 118 forks

Issue with SpaceEye Pipeline Stalling on Rerun #163

Closed click2cloud-sagarB closed 4 months ago

click2cloud-sagarB commented 5 months ago

In which step did you encounter the bug?

Workflow execution

Are you using a local or a remote (AKS) FarmVibes.AI cluster?

Local cluster

Bug description

Dear FarmVibes team, in the most recent release of FarmVibes, we've noticed that the SpaceEye pipeline stalls on rerun at the tasks "spaceeye.preprocess.cloud.cloud" and "spaceeye.preprocess.cloud.shadow", and all subsequent jobs in the SpaceEye pipeline are stuck as well. Please let me know if you need anything more from my end.

Steps to reproduce the problem

No response

rafaspadilha commented 5 months ago

Hi, @click2cloud-sagarB. Thank you for raising the issue. A few doubts and requests:

click2cloud-sagarB commented 5 months ago

Hi @rafaspadilha, Below are the answers to your queries.


rafaspadilha commented 5 months ago

Hey, @click2cloud-sagarB. Looking through the logs, it seems like there was an error during communication between the orchestrator and cache pods.

Could you please try deleting the cache pod and re-running the workflows?

You can do that with:

$ ~/.config/farmvibes-ai/kubectl delete pods -l app=terravibes-cache

Let me know if this solves the issue.
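For repeated attempts, the command above can be wrapped in a small helper. This is only a sketch: the function name `restart_cache_pod` and the `KUBECTL` variable are hypothetical, and it assumes the cache pod is recreated automatically by the cluster after deletion, which the thread implies but does not state. The follow-up `get pods` call simply lists pods with the same label so the replacement pod's status can be checked before rerunning the workflow.

```shell
#!/usr/bin/env bash
# Path to the kubectl binary used in the comment above.
# KUBECTL is a hypothetical override for setups where kubectl lives elsewhere.
KUBECTL="${KUBECTL:-$HOME/.config/farmvibes-ai/kubectl}"

# Hypothetical helper: delete the cache pod, then list pods with the same
# label so the state of the replacement pod is visible.
restart_cache_pod() {
  "$KUBECTL" delete pods -l app=terravibes-cache &&
    "$KUBECTL" get pods -l app=terravibes-cache
}
```

Calling `restart_cache_pod` and waiting for the listed pod to report `Running` before resubmitting the workflow would mirror the manual steps suggested here.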

click2cloud-sagarB commented 5 months ago

Hi @rafaspadilha, deleting the cache pod temporarily solves the issue for one rerun, but it recurs on the next rerun.

rafaspadilha commented 5 months ago

As we discussed offline, the logs that you shared have several runs. Please, @click2cloud-sagarB, could you recreate your cluster, delete the logs, reproduce this error and share the new set of logs with us again?

click2cloud-sagarB commented 5 months ago

Hi @rafaspadilha, as requested, I recreated the updated cluster and reran the workflow, but the workflow remains queued for hours. I am sharing the logs and screenshots for the same. Time range given: 2 months (datetime(2023, 11, 1), datetime(2023, 12, 31)). Polygon provided in Andrew.txt. logs.zip Andrew.txt


rafaspadilha commented 4 months ago

@click2cloud-sagarB we have a new release of FarmVibes. We have a bugfix for the issue you were seeing. Please, when you have some time, could you update your cluster and see if the problem is fixed?

Feel free to reopen this issue if the problem persists.