Issue with SpaceEye Pipeline Stalling on Rerun

microsoft / farmvibes-ai

FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Sustainability

https://microsoft.github.io/farmvibes-ai/

MIT License


Issue with SpaceEye Pipeline Stalling on Rerun #163

Closed click2cloud-sagarB closed 6 months ago

click2cloud-sagarB commented 7 months ago

In which step did you encounter the bug?

Workflow execution

Are you using a local or a remote (AKS) FarmVibes.AI cluster?

Local cluster

Bug description

Dear FarmVibes Team, in the most recent release of FarmVibes, we've noticed that the SpaceEye pipeline stalls on rerun at the tasks "spaceeye.preprocess.cloud.cloud" and "spaceeye.preprocess.cloud.shadow", and all subsequent jobs in the SpaceEye pipeline are stuck as well. Please let me know if you need anything more from my end.

Steps to reproduce the problem

No response

rafaspadilha commented 7 months ago

Hi, @click2cloud-sagarB. Thank you for raising the issue. A few questions and requests:

When you say "stalls", what is the status of the tasks? Are both cloud and shadow tasks marked as queued? Could you provide a screenshot of the client.monitor() table?
Is this happening if you run the workflow on a new region/time range that is not cached?
Are you passing the Planetary Computer Key as a parameter to the workflow? We recently changed that and the PC key is now a required parameter.
Could you provide the logs for the orchestrator, cache, and workers so we can investigate what might be happening? They are located in the logs folder of your storage (e.g., ~/.cache/farmvibes-ai/logs).
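
For anyone collecting logs for a report like this, the folder mentioned above can be bundled into a single archive. The following is a minimal sketch using only the Python standard library. It assumes the logs live under ~/.cache/farmvibes-ai/logs as in the example path above, and makes no assumption about individual file names. The helper name bundle_logs is hypothetical and not part of FarmVibes.AI.

```python
import zipfile
from pathlib import Path
from typing import List


def bundle_logs(log_dir: Path, archive: Path) -> List[str]:
    """Zip every regular file under log_dir and return the archived names.

    Hypothetical helper for sharing logs; not part of FarmVibes.AI.
    """
    log_dir = log_dir.expanduser()
    archive = archive.expanduser()
    names: List[str] = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(log_dir.rglob("*")):
            # Skip directories and the archive itself if it sits inside log_dir.
            if path.is_file() and path.resolve() != archive.resolve():
                arcname = path.relative_to(log_dir).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

A call such as bundle_logs(Path("~/.cache/farmvibes-ai/logs"), Path("logs.zip")) would write logs.zip to the current directory. If the storage was set up elsewhere, the first argument needs to point at that location instead.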

click2cloud-sagarB commented 7 months ago

Hi @rafaspadilha, Below are the answers to your queries.

The cloud and shadow tasks are in the running state, while all subsequent jobs are pending. I have shared the log file and a screenshot of client.monitor().
It does not occur when we run the workflow on a new region/time range that is not cached. When we execute the workflow for the first time in a new region or time period, it completes successfully.
No, we do not pass the Planetary Computer Key as a parameter to the workflow data_ingestion/spaceeye/spaceeye_interpolation. But I'm not sure whether it is mandatory, because if it were, I doubt the workflow would have completed smoothly the first time around.
I have attached the log files for your reference: logs.zip

rafaspadilha commented 6 months ago

Hey, @click2cloud-sagarB. Looking through the logs, it seems like there was an error during communication between the orchestrator and cache pods.

Could you please try deleting the cache pod and re-running the workflows?

You can do that with:

$ ~/.config/farmvibes-ai/kubectl delete pods -l app=terravibes-cache

Let me know if this solves the issue.
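
If the cache pod has to be deleted more than once while debugging, the command above can be wrapped in a small script. This is only a sketch: the kubectl path and the app=terravibes-cache label are taken from the command above, while the function names are hypothetical. The follow-up "get pods" call is a standard kubectl command, added here to check that a replacement pod shows up.

```python
import subprocess
from pathlib import Path
from typing import List

# Path and label selector copied from the command quoted in the thread.
KUBECTL = Path("~/.config/farmvibes-ai/kubectl").expanduser()
CACHE_SELECTOR = "app=terravibes-cache"


def delete_cache_pod_cmd() -> List[str]:
    """Command that deletes the cache pod(s) matching the label selector."""
    return [str(KUBECTL), "delete", "pods", "-l", CACHE_SELECTOR]


def list_cache_pod_cmd() -> List[str]:
    """Command that lists the cache pod(s), to confirm a new one was created."""
    return [str(KUBECTL), "get", "pods", "-l", CACHE_SELECTOR]


def restart_cache_pod() -> None:
    """Delete the cache pod, then list pods with the same label (hypothetical helper)."""
    subprocess.run(delete_cache_pod_cmd(), check=True)
    subprocess.run(list_cache_pod_cmd(), check=True)
```

Calling restart_cache_pod() requires a running local cluster. The two command builders can be inspected on their own without one.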

click2cloud-sagarB commented 6 months ago

Hi @rafaspadilha, deleting the cache pod temporarily solves the issue for one rerun, but it recurs on the next rerun.

rafaspadilha commented 6 months ago

As we discussed offline, the logs that you shared have several runs. Please, @click2cloud-sagarB, could you recreate your cluster, delete the logs, reproduce this error and share the new set of logs with us again?

click2cloud-sagarB commented 6 months ago

Hi @rafaspadilha, as requested, I recreated the updated cluster and reran the workflow, but the workflow remains queued for hours. I am sharing the logs and a screenshot. Time range given: 2 months (datetime(2023, 11, 1), datetime(2023, 12, 31)). The polygon is provided in Andrew.txt. Attachments: logs.zip, Andrew.txt
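
For readers trying to reproduce a run like this one, the sketch below assembles the arguments from the details given in the thread: the workflow name, the two-month time range, and the Planetary Computer key that was described as a required parameter. Several things are assumptions rather than confirmed API: the parameter name pc_key, the keyword names accepted by client.run, the use of run.monitor(), and the run name. The geometry would be a shapely geometry loaded from the reporter's polygon file, whose contents are not shown here.

```python
from datetime import datetime
from typing import Any, Dict

# Workflow name and time range as stated in the thread.
WORKFLOW = "data_ingestion/spaceeye/spaceeye_interpolation"
TIME_RANGE = (datetime(2023, 11, 1), datetime(2023, 12, 31))


def build_run_kwargs(geometry: Any, pc_key: str) -> Dict[str, Any]:
    """Collect the arguments for a SpaceEye run (hypothetical helper)."""
    return {
        "workflow": WORKFLOW,
        "name": "spaceeye-rerun",  # hypothetical run name
        "geometry": geometry,  # assumed to be a shapely geometry
        "time_range": TIME_RANGE,
        "parameters": {"pc_key": pc_key},  # assumed parameter name
    }


def submit(geometry: Any, pc_key: str) -> Any:
    """Submit the run and show the monitor table (needs a running cluster)."""
    # Imported lazily so build_run_kwargs works without FarmVibes.AI installed.
    from vibe_core.client import get_default_vibe_client

    client = get_default_vibe_client()
    run = client.run(**build_run_kwargs(geometry, pc_key))
    run.monitor()  # assumed to render the same table as client.monitor()
    return run
```

Submitting the same geometry and time range twice exercises the cached path, which is where the stall was reported.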

rafaspadilha commented 6 months ago

@click2cloud-sagarB we have a new release of FarmVibes. We have a bugfix for the issue you were seeing. Please, when you have some time, could you update your cluster and see if the problem is fixed?

Feel free to reopen this issue if the problem persists.