The story / Current behavior
In the last step of weather-mv pipeline, the ingestion transform, each worker is ingesting its particular feature-collection. Our logic ensures that there is a room for a feature collection ingestion task in the earth engine task list & for that, each worker is checking if the task list has a room for its particular feature collection, and it is checking continuously, at a certain interval. This is the point, where we can optimize.
Suggested changes
In the ingestion transform, when it's feature collections, use flatMap to list all the feature collections coming from the above transform. Then pass this list as a single item to a ParDo function (meaning: run just a single worker) and this worker is now responsible to ingest all the feature collections. It does the same thing that current logic does. The benefit is, instead of all the workers checking for a room in the EE task list, only one worker is checking. Additionally, we can make the waiting time even more, and when it finds room, it creates tasks at once. For instance, the worker checks at one minute, and if it finds 5 vacancy, it creates 5 tasks at once, now that it has access to the entire feature collection list.
Benefits
Effectively, less load on the earthengine, less usage of EE query quota.
Less workers running (or waiting) at a time, which is the main benefit, it will save huge memory power, better for the environment :) i.e. NEXRAD pipeline ran for 3 days with constant 2TiB+ memory consumption, which could go down to 8 GiBs.
The story / Current behavior In the last step of weather-mv pipeline, the ingestion transform, each worker is ingesting its particular feature-collection. Our logic ensures that there is a room for a feature collection ingestion task in the earth engine task list & for that, each worker is checking if the task list has a room for its particular feature collection, and it is checking continuously, at a certain interval. This is the point, where we can optimize.
Suggested changes In the ingestion transform, when it's feature collections, use flatMap to list all the feature collections coming from the above transform. Then pass this list as a single item to a ParDo function (meaning: run just a single worker) and this worker is now responsible to ingest all the feature collections. It does the same thing that current logic does. The benefit is, instead of all the workers checking for a room in the EE task list, only one worker is checking. Additionally, we can make the waiting time even more, and when it finds room, it creates tasks at once. For instance, the worker checks at one minute, and if it finds 5 vacancy, it creates 5 tasks at once, now that it has access to the entire feature collection list.
Benefits