nasa-nccs-hpda / vhr-cloudmask

VHR Cloud Masking
https://nasa-nccs-hpda.github.io/vhr-cloudmask/
Apache License 2.0

Scope resources needed to process the entire archive of NGA data #65

Closed · jordancaraballo closed this 8 months ago

jordancaraballo commented 9 months ago

Scope resources needed to process the entire archive of NGA data. Next meeting is on February 2.

Open questions: How long would it take to run this on-premises? Can we run from the raw NITFs? How much labeling is needed? What benchmark should we use for testing?

jordancaraballo commented 8 months ago

I still need to query the PostgreSQL database, but from their .gdb file:

Filtering the NGA database yields 4,064,834 multispectral entries, some of which are marked simply "Ingest duplicate - deleted" (506,665 appear to be duplicates):

```python
gdf_mul[gdf_mul['archive_path'] == 'Ingest duplicate - deleted'].shape
# (506665, 49)
```
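For reference, a minimal sketch of the kind of query behind the snippet above, assuming the .gdb is read with geopandas; the file path and layer name are placeholders, and only the `archive_path` column is taken from the output shown here:

```python
import geopandas as gpd

# Hypothetical path and layer name for the NGA footprint geodatabase.
gdf_mul = gpd.read_file("nga_inventory.gdb", layer="nga_footprint")

# Split out the ingest duplicates flagged in the archive_path column
# (the multispectral filter applied upstream is omitted here).
duplicates = gdf_mul[gdf_mul['archive_path'] == 'Ingest duplicate - deleted']
candidates = gdf_mul[gdf_mul['archive_path'] != 'Ingest duplicate - deleted']

print(f"duplicates: {duplicates.shape}, remaining: {candidates.shape}")
```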

Assuming our current upper bound of 4,064,834 samples (the true count should be lower given the deleted duplicates) and a conservative 4 minutes per strip (based on one of the largest scenes I saw in the quick database search):

- Total processing: 4,064,834 strips x 4 minutes = 16,259,336 GPU-minutes, or roughly 271,000 GPU-hours.
- Using 10 nodes with 4 V100 GPUs each (40 GPUs), that drops to about 6,775 wall-clock hours, or at most 282 days.
- While benchmarking some of the NITF runs I did see files processed in under 30 seconds, so 282 days is definitely an upper limit.
- If we can get the Discover A100s (again assuming 10 nodes), the worst prediction time was 2 minutes per strip, giving about 141 days total (again on the high side, since some scenes finished on the A100s in under 20 seconds).

Processing looks good; writing the COGs to disk takes a bit of CPU time that we may not be able to shorten any further.
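To make the back-of-envelope math above easy to re-run under different assumptions, a small sketch; the strip count, per-strip times, and node configuration are the figures quoted in this thread, not independent measurements:

```python
# Back-of-envelope scoping for the NGA archive, using the figures above.
# All inputs are assumptions from this thread, not new benchmarks.

TOTAL_STRIPS = 4_064_834   # upper bound; deleted duplicates would lower this
NODES = 10
GPUS_PER_NODE = 4

def scope(minutes_per_strip: float) -> None:
    """Print GPU-hours and wall-clock time for a given per-strip cost."""
    gpu_minutes = TOTAL_STRIPS * minutes_per_strip
    gpu_hours = gpu_minutes / 60
    wall_hours = gpu_hours / (NODES * GPUS_PER_NODE)
    wall_days = wall_hours / 24
    print(f"{minutes_per_strip} min/strip -> {gpu_hours:,.0f} GPU-hours, "
          f"{wall_hours:,.0f} wall-clock hours, "
          f"~{wall_days:,.0f} days on {NODES * GPUS_PER_NODE} GPUs")

scope(4.0)  # conservative V100 estimate -> ~271,000 GPU-hours, ~282 days
scope(2.0)  # worst-case A100 estimate   -> ~141 days
```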