mintproject / MINT-Transformation


Group datasets for TopoFlow in Ethiopia (GLDAS and GPM) #41

Open · dgarijo opened 4 years ago

dgarijo commented 4 years ago

When the TopoFlow data was transformed, it was grouped by year. This is problematic if you want to run simulations that span a year boundary, e.g., from mid-December to mid-January.

Could you please run your transformation pipeline to consolidate the 10 years into a single file? This applies to both the GPM and GLDAS data, and to both the 30sec and 60sec resolutions.

We would need these registered in the data catalog as individual datasets (each with only 1 resource in this case).

sumwmer commented 4 years ago

Hi @dgarijo, I'm currently working on your issue. Can you provide me with more information on the climate.rts file (contained in here), such as the bounding box size, so I can sanity-check before I combine everything? My current understanding is that the .rts file is a long 1-D numpy array ordered by timestamp, so straightforward concatenation should suffice. We already have the scripts to do this; I just want to make sure I'm doing the right thing before I register on the data catalog :)
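
In case it's useful, this is roughly what I mean by straightforward concatenation, as a minimal sketch. It assumes each climate.rts is a headerless stack of float32 grids ordered by timestamp (with the grid metadata in the companion .rti file); the file names are hypothetical.

```python
import numpy as np

# Hypothetical per-year files; actual names depend on the transformation output.
yearly_files = [f"climate_{year}.rts" for year in range(2008, 2019)]

# Read each raw binary file as a flat float32 array (assumed RTS layout).
arrays = [np.fromfile(path, dtype=np.float32) for path in yearly_files]

# Time is the outermost dimension, so plain concatenation preserves order.
np.concatenate(arrays).tofile("climate_2008_2018.rts")
```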

dgarijo commented 4 years ago

@summer7xinting, unfortunately I didn't do the original conversion, so I don't know the details. @binh-vu and @minhptx were the ones who did the conversion from GLDAS and GPM to the different months, and they have all the details about it. Please ask them :)

dgarijo commented 4 years ago

Any updates on this?

dgarijo commented 4 years ago

When doing this, would it be possible to publish the 30sec and 60sec versions of the data on the data catalog separately? Thanks

sumwmer commented 4 years ago

Hi @dgarijo, sorry about the late update. I'm still working on your issue, but the files are really big (two regions took ~250 GB on my local machine), so I can only do two regions at a time instead of automating the whole thing. I will finish as soon as I can and let you know 😔

dgarijo commented 4 years ago

Thanks for the heads up. Just to be sure, each region has to be registered separately as well. If my estimate is correct, the final uncompressed climate.rts data should be around 10–12 GB per region, since the per-year aggregate is around 1.2 GB and we are combining 10 years.

sumwmer commented 4 years ago

Hi @dgarijo, here are the uploaded alwero and awash parts:

alwero_gldas_2008_2018_30 - 12087202-7dae-4b80-be17-d2bab77d1720
alwero_gldas_2008_2018_60 - 94c77f12-d8a4-4423-accd-e31ecad2304a
alwero_gpm_2008_2018_30 - c83f82ab-e97f-4cfa-bf71-c56c62ffa7a5
alwero_gpm_2008_2018_60 - ec1510ad-ee2c-4e7e-b404-15717a38919f
awash_gldas_2008_2018_30 - ecf5c924-13f3-43a6-b667-e6534fab2b58
awash_gldas_2008_2018_60 - 32170c4e-b6ae-4358-9ad8-34dcbe9adf0c
awash_gpm_2008_2018_30 - 41d23e72-b9de-4bf6-8cf5-ef9c86c8314a
awash_gpm_2008_2018_60 - c1382cf2-cf7d-4af5-ad69-e03924553f16

Can you sanity-check any one of them and let me know if this is what you want? I'll proceed with the other regions in the meantime :)
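
In case it helps with the sanity check, this is roughly what I run on my end before uploading, assuming float32 values and that the grid dimensions come from the companion .rti file (the numbers below are placeholders):

```python
import os
import numpy as np

ncols, nrows = 250, 200             # placeholder: parse these from the .rti file
bytes_per_grid = ncols * nrows * 4  # float32

size = os.path.getsize("climate_2008_2018.rts")
assert size % bytes_per_grid == 0, "file size is not a whole number of grids"
print(size // bytes_per_grid, "time steps")  # should match the summed yearly counts

# Spot-check the first grid for plausible values (e.g., nonnegative rainfall).
first_grid = np.fromfile("climate_2008_2018.rts", dtype=np.float32,
                         count=ncols * nrows)
print(first_grid.min(), first_grid.max())
```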

sumwmer commented 4 years ago

baro_gldas_2008_2018_30 - f14e9d03-2205-41e9-b685-0d9e6d064e6c
baro_gldas_2008_2018_60 - d8bc4c9e-521a-4aea-823b-7405c1ca63a8
baro_gpm_2008_2018_30 - 5e5ec399-2fea-4df4-849c-5bf0ddf7f846
baro_gpm_2008_2018_60 - ae836492-d70f-4427-854f-4e9a2b731902
beko_gldas_2008_2018_30 - ed6cdf85-f952-443c-928a-4a6e5b9816ee
beko_gldas_2008_2018_60 - d4652be8-cb68-4591-bd56-93db4caa0f81
beko_gpm_2008_2018_30 - ddf51a9e-95d5-4a3b-85f9-b71b6dc2b89e
beko_gpm_2008_2018_60 - 11603147-124d-484b-9d28-21ec54ab904d

sumwmer commented 4 years ago

ganale_gldas_2008_2018_30 - 1a33ef2a-5988-4772-8f13-8b58bed8ba0a
ganale_gldas_2008_2018_60 - a9750c1b-b8a4-4f5b-b105-ff9d6465c31e
ganale_gpm_2008_2018_30 - 59ecd800-ef6a-4e1e-8eab-1e0f7f333a90
ganale_gpm_2008_2018_60 - 446c986e-0fe7-4bdd-959f-6dbcd02e2567
guder_gldas_2008_2018_30 - 6b5e794a-52b0-4c7a-9367-c60b05171d91
guder_gldas_2008_2018_60 - 47bf6cb1-4125-482e-b7b2-3b37ce86cd91
guder_gpm_2008_2018_30 - e9898519-83e5-4e32-9d29-5425f82444cf
guder_gpm_2008_2018_60 - a5fde95a-ce4c-45a7-ab6d-3259a4907d22

sumwmer commented 4 years ago

muger_gldas_2008_2018_30 - fef53dbd-36d1-4816-9d5e-489115a6e580
muger_gldas_2008_2018_60 - b7f0b21d-0730-4570-beda-91ca5c06ac41
muger_gpm_2008_2018_30 - 05c43c58-ed42-4830-9b1f-f01059c4b96f
muger_gpm_2008_2018_60 - 5c200655-c011-4038-8667-c3a07e7f8fcb

sumwmer commented 4 years ago

Shebelle data keeps failing on my end. 😔 I'll try again later today.

dgarijo commented 4 years ago

@summer7xinting thanks. Any idea why the data catalog does not return these datasets in MINT? Are the variables the same as in https://dev.mint.isi.edu/ethiopia/datasets/browse/5cdfcc63-a9ba-46b4-9bee-fd16ee877936/ethiopia? Could it be because you registered them as datasets with 0 resources, instead of datasets with 1 resource each?

sumwmer commented 4 years ago

shebelle_gldas_2008_2018_30 - 591c5eff-81f2-4c42-89b3-2b30780233aa
shebelle_gldas_2008_2018_60 - c7ffec7a-639e-463c-b639-f03a0ca0aa29

Hi @dgarijo, the Shebelle GPM data keeps failing on my machine, probably due to lack of RAM: each climate.rts is ~20 GB, so aggregating them via numpy exhausts my local resources very quickly. I hope someone can take up the task from here 😔
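
For whoever takes over: one memory-friendlier option than my numpy approach would be to stream the bytes instead of loading full arrays, since (assuming the RTS layout above) the yearly files can be appended byte-for-byte in time order. The paths here are hypothetical.

```python
import shutil

# Hypothetical paths; the real per-year locations depend on the pipeline output.
yearly_files = [f"shebelle_gpm_{year}/climate.rts" for year in range(2008, 2019)]

with open("shebelle_gpm_2008_2018_climate.rts", "wb") as out:
    for path in yearly_files:
        with open(path, "rb") as src:
            # Copy in 64 MB chunks so peak memory stays small.
            shutil.copyfileobj(src, out, length=64 * 1024 * 1024)
```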

sumwmer commented 4 years ago

Hi @dgarijo, I just saw your message. If you check the datasets via the data catalog (for instance https://data-catalog.mint.isi.edu/datasets/05c43c58-ed42-4830-9b1f-f01059c4b96f), each one does contain one resource. Also, I don't have access to dev.mint.isi.edu, so I'm not exactly sure what the issue is. I used the old variable registered in here.

dgarijo commented 4 years ago

@summer7xinting thanks! Dan fixed it (before, there were 0 resources for some reason). About the Shebelle transformation: we should not be running this on your machine; it should be run on a server. If the data catalog transformation pipeline is just a matter of defining an input and an output format spec, I would expect us to have an API to actually run it on our machines. Food for thought for when we have to run more transformations. Let's keep this issue open for now.

minhptx commented 4 years ago

@dgarijo We have a public transformation server running on vm1.mint.isi.edu, but its RAM (16 GB) is still not enough for this transformation. We are working on integrating the transformation system to submit jobs to the Pegasus backend; everything should work after the integration.

dgarijo commented 4 years ago

Sounds good. Hopefully we will be able to use it without having to go through GitHub to open a ticket :)