holoviz-topics / EarthML

Tools for working with machine learning in earth science
https://earthml.holoviz.org
BSD 3-Clause "New" or "Revised" License

Carbon flux: use latitude, longitude and day of year in predictions? #10

Closed stsievert closed 3 years ago

stsievert commented 6 years ago

I'd like a better explanation of the motivation, and some domain knowledge to know what variables to exclude.

stsievert commented 6 years ago

I see latitude + day of year as being most important because those variables determine how much sunlight is received.
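To make the sunlight argument concrete, here is a minimal sketch of the standard day-length approximation, which depends only on latitude and day of year. It uses a simple cosine model for solar declination and ignores refraction, elevation and cloud cover; the function name `day_length_hours` is my own for illustration and is not part of EarthML.

```python
import math

def day_length_hours(lat_deg, doy):
    """Approximate hours of daylight at a latitude on a given day of year."""
    # Approximate solar declination in degrees (simple cosine model).
    decl_deg = -23.44 * math.cos(2 * math.pi * (doy + 10) / 365)
    # Cosine of the sunset hour angle from the sunrise equation.
    x = -math.tan(math.radians(lat_deg)) * math.tan(math.radians(decl_deg))
    # Clamp so polar day (x < -1) and polar night (x > 1) give 24 h and 0 h.
    x = max(-1.0, min(1.0, x))
    return 24 / math.pi * math.acos(x)

if __name__ == "__main__":
    for lat in (0, 45, 80):
        print(lat, round(day_length_hours(lat, 172), 2), round(day_length_hours(lat, 355), 2))
```

Longitude does not appear anywhere in this calculation, which is one reason to treat it differently from latitude and DOY.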

Here are some experiments on the clustering algorithm when using latitude, longitude and day of year (DOY) (as well as the measured variable, carbon flux).

|             | Without lat/lon | With lat/lon |
|-------------|-----------------|--------------|
| Without DOY | nodoy-nolatlon  | nodoy-latlon |
| With DOY    | doy-nolatlon    | doy-latlon   |

The sites are colored by vegetation type (see the legend).

Naturally, the clustering shows more detail when it receives more input variables. I think this will help with prediction performance. I think longitude should be left out: I can see it being a confounding variable, even though the clustering picks up on it.
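For reference, the four experiment configurations can be enumerated as below. This is only a sketch of how the feature lists might be built: the column names (`carbon_flux`, `doy`, `lat`, `lon`) are assumed for illustration and may not match the actual dataset, and the clustering step itself is not shown.

```python
from itertools import product

# Hypothetical column names; the measured variable is always included.
BASE = ["carbon_flux"]

def feature_sets():
    """Return the column lists for the with/without DOY x lat/lon grid."""
    configs = {}
    for use_doy, use_latlon in product([False, True], repeat=2):
        cols = list(BASE)
        if use_doy:
            cols.append("doy")
        if use_latlon:
            cols += ["lat", "lon"]
        # Name each configuration like the plots: e.g. "doy-nolatlon".
        name = ("doy" if use_doy else "nodoy") + "-" + ("latlon" if use_latlon else "nolatlon")
        configs[name] = cols
    return configs

if __name__ == "__main__":
    for name, cols in feature_sets().items():
        print(name, cols)
```

Dropping longitude while keeping latitude would mean appending only `"lat"` in the second branch.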

jbednar commented 6 years ago

It's hard to see how the longitude will help anything, so I'd vote for not including it.

ebo commented 6 years ago

In many places there are clear environmental gradients in either latitude or longitude. In this specific case, I am not sure whether it would make a difference, or whether you are working on a general tool.

jbednar commented 6 years ago

Right; over a relatively small region of the globe, lon would be a perfectly reasonable feature to include. But for a global model, it just seems more confusing than helpful, likely to lead to overfitting and poor generalization, unless recoded as something like "distance from the nearest coastline" or something else more meaningful at a global scale.
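As a rough illustration of that kind of recoding, here is a sketch that replaces raw lat/lon with the great-circle distance to the nearest of a set of coastline points, using the haversine formula. The helper names and the `coast_points` values are placeholders I made up, not real coastline data; a real version would need an actual coastline dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # Guard against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def distance_to_coast_km(lat, lon, coast_points):
    """Distance from a site to the nearest (lat, lon) point in coast_points."""
    return min(haversine_km(lat, lon, clat, clon) for clat, clon in coast_points)

if __name__ == "__main__":
    # Placeholder coordinates for illustration only, not real coastline data.
    coast_points = [(0.0, 1.0), (10.0, 20.0)]
    print(distance_to_coast_km(0.0, 0.0, coast_points))
```

The resulting feature has the same meaning anywhere on the globe, unlike raw longitude.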

ebo commented 6 years ago

Agreed. Lon is more meaningful regionally (like the west coast of the Americas, etc.). I was not sure if this was a specific case or if the code was intended to expose/demonstrate API functionality. Sorry if I threw up a red herring.

jbednar commented 6 years ago

Any and all advice welcome!!!