nasaharvest / presto

Lightweight, Pre-trained Transformers for Remote Sensing Timeseries
https://arxiv.org/abs/2304.14065
MIT License

Confusing use of dynamic and static inputs #17

Open kvantricht opened 11 months ago

kvantricht commented 11 months ago

As per the paper, topography and location are static variables without a temporal dimension:

[screenshot from the paper listing topography and location as static variables]

However, in the encoder itself, topography is part of `x`, while `latlons` is a separate input: https://github.com/nasaharvest/presto/blob/main/presto/presto.py#L390:L398

In practice this means the user needs to duplicate altitude/slope along the temporal dimension, e.g. in order to feed them to `construct_single_presto_input`, which is rather confusing. What is the reasoning behind this?
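For concreteness, a minimal sketch of the duplication (the keyword names for `construct_single_presto_input` are assumptions here, for illustration only):

```python
import torch

num_timesteps = 12
elevation, slope = 480.0, 3.2  # static SRTM-derived values for one location

# SRTM has no temporal dimension, but x does, so the values must be tiled
# along time before they can be passed in:
srtm = torch.tensor([[elevation, slope]]).repeat(num_timesteps, 1)  # (12, 2)

# hypothetical call, with assumed keyword names:
# x, mask, dynamic_world = construct_single_presto_input(
#     srtm=srtm, srtm_bands=["elevation", "slope"], ...
# )
```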

gabrieltseng commented 11 months ago

Hi @kvantricht,

Yes - I agree this is a confusing part of the codebase (apologies). The motivation is that the latlon token never gets masked, while the SRTM token does - I wanted to group all the continuous values which can get masked into `x`.

The trade-off is that the SRTM values must be duplicated; if another static variable were added, it would make sense to split this into dynamic and static inputs. If you feel this would improve the clarity of the model, I'd be happy to give it a shot now.
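A minimal sketch of that grouping, with assumed shapes (not the actual Presto code):

```python
import torch

batch, timesteps, channels = 4, 12, 17  # channels include the tiled SRTM bands
x = torch.randn(batch, timesteps, channels)
latlons = torch.rand(batch, 2)

# The mask is sampled only over x, so the (tiled) SRTM values can be hidden
# like any other continuous channel, while latlons stay outside the mask:
mask = (torch.rand(batch, timesteps, channels) < 0.75).float()
x_visible = x * (1 - mask)  # what the encoder sees of x
# latlons are embedded as their own, never-masked token
```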

kvantricht commented 11 months ago

@gabrieltseng as we discussed, it could be useful at some point to try out mono-temporal land cover as a static input, or any other scalar value (thinking e.g. of soil properties). All of these inputs should work by likewise replicating them along the temporal dimension so they fit `x`, but I would assume that splitting dynamic and static inputs in the model, so that the user does not have to include a (matching) temporal dimension for static inputs, would be less confusing.

By the way, as for latlons: does this mean that location information is compulsory, and that we wouldn't be able to compute embeddings for location-unaware inputs?

gabrieltseng commented 11 months ago

Yes; I agree. If we add another static-in-time input, I will rewrite the inputs to be `x_dynamic` and `x_static`. I'll leave this issue open until then - one reason I haven't done this yet is that it's a bit quicker (in terms of training time) to mask the SRTM values than to split them from the dynamic data.
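To sketch what such a split could look like (hypothetical signature and dimensions, not a commitment to this exact shape):

```python
import torch
import torch.nn as nn

class SplitInputEncoder(nn.Module):
    """Hypothetical sketch of the proposed split; names and dims are placeholders."""

    def __init__(self, dynamic_channels: int, static_channels: int, dim: int = 128):
        super().__init__()
        self.dynamic_embed = nn.Linear(dynamic_channels, dim)
        self.static_embed = nn.Linear(static_channels, dim)

    def forward(self, x_dynamic: torch.Tensor, x_static: torch.Tensor) -> torch.Tensor:
        # x_dynamic: (batch, timesteps, dynamic_channels), e.g. S1 / S2 / ERA5
        # x_static:  (batch, static_channels), e.g. SRTM elevation and slope,
        #            with no tiled temporal dimension required from the user
        dynamic_tokens = self.dynamic_embed(x_dynamic)           # (B, T, dim)
        static_token = self.static_embed(x_static).unsqueeze(1)  # (B, 1, dim)
        return torch.cat([dynamic_tokens, static_token], dim=1)  # (B, T+1, dim)
```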

And yes, the model doesn't natively handle inputs with missing location information. Do you have a use case where you would want to pass location-unaware inputs to the model?

kvantricht commented 11 months ago

I understand that training time is an important factor to take into account! So definitely, if that is a strong argument for keeping them together, it could make sense. Just for my understanding: isn't this something you could still do under the hood, even if from the user's perspective the inputs were separated?

As for the location-unaware inputs: I'm mostly thinking of situations where data privacy could mean training data is shared with the inputs but stripped of location information. It might be a borderline case, though.

On the other hand: in your experience, do you think the location impact on the embeddings is of such a nature that geographical bias in the training data could result in classification artefacts in downstream tasks? I could imagine a case where you want to train a classifier on Presto embeddings that are entirely free of location awareness. But maybe that is also too much of a borderline case.

gabrieltseng commented 11 months ago

> isn't this something you could still do under the hood

Yes, in principle. However, since SRTM is exported alongside all the other data, doing it cleanly means we have to separate it out - this is what incurs the computational cost. So the penalty applies only to SRTM, and it's an artifact of how we export the data. This is why the introduction of any new dataset would definitely justify the rewrite.
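Concretely, because SRTM arrives interleaved with the dynamic bands, a clean under-the-hood split would mean slicing it out of every batch (the band indices below are hypothetical):

```python
import torch

SRTM_IDX = [14, 15]  # hypothetical positions of elevation/slope within x
DYNAMIC_IDX = [i for i in range(17) if i not in SRTM_IDX]

x = torch.randn(4, 12, 17)  # (batch, timesteps, channels), as exported
x_static = x[:, 0, SRTM_IDX]      # static values repeat, so one timestep suffices
x_dynamic = x[:, :, DYNAMIC_IDX]  # the remaining time-varying channels
```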

> do you think the location impact on the embeddings is of such a nature that geographical bias in the training data could result in classification artefacts in downstream tasks?

We did a few (not thorough) experiments removing the latlon token and saw a performance decrease. We didn't pursue it at the time because we couldn't see a scenario where the latlon token would be unused during training - the privacy concern is fair, but I haven't yet encountered it (especially when using S2-scale data). If this is a concern, we could definitely investigate it more, though.

kvantricht commented 11 months ago

> We did a few (not thorough) experiments removing the latlon token and saw a performance decrease. We didn't pursue it at the time because we couldn't see a scenario where the latlon token would be unused during training

I wanted to get back to this, as we were discussing it internally (mostly triggered by this new preprint). Suppose that in a crop mapping task you have training data on two particular crops in one country, but on only one of those crops in another country, and let's assume these crops look spectrally identical in both countries. Your downstream classifier would likely learn a strong predictive association between location and the country that lacks training data on one of the crops, and would therefore never predict the missing crop there, even though it actually occurs. Isn't this a case where the mandatory location information in Presto could do harm, and where you might benefit from location-unaware embeddings? What's your view on that?