mikeskaug / prithvi-change-detection

Using the Prithvi geospatial foundation model for change detection

Input pre-processing #1

Open mikeskaug opened 5 months ago

mikeskaug commented 5 months ago

Hi all. I'm still in the early stages of exploring the data and getting things set up to train a segmentation head on top of the pre-trained Prithvi encoder. There is still a lot to do before I'm ready to start training and evaluating, but I have a few questions about preparing the input data that you might be able to help me with.

  1. Downsampling. The xBD images are 1024x1024 3-channel (RGB) geotiffs, while the Prithvi model expects images of size 224x224. My thought was to downsample the images to 224x224 using bilinear interpolation. The result would still be higher resolution (~1 m) than the HLS data (~30 m). Do you think this is the right approach?
  2. Missing IR bands. Prithvi was trained on imagery with 6 bands, but the xBD data only has 3. What should I do about the missing bands? So far I have filled the 3 missing bands with zeros, but that seems crude. Is there a better way to handle them?
  3. Band ordering. The Prithvi paper says that the model was trained on Sentinel bands B02, B03, B04, B8A, B11 and B12, but the Prithvi config file at https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M/Prithvi_100M_config.yaml lists bands B02, B03, B04, B05, B06 and B07. Do you know which of these is correct?
  4. Normalization statistics. I need to standardize the input pixel values, ((val - mean) / std). Because this is a new data set, I assume I need to recompute the mean and std statistics. Were the values provided with the Prithvi model computed over the full training data set? (A sketch of the pipeline I have in mind follows this list.)
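To make the questions concrete, here is a minimal sketch of the preprocessing pipeline described above (bilinear downsampling, zero-filled IR bands, per-band standardization). The mean/std values are placeholders, not the real Prithvi statistics:

```python
import numpy as np
import torch
import torch.nn.functional as F

# Placeholder RGB statistics -- substitute the real values, e.g. from
# Prithvi_100M_config.yaml or recomputed over the xBD training set.
MEAN = torch.tensor([0.0, 0.0, 0.0])
STD = torch.tensor([1.0, 1.0, 1.0])

def preprocess(rgb_uint8: np.ndarray) -> torch.Tensor:
    """rgb_uint8: (1024, 1024, 3) uint8 array read from an xBD geotiff."""
    # (H, W, C) uint8 -> (1, C, H, W) float
    x = torch.from_numpy(rgb_uint8).float().permute(2, 0, 1).unsqueeze(0)

    # Q1: bilinear downsampling to the 224x224 input size Prithvi expects
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)

    # Q4: standardize each band, (val - mean) / std
    x = (x - MEAN.view(1, 3, 1, 1)) / STD.view(1, 3, 1, 1)

    # Q2: zero-fill the three missing IR bands to get 6 channels
    x = torch.cat([x, torch.zeros(1, 3, 224, 224)], dim=1)
    return x  # (1, 6, 224, 224)
```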
mikeskaug commented 5 months ago

@blumenstiel I was wondering if you have any input on these questions. I can still make progress on other things, but I want to confirm that I haven't made any bad assumptions here before getting to large-scale training.

blumenstiel commented 5 months ago
  1. You will probably get better results if you keep the high-res data. You can either change the input size of the model to 1024, since the positional embedding is just computed and you have to delete the pos embedding weights anyway before loading (this would increase your memory requirements during training), or split the data into 16 256x256 samples (see the tiling sketch after this list).
  2. I got the best results by simply dropping the IR channels; filling them with zeros often leads to worse results. Be aware that during weight loading you have to change the order of the first three channels, as Prithvi is trained in the order BGR (see the weight-loading sketch after this list).
  3. The band names are just different between S2 and Landsat. We used the Landsat names, and B05, B06 and B07 have similar wavelengths to B8A, B11 and B12 from S2, so both lists refer to essentially the same bands.
  4. I don't know which scaling would work best. By default, I would use the values from pre-training, i.e. the statistics computed over the pre-training data. That data is in reflectance, so you need to rescale the statistics to the uint8 range for the RGB data. But I am not sure whether a scaling of value / 10000 * 255 works well or whether statistics computed on the xBD values would work better. Maybe try both (a sketch of the rescaling also follows this list).
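A possible sketch of the second option from point 1, splitting each 1024x1024 image into 16 256x256 samples:

```python
import torch

def tile_1024_to_256(x: torch.Tensor) -> torch.Tensor:
    """Split a (C, 1024, 1024) image into 16 non-overlapping (C, 256, 256) tiles."""
    c = x.shape[0]
    # unfold height then width: (C, 4, 4, 256, 256) = (C, h_tiles, w_tiles, h, w)
    tiles = x.unfold(1, 256, 256).unfold(2, 256, 256)
    return tiles.permute(1, 2, 0, 3, 4).reshape(16, c, 256, 256)
```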
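For the weight-loading changes in points 1 and 2, a minimal sketch; the key names ("pos_embed", "patch_embed.proj.weight") follow the usual ViT/MAE conventions and are assumptions to verify against the actual Prithvi checkpoint:

```python
import torch

def adapt_prithvi_checkpoint(path: str) -> dict:
    """Prepare Prithvi weights for a 3-channel (RGB) model with a new input size."""
    state_dict = torch.load(path, map_location="cpu")

    # Drop positional embedding weights: they are just (re)computed for the
    # new input size, so the pre-trained ones must not be loaded.
    state_dict = {k: v for k, v in state_dict.items() if "pos_embed" not in k}

    # Prithvi's first three bands are B02/B03/B04 = Blue, Green, Red. Keep
    # only those three input channels of the patch embedding and flip them
    # to R, G, B so the model can be fed ordinary RGB input.
    key = "patch_embed.proj.weight"      # assumed key name, check the checkpoint
    w = state_dict[key]                  # shape (embed_dim, 6, ...) for 6 bands
    state_dict[key] = w[:, [2, 1, 0], ...]

    return state_dict

# Usage (path is hypothetical); strict=False tolerates the removed pos_embed:
# model.load_state_dict(adapt_prithvi_checkpoint("Prithvi_100M.pt"), strict=False)
```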
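And for point 4, a sketch of the value / 10000 * 255 idea, applied to the pre-training statistics so they can be used on uint8 RGB data; the stat values are placeholders:

```python
import numpy as np

# Placeholder reflectance-scale statistics for B02, B03, B04 -- substitute
# the real values shipped with Prithvi (see Prithvi_100M_config.yaml).
MEAN_REFLECTANCE = np.array([1000.0, 1000.0, 1000.0], dtype=np.float32)
STD_REFLECTANCE = np.array([500.0, 500.0, 500.0], dtype=np.float32)

# Rescale the stats from reflectance (roughly 0..10000) to the uint8 range
# (0..255), i.e. the value / 10000 * 255 scaling suggested above.
mean_uint8 = MEAN_REFLECTANCE / 10000 * 255
std_uint8 = STD_REFLECTANCE / 10000 * 255

def standardize(bgr_uint8: np.ndarray) -> np.ndarray:
    """bgr_uint8: (H, W, 3) uint8 xBD image in B, G, R order (to match Prithvi)."""
    return (bgr_uint8.astype(np.float32) - mean_uint8) / std_uint8
```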
mikeskaug commented 5 months ago

Interesting, thanks. I will try your suggestions.

I'm still surprised that the model can adapt to input so different from the pre-training data. I guess I need to dig into the model more, and into transformers in general!

blumenstiel commented 5 months ago

With high-res data it does not perform as well as with low-res data, compared to general vision models. But I assume it can benefit from the temporal pre-training for this use case, which is missing in other models.