microsoft / aurora

Implementation of the Aurora model for atmospheric forecasting
https://microsoft.github.io/aurora

Inquiry About Aurora Model Details #47

Open qqydss opened 4 weeks ago

qqydss commented 4 weeks ago

Introduction: I have been following the work on Microsoft's weather model Aurora and have carefully read through the paper and code. I am writing to seek clarification on some details of the experimental setup and model architecture. I would greatly appreciate your insights on the following questions:

  1. When using dataset configuration C4 for pretraining, if the inputs come from different data sources, must their corresponding predicted future ground truth all come from the ERA5 dataset? In other words, could there be inputs with the same time label but slightly different values that correspond to the same ground truth? If so, could this be considered a form of data augmentation, similar to distorting images in CV classification?

  2. In the "Comparison with AI models at 0.25° resolution" section, figure 4 shows token_num on the x-axis. Could you explain how this number is calculated?

  3. For the dataset labeled C3, which has only 3 pressure levels in its ensemble-mode data, when a batch retrieves ensemble-mode data, does the corresponding predicted future ground truth also have only 3 levels? If so, does it use the same weights for the latent level query and the atmospheric keys & values shown in figure 6 of the article as when the input data has 13 pressure levels?

  4. In figure 4b, is the input for Aurora the "HRES Analysis" from HRES-T0 in 2022, and is the ground truth ERA5?

  5. In the fine-tuning settings of Aurora 0.1°, is the ground truth ERA5?

  6. In figure 3b, is the input for Aurora the "HRES Analysis" from HRES-T0? As I understand it, HRES starts every 12 hours, so there are only two zero-lead-time fields per day (00/12). Is the evaluation in figure 3b therefore conducted every 12 hours?

  7. In supplement B.7, formula (9), is x the raw data or the normalized data? Additionally, I plotted x_transformed against x and found that the relationship is not a monotonic bijection, so multiple values of x can map to the same x_transformed, causing information loss. Has the impact of this on model performance been considered?

[Image: plot of x_transformed versus x]

  8. Could you please elaborate on the "embedding dependent on the pressure level" in supplement B.7? For example, how does the tensor shape change? Does this operation apply only to the pollution variables, or also to U, V, T, Q, Z? Are the embeddings for U, V, T, Q, Z initialized from the weights of the 12-hour pretrained model, while the pollution variables are initialized from scratch?

  9. In D.3 (CAMS 0.4° Analysis), how are the learning rates for the backbone and the Perceiver decoder set?

  10. In B.7, "Additional static variables" introduces two constant masks for the timestamp. However, in the code both the encoder and the Swin3D backbone (via AdaptiveLayerNorm) already use a Fourier encoding of the timestamp. Why reintroduce a timestamp mask in the input for pollution forecasting?
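
For reference, my understanding of a generic Fourier timestamp encoding of the kind I mean is roughly the following sketch (the actual frequencies and normalization in the Aurora code may differ):

```python
import numpy as np

def fourier_time_features(t, freqs):
    """Encode scalar timestamps as sin/cos features.

    t:     scalar or 1-D array of timestamps (e.g., hours since some epoch).
    freqs: 1-D array of frequencies (illustrative; Aurora's may differ).
    Returns an array of shape (len(t), 2 * len(freqs)).
    """
    angles = 2.0 * np.pi * np.outer(np.atleast_1d(t), freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

For example, `fourier_time_features(0.0, np.array([1.0, 2.0]))` gives `[[0., 0., 1., 1.]]`: all sines vanish and all cosines are 1 at t = 0.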

  11. In model/film.py, AdaptiveLayerNorm initializes the weights and bias of self.ln_modulation to 0, so shift and scale are 0 at the start of training, making the backbone almost equivalent to an identity mapping at the beginning. What is the rationale or empirical support for this initialization?
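
To make sure I've read the code correctly, the pattern I'm describing looks to me like the "adaLN-Zero" trick from diffusion transformers: the conditioning produces shift = scale = 0 at initialization, so the layer reduces to a plain LayerNorm. A minimal NumPy sketch of that pattern (my own illustration, not Aurora's actual implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last dimension; no learned affine parameters here.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaptiveLayerNormSketch:
    """Conditioning -> (shift, scale) via a linear map initialized to zero,
    so at the start of training the module is exactly plain LayerNorm."""

    def __init__(self, dim, cond_dim):
        # Zero-init: shift = scale = 0 until training moves the weights.
        self.W = np.zeros((cond_dim, 2 * dim))
        self.b = np.zeros(2 * dim)

    def __call__(self, x, cond):
        shift, scale = np.split(cond @ self.W + self.b, 2, axis=-1)
        return layer_norm(x) * (1.0 + scale) + shift
```

At initialization the output equals `layer_norm(x)` for any conditioning input; the conditioning only takes effect once the modulation weights train away from zero.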

  12. In the pollution forecasting experiments, what is the benefit of concatenating the static variables (z, slt, lsm) with the atmospheric variables rather than with the surface variables? Is it a performance improvement or computational efficiency?

  13. In the fine-tuning of Aurora 0.1°, when the patch size is increased from 4 to 10, is my understanding correct that the 10×10 patches are interpolated to 4×4 patches before entering the embedding module, and that in the Perceiver decoder stage the 4×4 patches are interpolated back to 10×10 before being unpatchified into the forecast field? If not, could you describe the correct procedure?
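
For context on why I ask: the alternative I'm aware of (the usual trick for changing the patch size of a ViT-style embedding; I don't know whether Aurora does this) is to resample the patch-embedding kernel itself rather than the data, e.g. bilinearly from 4×4 to 10×10, so the data is always patchified at the new size. A sketch of that resampling for one channel slice:

```python
import numpy as np

def resize_patch_kernel(w, new_size):
    """Bilinearly resample a square 2-D embedding kernel to a new size.

    w: (old, old) array, e.g. one (in-channel, out-channel) slice of a
       4x4 patch-embedding weight. Returns a (new_size, new_size) array.
    """
    old = w.shape[0]
    xs = np.linspace(0.0, old - 1.0, new_size)
    # Interpolate along rows, then along columns.
    rows = np.array([np.interp(xs, np.arange(old), row) for row in w])
    return np.array([np.interp(xs, np.arange(old), col) for col in rows.T]).T
```

In practice this would be applied to every channel slice of the embedding weight, sometimes with a rescaling so that the response to a constant input patch is preserved.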

  14. In table 4, the HRES-0.1 and HRES-0.25 datasets cover almost the same time span and contain exactly the same variables. Why does HRES-0.1 have far fewer frames ("Num frames") than HRES-0.25?

Thank you very much for your time and consideration. I am eager to learn from your insights!

wesselb commented 3 weeks ago

Hey @qqydss! Thank you for your very thorough questions. Just a quick message to let you know that we've seen this. :) We will get back to you shortly!

qqydss commented 3 weeks ago

Great to hear that you've received my questions and will get back to me soon. Looking forward to your response. Thanks!