sir-lab / data-release

Huawei Cloud datasets

How to deal with "NaN" value? #4

Closed z-hou closed 3 months ago

z-hou commented 3 months ago

While analyzing the data, I found that the CSV file contains many blank cells, which pandas reads as "NaN". What exactly does this "NaN" value mean? Are these missing values? If so, could you please let me know how you handle them in your implementation of the invocation prediction task?

ajoosen commented 3 months ago

Hi,

In our dataset, a NaN value usually means that the metric was not collected at that timestamp because there were no requests for that function at that time. For function invocation prediction, we only use functions that have many requests and, usually, only a small number of NaNs.
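As a rough illustration of that selection step, here is a minimal sketch. It assumes the per-function request counts have already been loaded into a pandas DataFrame with one row per timestamp and one column per function; that layout, the helper name `select_functions`, and both cutoff values are hypothetical and would need adapting to the actual CSV files.

```python
import pandas as pd


def select_functions(requests: pd.DataFrame,
                     max_nan_fraction: float = 0.05,
                     min_total_requests: float = 10_000) -> list:
    """Return the columns (functions) with few NaNs and many requests.

    `requests` is assumed to hold one column per function and one row per
    timestamp, with NaN where nothing was collected. Both thresholds are
    illustrative defaults, not values taken from the dataset authors.
    """
    nan_fraction = requests.isna().mean()   # share of NaN timestamps per function
    total_requests = requests.sum()         # NaNs are skipped by default
    keep = (nan_fraction <= max_nan_fraction) & (total_requests >= min_total_requests)
    return [name for name, ok in keep.items() if ok]
```

The returned names can then be used to index the DataFrame, e.g. `requests[select_functions(requests)]`.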

Once we have chosen our function, there may still be some NaNs that we have to handle while training a forecasting model. We usually do this by replacing them with a constant value (e.g. 0) or by throwing away the sample. For example, for an input window of 360 and an output window of 60, we may define a threshold such that the input window must have fewer than 30 NaNs and the output window must have fewer than 10 NaNs. If a sample is within the threshold, we replace its few NaNs with 0 or another value. If it has more NaNs than the threshold allows, we skip that sample and use another one. This threshold will vary per function, and you will have to tune it yourself to see how many NaNs you can tolerate without adversely affecting your model's performance.
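A minimal sketch of the windowing rule described above, assuming the chosen function's invocation counts are available as a 1-D array with NaN for missing timestamps. The helper name `make_windows`, the `stride` parameter, and the sliding-window layout are assumptions for illustration; the window sizes and thresholds are just the example numbers from this comment and should be tuned per function.

```python
import numpy as np


def make_windows(series, input_len=360, output_len=60,
                 max_input_nans=30, max_output_nans=10,
                 fill_value=0.0, stride=1):
    """Slide over `series` and build (input, output) training samples.

    A sample is kept only if its input window has fewer than
    `max_input_nans` NaNs and its output window has fewer than
    `max_output_nans` NaNs; the remaining NaNs are replaced by `fill_value`.
    """
    series = np.asarray(series, dtype=float)
    total_len = input_len + output_len
    inputs, outputs = [], []
    for start in range(0, len(series) - total_len + 1, stride):
        x = series[start:start + input_len]
        y = series[start + input_len:start + total_len]
        # Skip samples that reach or exceed either NaN threshold.
        if np.isnan(x).sum() >= max_input_nans:
            continue
        if np.isnan(y).sum() >= max_output_nans:
            continue
        # Replace the few remaining NaNs with a constant.
        inputs.append(np.where(np.isnan(x), fill_value, x))
        outputs.append(np.where(np.isnan(y), fill_value, y))
    if not inputs:
        return np.empty((0, input_len)), np.empty((0, output_len))
    return np.stack(inputs), np.stack(outputs)
```

Comparing model performance for a few different values of `max_input_nans` and `max_output_nans` is one way to tune the threshold for a given function.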

Missing value handling is a developing area of research, and we encourage you to look at our ICLR paper "DAM: Towards a Foundation Model for Forecasting" (https://iclr.cc/virtual/2024/poster/19467), where we use probabilistic sampling to handle the NaN issue natively.