nextgrid / dll-ideas


DLL Global #1 challenge #14

Open · jacekplocharczyk opened this issue 4 years ago

jacekplocharczyk commented 4 years ago

DLL Global #1

04.04.2020

This time we will have a standard deep learning challenge.


General Info:

Challenge: Predict COVID-19 spread across specific countries and worldwide

Criterion: Root mean squared logarithmic error (RMSLE)

Additional challenge: Find meaningful insights about COVID-19 (an open question)

Test set: see General TODOs


Dataset info/TODOs:

  1. We could use the data from the Kaggle competition (alternative: ECDC data). I think a license check is required. See the loading sketch after this list.
  2. We could use data from the World Bank about economic indicators for each country (some helping materials). Manual data extraction could be required; we can use this code, or .csv files can also be downloaded directly from their page.
  3. We could find a dataset with travel restrictions, airport shutdowns, etc. to combine with the rest of the data. Example of such data: https://www.kaggle.com/paultimothymooney/covid19-containment-and-mitigation-measures
  4. We could use climate data for each country.

General TODOs:

  1. ~~Describe evaluation metrics for the main challenge. We could use column-wise root mean squared logarithmic error like in the Kaggle competition.~~
    We will use RMSLE for evaluation (see the sketch after this list).
  2. Define evaluation datasets. We could prepare two test sets: one from the last week (public, used only to check that a submission works; it could overlap with the train set) and the other based on the week after the hackathon, with a public leaderboard updated daily. @raznem @Mindgames what do you think?
  3. Should we predict infections or deaths? It is a delicate matter, so we should be careful here.
  4. Describe evaluation metrics for the additional challenge. @Mindgames please let us know when you find someone (or some company/organization).
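
As mentioned in TODO 1, here is a minimal sketch of the metric: column-wise RMSLE averaged over the target columns, in the spirit of the Kaggle competition. The example targets (cases, deaths) and the absence of per-column weights are assumptions until the evaluation datasets are defined.

```python
# Minimal sketch of the evaluation metric: column-wise RMSLE averaged over the
# target columns. Column names and any per-column weighting are placeholders.
import numpy as np

def rmsle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared logarithmic error for non-negative counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def column_wise_rmsle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average RMSLE over target columns; inputs have shape (n_samples, n_targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_column = [rmsle(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
    return float(np.mean(per_column))

# Example with two hypothetical targets (confirmed cases, deaths).
truth = np.array([[100, 2], [150, 3], [230, 5]])
preds = np.array([[90, 1], [160, 4], [250, 6]])
print(round(column_wise_rmsle(truth, preds), 4))
```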

We will update this post when we find out more.

raznem commented 4 years ago

One comment about data licensing:

  1. Licensing of the data: the dataset is publicly shared on GitHub by Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19), and there are license conditions (educational and academic purposes; hopefully the competition fits this definition).

TODOs:

  1. RMSLE is the most obvious and safe choice for evaluation.
  2. It's a tricky part because we can use only public data, and people can easily submit scraped samples. So I am not sure how it should work out; I think the only option for an interactive leaderboard is to use something like the last day of stats.
  3. Why not both?
jacekplocharczyk commented 4 years ago

> One comment about data licensing:
>
> 1. Licensing of the data: the dataset is publicly shared on GitHub by Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19), and there are license conditions (educational and academic purposes; hopefully the competition fits this definition).

For me, it should be OK. @Mindgames what do you think?


> 1. RMSLE is the most obvious and safe choice for evaluation.

So let's use it. I've added it to the main post.


> 1. It's a tricky part because we can use only public data, and people can easily submit scraped samples. So I am not sure how it should work out; I think the only option for an interactive leaderboard is to use something like the last day of stats.

Participants send us their predictions for the next week in advance. We store them and score them day by day, updating the leaderboard.


> 1. Why not both?

For some, it could be controversial. There is a disclaimer on the Kaggle competition website:

> We understand this is a serious situation, and in no way want to trivialize the human impact this crisis is causing by predicting fatalities. Our goal is to provide better methods for estimates that can assist medical and governmental institutions to prepare and adjust as pandemics unfold.

As long as we look professional, it's OK for me.

raznem commented 4 years ago

@jacekplocharczyk

> Participants send us their predictions for the next week in advance. We store them and score them day by day, updating the leaderboard.

I meant an interactive part where participants send their .csv and see the results on the leaderboard instantly.

raznem commented 4 years ago

For example, we could hold out the last day without labels and hope that people will not scrape them; then participants can compare how well they are doing against others.

jacekplocharczyk commented 4 years ago

What about creating a leaderboard with two columns: training-data error (the last 7 days) and upcoming-data error (the next 7 days plus the data from the first day of the hackathon, day 0)?

  1. The training-data error would only give an intuition of how well a team is doing; it would not influence the final placing.
  2. The upcoming-data error would be the main target to minimize. We could give teams the first results (day 0) after they are revealed (probably a few hours after midnight or at 8 AM; the event ends at 11 AM).

We could also weight the 0-day error at only 10%, so it shouldn't be the deciding factor.

Edit: @Mindgames which countries do we care about? Should teams predict only worldwide data (which won't be very insightful), or should we make a few categories, e.g.:

I've also added the idea of using climate data for each country to the main post.

raznem commented 4 years ago

@jacekplocharczyk

> 1. The training-data error would only give an intuition of how well a team is doing; it would not influence the final placing.

Training data will be very misleading, because it only measures overfitting, not how well teams are doing. That means you will have a leaderboard where people with bad, overfitted models are on top, which takes away the main reason for having a leaderboard: comparing performance with others.

> 1. The upcoming-data error would be the main target to minimize. We could give teams the first results (day 0) after they are revealed (probably a few hours after midnight or at 8 AM; the event ends at 11 AM).

You are right, we should update scores once the data is available; that will make it more interactive. However, I think we should use this 0-day error as part of the public leaderboard: since our metric is just an average, we can simply recalculate the scores as new data arrives.

> We could also weight the 0-day error at only 10%, so it shouldn't be the deciding factor.

Why not just ignore it when calculating the final scores? Then nobody will be able to gain anything from cheating.

> I've also added the idea of using climate data for each country to the main post.

:+1:

jacekplocharczyk commented 4 years ago

@raznem

> Training data will be very misleading, because it only measures overfitting, not how well teams are doing. That means you will have a leaderboard where people with bad, overfitted models are on top, which takes away the main reason for having a leaderboard: comparing performance with others.

You are right, but for me it's only for checking whether a submission was sent successfully. Of course, teams should be aware that they can overfit to this train leaderboard, but since it is only meaningful until we get the 0-day results, I wouldn't be bothered.

We could also replace the train leaderboard with a 0-day leaderboard in the morning. <- this sounds like a good idea to me.


>> 1. The upcoming-data error would be the main target to minimize. We could give teams the first results (day 0) after they are revealed (probably a few hours after midnight or at 8 AM; the event ends at 11 AM).
>
> You are right, we should update scores once the data is available; that will make it more interactive. However, I think we should use this 0-day error as part of the public leaderboard: since our metric is just an average, we can simply recalculate the scores as new data arrives.
>
>> We could also weight the 0-day error at only 10%, so it shouldn't be the deciding factor.
>
> Why not just ignore it when calculating the final scores? Then nobody will be able to gain anything from cheating.

So do we agree that our plan will be the following:

  1. Use the train leaderboard until 8 AM <- informative purpose only
  2. Use the 0-day leaderboard from 8 AM <- informative purpose only
  3. Update the leaderboard every day after the hackathon at 8 AM
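
For step 3, a hedged sketch of how the daily update could work. The names and data structures below are hypothetical; it assumes each team has already submitted one prediction per (country, day) for the evaluation week, that we rank by RMSLE averaged over the days whose labels are known so far, and that the train/0-day leaderboards stay informative only.

```python
# Hypothetical sketch of the daily leaderboard update (step 3). It assumes every
# team has already submitted one prediction per (country, day) for the evaluation
# week, and that teams are ranked by RMSLE over the days whose labels are known.
from datetime import date
from math import log1p, sqrt
from typing import Dict, Tuple

Key = Tuple[str, date]  # (country, day)

def rmsle(pairs) -> float:
    """RMSLE over (prediction, truth) pairs."""
    errors = [(log1p(pred) - log1p(true)) ** 2 for pred, true in pairs]
    return sqrt(sum(errors) / len(errors))

def update_leaderboard(
    submissions: Dict[str, Dict[Key, float]],  # team name -> stored predictions
    ground_truth: Dict[Key, float],            # labels revealed so far
) -> Dict[str, float]:
    """Recompute every team's score over all days with known labels."""
    scores = {}
    for team, preds in submissions.items():
        pairs = [(preds[key], truth) for key, truth in ground_truth.items() if key in preds]
        scores[team] = rmsle(pairs) if pairs else float("inf")
    # Lower RMSLE is better: sort ascending for display.
    return dict(sorted(scores.items(), key=lambda item: item[1]))

# Example (made-up numbers): rerun this each morning as new labels arrive.
submissions = {
    "team_a": {("Poland", date(2020, 4, 5)): 4300.0},
    "team_b": {("Poland", date(2020, 4, 5)): 3900.0},
}
ground_truth = {("Poland", date(2020, 4, 5)): 4102.0}
print(update_leaderboard(submissions, ground_truth))
```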
raznem commented 4 years ago

@jacekplocharczyk

> You are right, but for me it's only for checking whether a submission was sent successfully. Of course, teams should be aware that they can overfit to this train leaderboard, but since it is only meaningful until we get the 0-day results, I wouldn't be bothered.

If you check Kaggle or other competitions with leaderboards, they keep a public leaderboard during the whole event. This creates a competitive spirit, because you can see how well you are doing against others right now. If you just want to check whether a submission is correct, you don't need a leaderboard at all; you can simply send feedback that it's correct. However, if we want to keep a public leaderboard and don't use some hidden labels, it will be useless until 8 AM, which is close to the end of the competition. So if we accept results computed on the train data, they shouldn't be public, because that can only mislead people.

From my perspective, we have a trade-off between giving participants more data and having more competition from the beginning. I think one day of data is worth losing to have competition from the start rather than waiting until 8 AM, which is close to the end. How do you see this trade-off?

> We could also replace the train leaderboard with a 0-day leaderboard in the morning. <- this sounds like a good idea to me.

Agreed, we can update results live during the competition itself to make it more intense :)

> 1. Update the leaderboard every day after the hackathon at 8 AM

Maybe let's not update it so early in the morning, in case something breaks? :)