mongodb-developer / open-data-covid-19

Open Data Repository for the Covid-19 dataset.
Apache License 2.0
21 stars 10 forks

Missing data when group #10

Closed dhienf0707 closed 3 years ago

dhienf0707 commented 3 years ago

I have tried the following aggregation pipeline on the "global_and_us" collection and found that data goes missing when grouped:

[ { '$match': { 'country': 'US' } }, { '$group': { '_id': '$date' } }, { '$sort': { '_id': -1 } } ]

Here is a screenshot of MongoDB Compass showing what looked like corrupted US data (the newest date stops short at 2020-02-20):

image

Here is another pipeline that I used to sum all the cases in the world; the missing data is roughly equal to that of the US:

[ { '$group': { '_id': { 'date': '$date' }, 'confirmed': { '$sum': '$confirmed' } } }, { '$sort': { '_id': -1 } } ]
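For intuition, the `$group`/`$sum` stage behaves roughly like this plain-JavaScript sketch (an illustration, not code from the data set; like `$sum`, it treats a missing field as 0):

```javascript
// Group documents by date and sum `confirmed`, treating a missing
// field as 0 -- roughly what the $group/$sum stage above does.
function sumByDate(docs) {
  const totals = new Map();
  for (const d of docs) {
    totals.set(d.date, (totals.get(d.date) || 0) + (d.confirmed || 0));
  }
  return totals;
}

const totals = sumByDate([
  { date: '2020-10-06', confirmed: 2 },
  { date: '2020-10-06', confirmed: 3 },
  { date: '2020-10-05' },               // no confirmed field
]);
console.log(totals.get('2020-10-06')); // 5
```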

And the screenshot of MongoDB Compass: image

MaBeuLux88 commented 3 years ago

Hi @dhienf0707 and thanks for reaching out!

You are in "sample mode" in Compass - this is the reason why you do not see the entire data set. But the data is all there! Here is a similar pipeline in Compass that shows it. Note that sample mode, in the top right corner, is disabled.

image

As you can see, I have 259 documents in the final count which is equal to the number of days between Jan 22nd and Oct 6th.
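The count checks out: a quick date calculation (an illustrative sketch, not part of the thread's code) confirms that Jan 22nd through Oct 6th 2020, inclusive, spans 259 days:

```javascript
// Inclusive day count between two UTC timestamps.
function daysInclusive(startUtc, endUtc) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return Math.round((endUtc - startUtc) / DAY_MS) + 1;
}

// Jan 22nd 2020 (first day in the data set) through Oct 6th 2020:
console.log(daysInclusive(Date.UTC(2020, 0, 22), Date.UTC(2020, 9, 6))); // 259
```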

image

Cheers, Maxime.

dhienf0707 commented 3 years ago


Hi @MaBeuLux88, yup, you were right. I didn't know that sample mode would even interfere with the sum of confirmed cases. Thank you for your help. I think we can close the issue now.

Cheers, Jerry.

MaBeuLux88 commented 3 years ago

@dhienf0707 I'm working on a "new" version of this data set where I'm adding 3 new fields (confirmed_daily, deaths_daily & recovered_daily) with the daily counts calculated from the cumulative data we already have in place in this data set.

You can check out PR #9, and you can also have a look at the DEV cluster if you like.

mongodb+srv://readonly:readonly@dev.hip2i.mongodb.net/covid19

The data set isn't updated automatically like the prod one - but I just ran the script manually so it's up to date with JHU right now. Also note that you will find some negative values here and there. That's just because the data set from JHU isn't perfect: they updated many values and re-assigned death counts here and there.

I'd be very happy to see your feedback & opinion.

Cheers, Maxime.

dhienf0707 commented 3 years ago

@MaBeuLux88

Yea, I saw that on one of your branches. Currently, I am calculating daily confirmed cases and the incidence rate (on a yearly basis) for one of my projects using '$lookup', and that takes quite a lot of time (especially for the US data). It's great that we will now have the daily cases in place. I will surely check it out.

Thank you.

Cheers, Jerry.

MaBeuLux88 commented 3 years ago

Using $lookup works but it is the slow version of this pipeline:

[
        {"$sort": {unique_daily_field: 1, "date": 1}},
        {"$group": {"_id": "$" + unique_daily_field, "docs": {"$push": {"dt": "$date", "c": "$confirmed", "d": "$deaths", "r": "$recovered"}}}},
        {
            "$set": {
                "docs": {
                    "$map": {
                        "input": {"$range": [0, {"$size": "$docs"}]},
                        "as": "idx",
                        "in": {
                            "$let": {
                                "vars": {"d0": {"$arrayElemAt": ["$docs", {"$max": [0, {"$subtract": ["$$idx", 1]}]}]}, "d1": {"$arrayElemAt": ["$docs", "$$idx"]}},
                                "in": {"dt": "$$d1.dt", "dc": {"$subtract": ["$$d1.c", "$$d0.c"]}, "dd": {"$subtract": ["$$d1.d", "$$d0.d"]},
                                       "dr": {"$subtract": ["$$d1.r", "$$d0.r"]}}
                            }
                        }
                    }
                }
            }
        },
        {"$unwind": "$docs"},
        {"$project": {"_id": "$$REMOVE", unique_daily_field: "$_id", "date": "$docs.dt", "confirmed_daily": {"$ifNull": ["$docs.dc", "$$REMOVE"]},
                      "deaths_daily": {"$ifNull": ["$docs.dd", "$$REMOVE"]}, "recovered_daily": {"$ifNull": ["$docs.dr", "$$REMOVE"]}}},
        {"$merge": {"into": collection, "on": [unique_daily_field, "date"], "whenNotMatched": "fail"}}
    ]
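The `$map`/`$let` stage is the heart of this pipeline: for each index it subtracts the previous element, with index 0 clamped to itself so the first delta is 0. In plain JavaScript the same logic looks roughly like this (a sketch for illustration):

```javascript
// Daily deltas from a cumulative series, mirroring the $map/$let stage:
// each value is compared with the previous one; Math.max(0, i - 1)
// clamps index 0 to itself, so the first delta is 0.
function dailyDeltas(cumulative) {
  return cumulative.map((v, i) => v - cumulative[Math.max(0, i - 1)]);
}

console.log(dailyDeltas([5, 7, 10, 10, 14])); // [0, 2, 3, 0, 4]
```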
dhienf0707 commented 3 years ago

Yep, my first variation was almost exactly like this (sort by date, then group, push and subtract). But I ended up using $lookup with date - 86400000, because I thought sorting would slow down the process (maybe I was wrong). Oh, by the way, I was wondering why the US only has its recovered count in the country-only document; the state and county documents don't seem to have the recovered field. I suppose it is because of a lack of data of some sort.
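For reference, the millisecond offsets used in this thread are plain day/week/year conversions (a quick sanity check, not code from the repo):

```javascript
const DAY_MS  = 24 * 60 * 60 * 1000; // 86400000, one day (the `date - 86400000` lookup)
const WEEK_MS = 7 * DAY_MS;          // 604800000, used for weekly deltas
const YEAR_MS = 365 * DAY_MS;        // 31536000000, used for yearly deltas
console.log(DAY_MS, WEEK_MS, YEAR_MS);
```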

MaBeuLux88 commented 3 years ago

Yes, precisely!

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

Everything is based on these 5 CSVs + this one: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv

As you can see - I don't have recovered data for the US at the county or state level. It's only at the country level in the time_series_covid19_recovered_global.csv file.

dhienf0707 commented 3 years ago

I am making a RESTful API for querying country/state/county information. If only a country is specified, it returns info about the country as a whole; if country + state is specified, it returns the state's info as a whole; if country + state + county is specified, it returns info about the county. The format looks like this:

image

This is one of the apps using my API: image image

I found a way to optimize my pipeline: group first, then $lookup with a pipeline containing the exact same group. Now everything works pretty smoothly (it was slow before because I did the lookup first and grouped later). Everything is working fine except that the population gets doubled when US is chosen: since the US has a country-only doc with its own population, grouping by country and summing population doubles it, so when US is chosen I need to change '$sum' to '$first' for population. But it is alright for now, haha. By the way, this is part of my code for the API; do you have any suggestions for improving it?

This is my pipeline and part of the code in nodejs:

  let country = req.query.country;
  let state = req.query.state;
  let county = req.query.county;

  let match = {
    'country': country,
  }

  let group = {
    '_id': {
      'country': '$country',
      'date': '$date'
    },
    'confirmed': {
      '$sum': '$confirmed'
    },
    'deaths': {
      '$sum': '$deaths'
    },
    'recovered': {
      '$sum': '$recovered'
    },
    'population': {
      '$sum': '$population'
    }
  }

  let expr = {
    '$and': [
      {
        '$eq': [
          '$country', '$$cur_country'
        ]
      }, {

      }, {

      }, {
        '$or': [
          {
            '$eq': [
              '$date', '$$weekly_date'
            ]
          }, {
            '$eq': [
              '$date', '$$yearly_date'
            ]
          }
        ]
      }
    ]
  }

  if (state !== '' && state !== undefined) {
    match['state'] = state;
    group['_id']['state'] = '$state';
    expr['$and'][1]['$eq'] = ['$state', '$$cur_state'];

    if (county !== '' && county !== undefined) {
      match['county'] = county;
      group['_id']['county'] = '$county';
      expr['$and'][2]['$eq'] = ['$county', '$$cur_county'];
    }
  } else if (country === 'US') {
    group['population'] = {
      $first: '$population'
    }
  }

  const pipeline = [
    {
      '$match': match
    }, {
      '$group': group
    }, {
      '$lookup': {
        'from': 'global_and_us',
        'let': {
          'cur_country': '$_id.country',
          'cur_state': '$_id.state',
          'cur_county': '$_id.county',
          'weekly_date': {
            '$subtract': [
              '$_id.date', 604800000
            ]
          },
          'yearly_date': {
            '$subtract': [
              '$_id.date', 31536000000
            ]
          }
        },
        'pipeline': [
          {
            '$match': {
              '$expr': expr
            }
          }, {
            '$group': group
          }
        ],
        'as': 'result'
      }
    }, {
      '$set': {
        'weekly_confirmed': {
          '$subtract': [
            '$confirmed', {
              '$ifNull': [
                {
                  '$arrayElemAt': [
                    '$result.confirmed', 0
                  ]
                }, 0
              ]
            }
          ]
        },
        'weekly_deaths': {
          '$subtract': [
            '$deaths', {
              '$ifNull': [
                {
                  '$arrayElemAt': [
                    '$result.deaths', 0
                  ]
                }, 0
              ]
            }
          ]
        },
        'yearly_confirmed': {
          '$subtract': [
            '$confirmed', {
              '$ifNull': [
                {
                  '$arrayElemAt': [
                    '$result.confirmed', 1
                  ]
                }, 0
              ]
            }
          ]
        }
      }
    }, {
      '$addFields': {
        'active': {
          '$subtract': [
            {
              '$subtract': [
                '$confirmed', '$deaths'
              ]
            }, '$recovered'
          ]
        },
        'incidence': {
          '$divide': [
            {
              '$multiply': [
                '$yearly_confirmed', 100000
              ]
            }, '$population'
          ]
        },
        'fatality': {
          '$cond': [
            {
              '$eq': [
                '$confirmed', 0
              ]
            }, 0, {
              '$multiply': [
                {
                  '$divide': [
                    '$deaths', '$confirmed'
                  ]
                }, 100
              ]
            }
          ]
        }
      }
    }, {
      '$unset': [
        'result', 'yearly_confirmed'
      ]
    }, {
      '$sort': {
        '_id': 1
      }
    }
  ]
MaBeuLux88 commented 3 years ago

1/ I already did a REST API and a GraphQL API ;-) !

2/ I don't like this population issue you found. In PROD:

In DEV:

The countries_summary collection is not affected by this "double count" because it's based on the global collection - which has an accurate count. The bug I'm fixing in PR #9 concerned Canada, France and China (and probably a few others), which are sub-divided into states, provinces or regions - that's the reason why I need a $sum and not a $first.

3/ Pipeline optimisation... This seems wrong because you don't know which population you will get in the end, as you are not sorting or doing anything in particular to make sure you retrieve the population of the entire country.

if (country === 'US') {
    group['population'] = {
        $first: '$population'
    }
}

The $lookup is definitely the expensive stage here. I'm wondering if there is a way to do the same thing without it, but I don't see one at the moment. Everything else looks fine. On which collection are you running this pipeline? Why not use the countries_summary collection to make this entire pipeline simpler? You wouldn't have the state or county level but... Depends what you want to do.

dhienf0707 commented 3 years ago
  1. Oh, I didn't know that, haha.

  2. Yea, the population seems pretty weird; even in countries_summary, France doesn't have the right population (298682), whereas in the global collection it seems right (65273512).

  3. Because I match by country first and then group by {country + date}, I thought the first doc would always be the country-only doc, since it has the lower ObjectId and I didn't change any order. Turned out I was wrong.

What I am trying to do here is to make a consistent api that works for all cases, my api url looks like this:

/api/getinfo?country=&state=&county=

If no country is specified, it returns the worldwide info; if only a country is specified, it returns that country as a whole; the same goes for state and county.

The collection I am using is global_and_us; my data is pretty much spot on except for the population and the incidence rate (since the incidence rate is based on population). I could switch collections based on the condition: if nothing or only a country is specified, I could use the countries_summary collection. But I found that even countries_summary has something wrong with the population (e.g. France).

The world population seems to be odd here: image

France seems to be odd: image

Same for Canada; I don't know if there are any other countries that go wrong: image

With the given error in countries_summary, the only way for me to get a nearly correct world population is to set the population of the country-only doc to 0: image

I think for the countries that already have their own population, confirmed, deaths and recovered per region (e.g. France), we should drop the country-only doc, since grouping and summing them results in wrong case counts (confirmed, deaths, recovered) and a wrong population. So far I think France is the only exception: it has both the country-only doc and the region docs, with all of the cases. image image

The US and Canada are fine for me now because their country-only docs only have recovered, and the region-specific docs have confirmed/deaths, which gives the right result when grouped.
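The double-count can be reproduced with a toy example (hypothetical documents and made-up figures, for illustration only): summing population across a country-only doc plus state docs doubles the total, while zeroing the population on the state-level docs, as discussed above, fixes the sum.

```javascript
// Hypothetical US documents: one country-only doc plus two state docs
// whose populations add up to the country total (made-up numbers).
const docs = [
  { country: 'US',              population: 330 },
  { country: 'US', state: 'NY', population: 100 },
  { country: 'US', state: 'CA', population: 230 },
];

// Naive $sum over the group: the country-only doc doubles the total.
const naive = docs.reduce((s, d) => s + d.population, 0);

// Zero the population on state-level docs (those with a `state` field)
// before summing, keeping only the country-only doc's figure.
const fixed = docs.reduce((s, d) => s + (d.state ? 0 : d.population), 0);

console.log(naive, fixed); // 660 330
```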

dhienf0707 commented 3 years ago

@MaBeuLux88 I finally figured it all out. This is the final code for the API. It can handle all countries/states/counties, and I also managed to solve the population/cases conflict for the US, Canada and France. You were right; using $lookup was a huge mistake of mine.

I put the whole router code here so it would be easier to understand the context:

router.get('/graphinfo', async (req, res, next) => {
  const covid19 = req.app.mongodb.db("covid19");
  let country = req.query.country === undefined ? '' : req.query.country;
  let state = req.query.state === undefined ? '' : req.query.state;
  let county = req.query.county === undefined ? '' : req.query.county;
  let key = `graphinfo_${country}_${state}_${county}`;

  let match = {
    'loc': {
      '$exists': true
    }
  }

  let set = {
    '$set': {
      'population': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'US', 'Canada', 'France'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$population'
        ]
      },
      'confirmed': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'France'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$confirmed'
        ]
      },
      'deaths': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'France'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$deaths'
        ]
      },
      'recovered': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'France', 'US', 'Canada'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$recovered'
        ]
      }
    }
  }

  let _id = {
    'date': '$date'
  }

  if (country !== '') {
    match['country'] = country;
    _id['country'] = '$country';

    if (state !== '') {
      set = {
        '$match': {}
      }
      match['state'] = state;
      _id['state'] = '$state';

      if (county !== '') {
        match['county'] = county;
        _id['county'] = '$county';
      }
    }
  }

  const pipeline = [
    {
      '$match': match
    }, {
      '$sort': {
          'date': 1
      }
    }, set, {
      '$group': {
        '_id': '$loc',
        'docs': {
          '$push': {
            'date': '$date',
            'country': '$country',
            'state': '$state',
            'county': '$county',
            'confirmed': '$confirmed',
            'deaths': '$deaths',
            'recovered': '$recovered',
            'population': '$population'
          }
        }
      }
    }, {
      '$set': {
        'docs': {
          '$map': {
            'input': {
              '$range': [
                0, {
                  '$size': '$docs'
                }
              ]
            },
            'as': 'idx',
            'in': {
              '$let': {
                'vars': {
                  'yearly': {
                    '$arrayElemAt': [
                      '$docs', {
                        '$max': [
                          0, {
                            '$subtract': [
                              '$$idx', 365
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  'weekly': {
                    '$arrayElemAt': [
                      '$docs', {
                        '$max': [
                          0, {
                            '$subtract': [
                              '$$idx', 7
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  'this': {
                    '$arrayElemAt': [
                      '$docs', '$$idx'
                    ]
                  }
                },
                'in': {
                  'date': '$$this.date',
                  'country': '$$this.country',
                  'state': '$$this.state',
                  'county': '$$this.county',
                  'confirmed': '$$this.confirmed',
                  'deaths': '$$this.deaths',
                  'recovered': '$$this.recovered',
                  'weekly_confirmed': {
                    '$subtract': [
                      '$$this.confirmed', '$$weekly.confirmed'
                    ]
                  },
                  'weekly_deaths': {
                    '$subtract': [
                      '$$this.deaths', '$$weekly.deaths'
                    ]
                  },
                  'yearly_confirmed': {
                    '$subtract': [
                      '$$this.confirmed', '$$yearly.confirmed'
                    ]
                  },
                  'population': '$$this.population'
                }
              }
            }
          }
        }
      }
    }, {
      '$unwind': '$docs'
    }, {
      '$replaceRoot': {
        'newRoot': '$docs'
      }
    }, {
      '$group': {
        '_id': _id,
        'confirmed': {
          '$sum': '$confirmed'
        },
        'deaths': {
          '$sum': '$deaths'
        },
        'recovered': {
          '$sum': '$recovered'
        },
        'weekly_confirmed': {
          '$sum': '$weekly_confirmed'
        },
        'weekly_deaths': {
          '$sum': '$weekly_deaths'
        },
        'yearly_confirmed': {
          '$sum': '$yearly_confirmed'
        },
        'population': {
          '$sum': '$population'
        }
      }
    }, {
      '$set': {
        'active': {
          '$subtract': [
            {
              '$subtract': [
                '$confirmed', '$deaths'
              ]
            }, '$recovered'
          ]
        },
        'incidence': {
          '$divide': [
            {
              '$multiply': [
                '$yearly_confirmed', 100000
              ]
            }, '$population'
          ]
        },
        'fatality': {
          '$cond': [
            {
              '$eq': [
                '$confirmed', 0
              ]
            }, 0, {
              '$multiply': [
                {
                  '$divide': [
                    '$deaths', '$confirmed'
                  ]
                }, 100
              ]
            }
          ]
        }
      }
    }, {
      '$sort': {
        '_id.date': 1
      }
    }
  ]

  redis_get(key)
    .then(async (result) => {
      if (result) {
        res.status(200).json(JSON.parse(result));
        throw `Caught '${key}' in cache`; // short-circuit the rest of the promise chain
      } else {
        // get data from database and store in cache
        return covid19.collection("global_and_us").aggregate(pipeline, { allowDiskUse:true }).toArray()
      }
    })
    .then(async (result) => {
      // store the result from mongodb to cache
      redis_client.setex(key, 28800, JSON.stringify({ source: "Redis Cache", result: result }));

      // send the result back to client
      return res.send({ source: "Mongodb", result: result });
    })
    .catch(err => console.log(err))
})
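As a side note on the caching flow above: the `throw` inside the first `.then` is used to skip the second one. A more conventional cache-aside shape with `async`/`await` might look like this sketch (an in-memory `Map` and a stub query stand in for Redis and the aggregation; all names here are illustrative):

```javascript
// In-memory stand-in for Redis; the real code would use redis_get/setex.
const cache = new Map();

// Stub for the MongoDB aggregation call.
async function queryDatabase(key) {
  return { result: [`data for ${key}`] };
}

// Cache-aside: return a hit if present, otherwise query and populate.
async function getInfo(key) {
  const hit = cache.get(key);
  if (hit) return { source: 'Cache', ...JSON.parse(hit) };

  const fresh = await queryDatabase(key);
  cache.set(key, JSON.stringify(fresh)); // real code: setex with a 28800s TTL
  return { source: 'Database', ...fresh };
}
```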
MaBeuLux88 commented 3 years ago

That looks very neat!! Would you mind sharing this into the MongoDB Community forum here & we can continue the discussion there? https://developer.mongodb.com/community/forums/t/devhub-a-free-rest-api-for-johns-hopkins-university-covid-19-dataset/8915

It's kinda hidden here and nobody can see this awesome pipeline! I think some countries are missing in your list ;) !

dhienf0707 commented 3 years ago

Yea sure, I will write a specific comment with the context behind this code when I have time. I am quite busy at school these days haha 😂.