mongodb-developer / open-data-covid-19

Open Data Repository for the Covid-19 dataset.
Apache License 2.0
21 stars 10 forks

Missing data when group #10

Closed dhienf0707 closed 3 years ago

dhienf0707 commented 3 years ago

I have tried the following aggregation pipeline on the "global_and_us" collection and found that data goes missing when grouped:

[ { '$match': { 'country': 'US' } }, { '$group': { '_id': '$date' } }, { '$sort': { '_id': -1 } } ]

Here is a screenshot of MongoDB Compass showing what looked like corrupted US data (the newest date stops short at 2020-02-20):

image

Here is another pipeline that I used to sum all the cases in the world; the missing data is roughly equal to that of the US:

[ { '$group': { '_id': { 'date': '$date' }, 'confirmed': { '$sum': '$confirmed' } } }, { '$sort': { '_id': -1 } } ]
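For intuition, the `$group`/`$sum` stage behaves roughly like this plain-JavaScript sketch (an illustration, not code from the data set; like `$sum`, it treats a missing field as 0):

```javascript
// Group documents by date and sum `confirmed`, treating a missing
// field as 0 -- roughly what the $group/$sum stage above does.
function sumByDate(docs) {
  const totals = new Map();
  for (const d of docs) {
    totals.set(d.date, (totals.get(d.date) || 0) + (d.confirmed || 0));
  }
  return totals;
}

const totals = sumByDate([
  { date: '2020-10-06', confirmed: 2 },
  { date: '2020-10-06', confirmed: 3 },
  { date: '2020-10-05' },               // no confirmed field
]);
console.log(totals.get('2020-10-06')); // 5
```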

And the screenshot of MongoDB Compass: image

MaBeuLux88 commented 3 years ago

Hi @dhienf0707 and thanks for reaching out!

You are in "sample mode" in Compass - this is the reason why you do not see the entire data set. But the data is all there! Here is a similar pipeline in Compass that shows it. Note that sample mode, in the top right corner, is disabled.

image

As you can see, I have 259 documents in the final count which is equal to the number of days between Jan 22nd and Oct 6th.
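The count checks out: a quick date calculation (an illustrative sketch, not part of the thread's code) confirms that Jan 22nd through Oct 6th 2020, inclusive, spans 259 days:

```javascript
// Inclusive day count between two UTC timestamps.
function daysInclusive(startUtc, endUtc) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return Math.round((endUtc - startUtc) / DAY_MS) + 1;
}

// Jan 22nd 2020 (first day in the data set) through Oct 6th 2020:
console.log(daysInclusive(Date.UTC(2020, 0, 22), Date.UTC(2020, 9, 6))); // 259
```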

image

Cheers, Maxime.

dhienf0707 commented 3 years ago


Hi @MaBeuLux88, yup, you were right. I didn't know that sample mode would even interfere with the sum of confirmed cases. Thank you for your help. I think we can close the issue now.

Cheers, Jerry.

MaBeuLux88 commented 3 years ago

@dhienf0707 I'm working on a "new" version of this data set where I'm adding 3 new fields (confirmed_daily, deaths_daily & recovered_daily) with the daily counts calculated from the cumulative data we already have in place in this data set.

You can check out PR #9, and you can also have a look at the DEV cluster if you like.

mongodb+srv://readonly:readonly@dev.hip2i.mongodb.net/covid19

The data set isn't updated automatically like the prod one - but I just ran the script manually so it's up to date with JHU right now. Also note that you will find some negative values here and there. That's just because the data set from JHU isn't perfect: they updated many values and re-assigned death counts here and there.

I'd be very happy to see your feedback & opinion.

Cheers, Maxime.

dhienf0707 commented 3 years ago

@MaBeuLux88

Yea, I saw that on one of your branches. Currently, I am calculating daily confirmed cases and the incidence rate (on a yearly basis) for one of my projects using '$lookup', and that takes quite a lot of time (especially for the US data). It's great that we will now have the daily cases in place. I will surely check it out.

Thank you.

Cheers, Jerry.

MaBeuLux88 commented 3 years ago

Using $lookup works but it is the slow version of this pipeline:

[
        {"$sort": {unique_daily_field: 1, "date": 1}},
        {"$group": {"_id": "$" + unique_daily_field, "docs": {"$push": {"dt": "$date", "c": "$confirmed", "d": "$deaths", "r": "$recovered"}}}},
        {
            "$set": {
                "docs": {
                    "$map": {
                        "input": {"$range": [0, {"$size": "$docs"}]},
                        "as": "idx",
                        "in": {
                            "$let": {
                                "vars": {"d0": {"$arrayElemAt": ["$docs", {"$max": [0, {"$subtract": ["$$idx", 1]}]}]}, "d1": {"$arrayElemAt": ["$docs", "$$idx"]}},
                                "in": {"dt": "$$d1.dt", "dc": {"$subtract": ["$$d1.c", "$$d0.c"]}, "dd": {"$subtract": ["$$d1.d", "$$d0.d"]},
                                       "dr": {"$subtract": ["$$d1.r", "$$d0.r"]}}
                            }
                        }
                    }
                }
            }
        },
        {"$unwind": "$docs"},
        {"$project": {"_id": "$$REMOVE", unique_daily_field: "$_id", "date": "$docs.dt", "confirmed_daily": {"$ifNull": ["$docs.dc", "$$REMOVE"]},
                      "deaths_daily": {"$ifNull": ["$docs.dd", "$$REMOVE"]}, "recovered_daily": {"$ifNull": ["$docs.dr", "$$REMOVE"]}}},
        {"$merge": {"into": collection, "on": [unique_daily_field, "date"], "whenNotMatched": "fail"}}
    ]
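The `$map`/`$let` stage is the heart of this pipeline: for each index it subtracts the previous element, with index 0 clamped to itself so the first delta is 0. In plain JavaScript the same logic looks roughly like this (a sketch for illustration):

```javascript
// Daily deltas from a cumulative series, mirroring the $map/$let stage:
// each value is compared with the previous one; Math.max(0, i - 1)
// clamps index 0 to itself, so the first delta is 0.
function dailyDeltas(cumulative) {
  return cumulative.map((v, i) => v - cumulative[Math.max(0, i - 1)]);
}

console.log(dailyDeltas([5, 7, 10, 10, 14])); // [0, 2, 3, 0, 4]
```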
dhienf0707 commented 3 years ago

Yep, my first variation was almost exactly like this (sort by date, then group, push and subtract). But I ended up using $lookup with date - 86400000, because I thought sorting would slow down the process (maybe I was wrong). Oh, by the way, I was wondering why the US only has its recovered count in the country-only document; the state and county documents don't seem to have the recovered field. I suppose it is because of a lack of data of some sort.
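For reference, the millisecond offsets used in this thread are plain day/week/year conversions (a quick sanity check, not code from the repo):

```javascript
const DAY_MS  = 24 * 60 * 60 * 1000; // 86400000, one day (the `date - 86400000` lookup)
const WEEK_MS = 7 * DAY_MS;          // 604800000, used for weekly deltas
const YEAR_MS = 365 * DAY_MS;        // 31536000000, used for yearly deltas
console.log(DAY_MS, WEEK_MS, YEAR_MS);
```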

MaBeuLux88 commented 3 years ago

Yes, precisely!

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

Everything is based on these 5 CSVs + this one: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv

As you can see - I don't have recovered data for the US at the county or state level. It's only at the country level in the time_series_covid19_recovered_global.csv file.

dhienf0707 commented 3 years ago

I am making a RESTful API for querying country/state/county information. If only a country is specified, it returns info about the country as a whole; if country + state is specified, it returns the state's info as a whole; if country + state + county is specified, it returns info about the county. The format looks like this:

image

This is one of the apps using my API: image image

I found a way to optimize my pipeline: group first, then $lookup with a pipeline containing the exact same group. Now everything works pretty smoothly (it was slow before because I did the lookup first and grouped later). Everything is working fine except that the population gets doubled when US is chosen: since the US has a country-only doc with its own population, grouping by country and summing population doubles it, so when US is chosen I need to change '$sum' to '$first' for population. But it is alright for now, haha. By the way, this is part of my code for the API; do you have any suggestions for improving it?

This is my pipeline and part of the code in nodejs:

  let country = req.query.country;
  let state = req.query.state;
  let county = req.query.county;

  let match = {
    'country': country,
  }

  let group = {
    '_id': {
      'country': '$country',
      'date': '$date'
    },
    'confirmed': {
      '$sum': '$confirmed'
    },
    'deaths': {
      '$sum': '$deaths'
    },
    'recovered': {
      '$sum': '$recovered'
    },
    'population': {
      '$sum': '$population'
    }
  }

  let expr = {
    '$and': [
      {
        '$eq': [
          '$country', '$$cur_country'
        ]
      }, {

      }, {

      }, {
        '$or': [
          {
            '$eq': [
              '$date', '$$weekly_date'
            ]
          }, {
            '$eq': [
              '$date', '$$yearly_date'
            ]
          }
        ]
      }
    ]
  }

  if (state !== '' && state !== undefined) {
    match['state'] = state;
    group['_id']['state'] = '$state';
    expr['$and'][1]['$eq'] = ['$state', '$$cur_state'];

    if (county !== '' && county !== undefined) {
      match['county'] = county;
      group['_id']['county'] = '$county';
      expr['$and'][2]['$eq'] = ['$county', '$$cur_county'];
    }
  } else if (country === 'US') {
    group['population'] = {
      $first: '$population'
    }
  }

  const pipeline = [
    {
      '$match': match
    }, {
      '$group': group
    }, {
      '$lookup': {
        'from': 'global_and_us',
        'let': {
          'cur_country': '$_id.country',
          'cur_state': '$_id.state',
          'cur_county': '$_id.county',
          'weekly_date': {
            '$subtract': [
              '$_id.date', 604800000
            ]
          },
          'yearly_date': {
            '$subtract': [
              '$_id.date', 31536000000
            ]
          }
        },
        'pipeline': [
          {
            '$match': {
              '$expr': expr
            }
          }, {
            '$group': group
          }
        ],
        'as': 'result'
      }
    }, {
      '$set': {
        'weekly_confirmed': {
          '$subtract': [
            '$confirmed', {
              '$ifNull': [
                {
                  '$arrayElemAt': [
                    '$result.confirmed', 0
                  ]
                }, 0
              ]
            }
          ]
        },
        'weekly_deaths': {
          '$subtract': [
            '$deaths', {
              '$ifNull': [
                {
                  '$arrayElemAt': [
                    '$result.deaths', 0
                  ]
                }, 0
              ]
            }
          ]
        },
        'yearly_confirmed': {
          '$subtract': [
            '$confirmed', {
              '$ifNull': [
                {
                  '$arrayElemAt': [
                    '$result.confirmed', 1
                  ]
                }, 0
              ]
            }
          ]
        }
      }
    }, {
      '$addFields': {
        'active': {
          '$subtract': [
            {
              '$subtract': [
                '$confirmed', '$deaths'
              ]
            }, '$recovered'
          ]
        },
        'incidence': {
          '$divide': [
            {
              '$multiply': [
                '$yearly_confirmed', 100000
              ]
            }, '$population'
          ]
        },
        'fatality': {
          '$cond': [
            {
              '$eq': [
                '$confirmed', 0
              ]
            }, 0, {
              '$multiply': [
                {
                  '$divide': [
                    '$deaths', '$confirmed'
                  ]
                }, 100
              ]
            }
          ]
        }
      }
    }, {
      '$unset': [
        'result', 'yearly_confirmed'
      ]
    }, {
      '$sort': {
        '_id': 1
      }
    }
  ]
MaBeuLux88 commented 3 years ago

1/ I already did a REST API and a GraphQL API ;-) !

2/ I don't like this population issue you found. In PROD:

In DEV:

The countries_summary collection is not affected by this "double count" because it's based on the global collection - which has an accurate count. The bug I'm fixing in PR #9 concerned Canada, France and China (and probably a few others), which are sub-divided into states, provinces or regions - that's the reason why I need a $sum and not a $first.

3/ Pipeline optimisation... This seems wrong because you don't know which population you will get in the end, as you are not sorting or doing anything in particular to make sure you retrieve the population of the entire country.

if (country === 'US') {
    group['population'] = {
        $first: '$population'
    }
}

The $lookup is definitely the expensive stage here. I'm wondering if there is a way to do the same thing without it, but I don't see one at the moment. Everything else looks fine. On which collection are you running this pipeline? Why not use the countries_summary collection to make this entire pipeline simpler? You wouldn't have the state or county level but... Depends what you want to do.

dhienf0707 commented 3 years ago
  1. Oh, I didn't know that, haha.

  2. Yea, the population seems pretty weird; even in countries_summary, France doesn't have the right population (298682), whereas in the global collection it seems right (65273512).

  3. Because I match by country first and then group by {country + date}, I thought the first doc would always be the country-only doc, since it has the lower ObjectId and I didn't change any order. Turned out I was wrong.

What I am trying to do here is to make a consistent api that works for all cases, my api url looks like this:

/api/getinfo?country=&state=&county=

If no country is specified, it returns the worldwide info; if only a country is specified, it returns that country as a whole; the same goes for state and county.

The collection I am using is global_and_us; my data is pretty much spot on except for the population and the incidence rate (since the incidence rate is based on population). I could switch collections based on the condition: if nothing or only a country is specified, I could use the countries_summary collection. But I found that even countries_summary has something wrong with the population (e.g. France).

The world population seems to be odd here: image

France seems to be odd: image

Same for Canada; I don't know if there are any other countries that go wrong: image

With the given error in countries_summary, the only way for me to get a nearly correct world population is to set the population of the country-only doc to 0: image

I think for the countries that already have their own population, confirmed, deaths and recovered per region (e.g. France), we should drop the country-only doc, since grouping and summing them results in wrong case counts (confirmed, deaths, recovered) and a wrong population. So far I think France is the only exception: it has both the country-only doc and the region docs, with all of the cases. image image

The US and Canada are fine for me now because their country-only docs only have recovered, and the region-specific docs have confirmed/deaths, which gives the right result when grouped.
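The double-count can be reproduced with a toy example (hypothetical documents and made-up figures, for illustration only): summing population across a country-only doc plus state docs doubles the total, while zeroing the population on the state-level docs, as discussed above, fixes the sum.

```javascript
// Hypothetical US documents: one country-only doc plus two state docs
// whose populations add up to the country total (made-up numbers).
const docs = [
  { country: 'US',              population: 330 },
  { country: 'US', state: 'NY', population: 100 },
  { country: 'US', state: 'CA', population: 230 },
];

// Naive $sum over the group: the country-only doc doubles the total.
const naive = docs.reduce((s, d) => s + d.population, 0);

// Zero the population on state-level docs (those with a `state` field)
// before summing, keeping only the country-only doc's figure.
const fixed = docs.reduce((s, d) => s + (d.state ? 0 : d.population), 0);

console.log(naive, fixed); // 660 330
```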

dhienf0707 commented 3 years ago

@MaBeuLux88 I finally figured it all out. This is the final code for the API. It can handle all countries/states/counties, and I also managed to solve the population/cases conflict for the US, Canada and France. You were right; using $lookup was a huge mistake of mine.

I put the whole router code here so it would be easier to understand the context:

router.get('/graphinfo', async (req, res, next) => {
  const covid19 = req.app.mongodb.db("covid19");
  let country = req.query.country === undefined ? '' : req.query.country;
  let state = req.query.state === undefined ? '' : req.query.state;
  let county = req.query.county === undefined ? '' : req.query.county;
  let key = `graphinfo_${country}_${state}_${county}`;

  let match = {
    'loc': {
      '$exists': true
    }
  }

  let set = {
    '$set': {
      'population': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'US', 'Canada', 'France'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$population'
        ]
      },
      'confirmed': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'France'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$confirmed'
        ]
      },
      'deaths': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'France'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$deaths'
        ]
      },
      'recovered': {
        '$cond': [
          {
            '$and': [
              {
                '$anyElementTrue': [
                  {
                    '$map': {
                      'input': [
                        'France', 'US', 'Canada'
                      ],
                      'as': 'el',
                      'in': {
                        '$eq': [
                          '$$el', '$country'
                        ]
                      }
                    }
                  }
                ]
              }, '$state'
            ]
          }, 0, '$recovered'
        ]
      }
    }
  }

  let _id = {
    'date': '$date'
  }

  if (country !== '') {
    match['country'] = country;
    _id['country'] = '$country';

    if (state !== '') {
      set = {
        '$match': {}
      }
      match['state'] = state;
      _id['state'] = '$state';

      if (county !== '') {
        match['county'] = county;
        _id['county'] = '$county';
      }
    }
  }

  const pipeline = [
    {
      '$match': match
    }, {
      '$sort': {
          'date': 1
      }
    }, set, {
      '$group': {
        '_id': '$loc',
        'docs': {
          '$push': {
            'date': '$date',
            'country': '$country',
            'state': '$state',
            'county': '$county',
            'confirmed': '$confirmed',
            'deaths': '$deaths',
            'recovered': '$recovered',
            'population': '$population'
          }
        }
      }
    }, {
      '$set': {
        'docs': {
          '$map': {
            'input': {
              '$range': [
                0, {
                  '$size': '$docs'
                }
              ]
            },
            'as': 'idx',
            'in': {
              '$let': {
                'vars': {
                  'yearly': {
                    '$arrayElemAt': [
                      '$docs', {
                        '$max': [
                          0, {
                            '$subtract': [
                              '$$idx', 365
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  'weekly': {
                    '$arrayElemAt': [
                      '$docs', {
                        '$max': [
                          0, {
                            '$subtract': [
                              '$$idx', 7
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  'this': {
                    '$arrayElemAt': [
                      '$docs', '$$idx'
                    ]
                  }
                },
                'in': {
                  'date': '$$this.date',
                  'country': '$$this.country',
                  'state': '$$this.state',
                  'county': '$$this.county',
                  'confirmed': '$$this.confirmed',
                  'deaths': '$$this.deaths',
                  'recovered': '$$this.recovered',
                  'weekly_confirmed': {
                    '$subtract': [
                      '$$this.confirmed', '$$weekly.confirmed'
                    ]
                  },
                  'weekly_deaths': {
                    '$subtract': [
                      '$$this.deaths', '$$weekly.deaths'
                    ]
                  },
                  'yearly_confirmed': {
                    '$subtract': [
                      '$$this.confirmed', '$$yearly.confirmed'
                    ]
                  },
                  'population': '$$this.population'
                }
              }
            }
          }
        }
      }
    }, {
      '$unwind': '$docs'
    }, {
      '$replaceRoot': {
        'newRoot': '$docs'
      }
    }, {
      '$group': {
        '_id': _id,
        'confirmed': {
          '$sum': '$confirmed'
        },
        'deaths': {
          '$sum': '$deaths'
        },
        'recovered': {
          '$sum': '$recovered'
        },
        'weekly_confirmed': {
          '$sum': '$weekly_confirmed'
        },
        'weekly_deaths': {
          '$sum': '$weekly_deaths'
        },
        'yearly_confirmed': {
          '$sum': '$yearly_confirmed'
        },
        'population': {
          '$sum': '$population'
        }
      }
    }, {
      '$set': {
        'active': {
          '$subtract': [
            {
              '$subtract': [
                '$confirmed', '$deaths'
              ]
            }, '$recovered'
          ]
        },
        'incidence': {
          '$divide': [
            {
              '$multiply': [
                '$yearly_confirmed', 100000
              ]
            }, '$population'
          ]
        },
        'fatality': {
          '$cond': [
            {
              '$eq': [
                '$confirmed', 0
              ]
            }, 0, {
              '$multiply': [
                {
                  '$divide': [
                    '$deaths', '$confirmed'
                  ]
                }, 100
              ]
            }
          ]
        }
      }
    }, {
      '$sort': {
        '_id.date': 1
      }
    }
  ]

  redis_get(key)
    .then(async (result) => {
      if (result) {
        res.status(200).json(JSON.parse(result));
        throw `Caught '${key}' in cache`; // short-circuit the rest of the promise chain
      } else {
        // get data from database and store in cache
        return covid19.collection("global_and_us").aggregate(pipeline, { allowDiskUse:true }).toArray()
      }
    })
    .then(async (result) => {
      // store the result from mongodb to cache
      redis_client.setex(key, 28800, JSON.stringify({ source: "Redis Cache", result: result }));

      // send the result back to client
      return res.send({ source: "Mongodb", result: result });
    })
    .catch(err => console.log(err))
})
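As a side note on the caching flow above: the `throw` inside the first `.then` is used to skip the second one. A more conventional cache-aside shape with `async`/`await` might look like this sketch (an in-memory `Map` and a stub query stand in for Redis and the aggregation; all names here are illustrative):

```javascript
// In-memory stand-in for Redis; the real code would use redis_get/setex.
const cache = new Map();

// Stub for the MongoDB aggregation call.
async function queryDatabase(key) {
  return { result: [`data for ${key}`] };
}

// Cache-aside: return a hit if present, otherwise query and populate.
async function getInfo(key) {
  const hit = cache.get(key);
  if (hit) return { source: 'Cache', ...JSON.parse(hit) };

  const fresh = await queryDatabase(key);
  cache.set(key, JSON.stringify(fresh)); // real code: setex with a 28800s TTL
  return { source: 'Database', ...fresh };
}
```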
MaBeuLux88 commented 3 years ago

That looks very neat!! Would you mind sharing this into the MongoDB Community forum here & we can continue the discussion there? https://developer.mongodb.com/community/forums/t/devhub-a-free-rest-api-for-johns-hopkins-university-covid-19-dataset/8915

It's kinda hidden here and nobody can see this awesome pipeline! I think some countries are missing in your list ;) !

dhienf0707 commented 3 years ago

Yea sure, I will write a specific comment with the context behind this code when I have time. I am quite busy at school these days haha 😂.