Repository for Global.health: a data science initiative to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
There are several approaches to speeding up ingestion, along with some confounding factors:
[X] Ingest diffs instead of bulk uploads where possible [#2975].
[ ] Prune the database to remove cases left behind by partial uploads. Partial uploads have led to a runaway effect: cases are added but never set to list=True or removed (since both require a recent successful upload to be identified), leading to further failed uploads. A quick check reveals that over 80% of cases in the DB are not currently accepted (many of these are likely duplicates from partial uploads).
[X] Related to the previous point, extend timeouts on failing ingestions to allow a successful completion and trigger self-pruning.
[ ] Some sources (apparently those without unique identifiers) fail on pruning ('document failed validation'). This appears to have been happening for some time and relates to deleting cases marked list=False from the database.
[ ] Issue #2551 discusses Mongoose as problematic in the data service.
[ ] Ongoing situation monitoring, relates to globaldothealth/covid19-ingestion-monitor#4
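
To illustrate the diff-based ingestion item above, here is a minimal sketch of comparing two snapshots of a source. It is not the implementation from #2975; the function name `diff_cases` and the idea of keying cases by a per-source entry ID are assumptions made for illustration. Note that it only works for sources that provide unique identifiers.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one case is a flat dict of field -> value.
Case = Dict[str, str]


def diff_cases(
    previous: Dict[str, Case], current: Dict[str, Case]
) -> Tuple[List[Case], List[Case], List[str]]:
    """Compare two snapshots keyed by an assumed unique source entry ID.

    Returns (added, changed, removed_ids) so that only the delta
    needs to be sent to the database instead of the full snapshot.
    """
    added = [case for key, case in current.items() if key not in previous]
    changed = [
        case
        for key, case in current.items()
        if key in previous and previous[key] != case
    ]
    removed_ids = [key for key in previous if key not in current]
    return added, changed, removed_ids
```

With a delta like this, an upload that touches a handful of cases stays small, which should make timeouts and partial uploads less likely.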
Also related to: #2551
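
For the pruning item, a rough sketch of how leftover cases from failed uploads could be identified and how the share of unaccepted cases could be measured. The field names (`id`, `list`, `upload_id`) and the `"ERROR"` status value are assumptions for illustration and may not match the actual schema; the sketch works on plain dicts rather than on the database.

```python
from typing import Dict, List, Union

# Hypothetical case record, e.g. {"id": "1", "list": False, "upload_id": "u1"}.
CaseRecord = Dict[str, Union[str, bool]]


def select_prunable(
    cases: List[CaseRecord], upload_status: Dict[str, str]
) -> List[str]:
    """Return IDs of cases that are not listed and belong to a failed upload.

    Cases from uploads with any other status (for example, still in
    progress) are left alone so that a running ingestion is not disturbed.
    """
    return [
        str(case["id"])
        for case in cases
        if not case["list"] and upload_status.get(str(case["upload_id"])) == "ERROR"
    ]


def unlisted_fraction(cases: List[CaseRecord]) -> float:
    """Fraction of cases with list=False; 0.0 for an empty collection."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if not case["list"]) / len(cases)
```

Running `unlisted_fraction` before and after a prune gives a simple way to check whether the cleanup actually reduced the backlog.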