swb-ief / etl-pipeline

The Covid Lens

Correction of Delta positivity and 21 day moving average delta positi… #117

Closed talkgarima closed 3 years ago

talkgarima commented 3 years ago

Correction of Delta positivity and 21 day moving average delta positivity values

talkgarima commented 3 years ago

This is an OK patch. Looking at what it means, though, there seems to be a data issue. We expect tested > confirmed; tested can't be 0 when confirmed > 0, since a confirmed case is based on a test. We might want to investigate that in phase II instead of patching it there as well.

Not sure we would be able to fix that without going through the route of imputation. It is happening due to inconsistency in the data. There are days when no tested figures are reported, and then after 3-4 days we get a cumulative number. The confirmed figures, however, are consistent, without any gaps. This is why delta positivity sometimes comes out as inf.
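To make the failure mode concrete, here is a minimal sketch with made-up numbers (not real pipeline data). The function name `delta_positivity` and the input lists are hypothetical; it only assumes that delta positivity is the daily change in confirmed divided by the daily change in tested.

```python
import math

def delta_positivity(cum_confirmed, cum_tested):
    """Daily positivity from cumulative series: delta confirmed / delta tested."""
    rates = []
    for i in range(1, len(cum_confirmed)):
        d_conf = cum_confirmed[i] - cum_confirmed[i - 1]
        d_test = cum_tested[i] - cum_tested[i - 1]
        if d_test == 0:
            # No new tests reported: the ratio is undefined (inf if there were cases).
            rates.append(math.inf if d_conf > 0 else math.nan)
        else:
            rates.append(d_conf / d_test)
    return rates

# Hypothetical data: confirmed is reported daily, tested is frozen for
# three days and then arrives as one cumulative catch-up figure.
cum_confirmed = [100, 110, 120, 130, 140]
cum_tested = [1000, 1000, 1000, 1000, 1400]

rates = delta_positivity(cum_confirmed, cum_tested)
print(rates)
```

The three days without a tested update come out as inf, and the catch-up day absorbs all the missing tests, so its rate is understated.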

lijinsgithub commented 3 years ago

I happened to see your conversation. How about correcting the discrepancies with a rule-based imputation approach? I think both of you already have the right answers, and adding two rules can improve the issues without introducing a major imputation scheme.

  1. For a date on which confirmed cases is zero (I will call it CC_d = 0), impute CC_d <- CC_{d-1} (to prevent delta TP from being infinite).
  2. For a date on which confirmed cases < test positives (I will call it CC_d < TP_d), correct CC_d = TP_d (to correct a delta TP that is greater than one).

Although I do not know the process of data collection, to me, it is reasonable to assume that test positive reports are more accurate than total test confirmed because positives are more critical numbers. So, if any discrepancy happens between the two statistics, it seems better to honor test positives more than total test confirmed.
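The two rules above could be sketched roughly as follows. This assumes CC_d is the daily denominator and TP_d the daily test positives, so that positivity is TP_d / CC_d; the function name `apply_rules` and the sample numbers are hypothetical, and carrying forward the already corrected previous value is one possible reading of rule 1.

```python
def apply_rules(cc, tp):
    """Return a corrected copy of cc using the two imputation rules."""
    fixed = []
    for d, (cc_d, tp_d) in enumerate(zip(cc, tp)):
        # Rule 1: a zero is replaced by the previous day's (corrected) value.
        # The first day has no previous value, so it is left unchanged.
        if cc_d == 0 and d > 0:
            cc_d = fixed[d - 1]
        # Rule 2: the denominator may never be smaller than the positives.
        if cc_d < tp_d:
            cc_d = tp_d
        fixed.append(cc_d)
    return fixed

# Hypothetical daily figures, not real data.
cc = [200, 0, 0, 5]
tp = [10, 12, 8, 9]

fixed = apply_rules(cc, tp)
rates = [t / c for t, c in zip(tp, fixed)]
print(fixed)
print(rates)
```

With these inputs every rate stays within 0 and 1; the last day is clamped to exactly 1 by rule 2.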

Nozziel commented 3 years ago

We'll need to discuss this with the dashboard team as well. Your suggestion 2 would lead to a positivity rate of 100%, which is fine if the user can see both the positive and tested numbers and so understand why it 'spiked'. If that's not the case, I would prefer to set it to NaN (the Infs are stripped because Google Sheets can't handle them).
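As a rough sketch of the NaN alternative (the helper name `positivity_or_nan` is made up for illustration and is not existing pipeline code), implausible days are marked as missing instead of being clamped to 100%:

```python
import math

def positivity_or_nan(positives, tested):
    """Return positives / tested, or NaN when the inputs are inconsistent."""
    # No tests reported, or more positives than tests: flag as missing
    # rather than emitting inf or a clamped value of 1.0.
    if tested <= 0 or positives > tested:
        return math.nan
    return positives / tested

print(positivity_or_nan(10, 400))  # consistent day
print(positivity_or_nan(10, 0))    # no tests reported
print(positivity_or_nan(9, 5))     # more positives than tests
```

The trade-off is that NaN leaves a visible gap on the dashboard, whereas clamping shows a spike that needs the underlying numbers to explain it.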

lijinsgithub commented 3 years ago

I meant the rules to keep 0 <= positivity rate <= 1, instead of infinity. Also, your suggestion of flagging a different cause with a different symbol like "NaN" sounds great too.

I totally agree that you need to discuss this with the dashboard team before making a final decision, as there may be other issues causing the same trouble. I was just wondering why the issues have not been fixed, since I think I heard about similar issues in one of the phase I meetings. If you already had some ideas on the issues and I interrupted your plans, sorry for that and please ignore my comments. Thanks for your reply!

Nozziel commented 3 years ago

The only way to improve and learn is to have discussions like these, so they are always welcome.