globaldothealth / list

Repository for Global.health: a data science initiative to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
MIT License

How to deal with India duplicates #1354

Open z023 opened 3 years ago

z023 commented 3 years ago

I have worked out where the duplicates are, but it's not quite a simple fix. When I download the ingested data, the cases with duplicates have two different outputs in the events column, which I have highlighted in the example below. The longer output is the correct one, and the duplicate just seems to be given an ID number beginning with P. But this doesn't mean that all the shorter outputs are wrong, as some IDs only have a shorter output in the events column, and it doesn't mean that all the IDs beginning with P are wrong either. Also, unhelpfully, there are a few duplicates in the original dataset where the same ID has been given to two different people. Those will probably have to be manually altered or deleted.

Spoke to AB about it and he remembered that the source had messier data early on, and that the format and quality stabilised over time. So subsequent CSVs might be OK, but we can't check until the backfill option is up and running.

I'm not sure whether the ingestion duplicates can be fixed in the parser or will have to be manually deleted, and I'm not sure what the difference is between the longer and shorter outputs in the events column.

[Screenshot 2020-10-29 at 20:15:45]

AnyaLindstromBattle commented 3 years ago

OK, so to me there seem to be two issues here.

The first involves duplicates where one record includes a hospital admission and the date of admission, whereas the other only records whether there was a hospital admission or not. A few questions/points here:

(a) How confident can we be that all of these are actually duplicates? For the two you highlighted, I checked the notes column and these two cases are the only ones with the entry 'Travelled from Italy PayTm Emp, ', so I think we can be pretty confident they are duplicates. But is this the same for all?

(b) If these duplicates all vary only in the events column but have exactly the same notes, we could implement some logic where we only ingest a case if there isn't already a case which looks identical bar the events (and make sure the case that we do ingest is the one with the extra information about the hospitalization date). I'm not sure where in the pipeline (as in, in the parser or somewhere else) we would implement this; perhaps @iamleeg or @calremmel could help? I'm also not sure how feasible it would be computationally, as it would involve scanning all the existing cases before adding each new one, which may slow the process down too much.
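To make the idea in (b) concrete, here's a minimal sketch of that de-duplication logic: treat two rows as duplicates if every field except events matches, and keep whichever row carries more event information. The field names and the "longer events string wins" heuristic are assumptions for illustration, not the actual parser schema.

```python
from typing import Dict, List

# Fields compared for duplicate detection; `events` is deliberately excluded
# because the duplicates differ only there. Names are hypothetical.
DEDUP_FIELDS = ("age", "gender", "location", "notes")

def dedup_key(case: Dict) -> tuple:
    """Key built from every field we compare, ignoring `events`."""
    return tuple(case.get(f) for f in DEDUP_FIELDS)

def richer(a: Dict, b: Dict) -> Dict:
    """Prefer the case whose events column carries more information
    (e.g. a hospital admission date as well as the admission itself)."""
    return a if len(str(a.get("events", ""))) >= len(str(b.get("events", ""))) else b

def deduplicate(cases: List[Dict]) -> List[Dict]:
    """One pass over the ingested rows, keeping one row per dedup key."""
    seen: Dict[tuple, Dict] = {}
    for case in cases:
        key = dedup_key(case)
        seen[key] = richer(seen[key], case) if key in seen else case
    return list(seen.values())
```

A dict keyed on the compared fields avoids rescanning all existing cases for every new row, which was the performance worry above: each lookup is constant-time rather than a full scan.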

The second issue is the duplicate IDs given to several people. For this, as you say, I think you will probably have to delete them manually.

z023 commented 3 years ago

The age, gender, location and notes are exactly the same so I'm very confident they are duplicates. I went through the first India .csv and identified which duplicate ID is for which case ID.

There are 13 duplicates in the original first .csv dataset. There are 672 ingestion duplicates.

AnyaLindstromBattle commented 3 years ago

Cool, thanks. If they really are duplicates, we should look into a strategy for removing them. One option is at the ingestion stage, as I mentioned above, but I think I would need some input from @iamleeg or @calremmel on this. Maybe we can have a discussion about whether that seems like a feasible solution or whether manual curation may actually be the best way forward. Will follow this up on Slack to decide on a suitable time.

iamleeg commented 3 years ago

Agreed, this will need some subtle thinking, as it potentially changes the design of parsers quite fundamentally: they might need a history to decide whether they've already "seen" a case, which has memory/time implications.
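One way to bound those memory implications is to keep a set of fixed-size fingerprints rather than full rows. This is an illustrative sketch (not Global.health code, and the field names are hypothetical) of what such a parser "history" might look like:

```python
import hashlib

class CaseHistory:
    """Remembers which cases a parser has already seen, storing only a
    16-byte digest per case instead of the full row."""

    def __init__(self):
        self._seen = set()

    def fingerprint(self, case: dict) -> bytes:
        # Hash only the fields used for duplicate detection; `events` is
        # excluded because the duplicates differ there.
        raw = "|".join(str(case.get(f)) for f in ("age", "gender", "location", "notes"))
        return hashlib.sha256(raw.encode("utf-8")).digest()[:16]

    def already_seen(self, case: dict) -> bool:
        """Return True if an equivalent case was seen before; otherwise
        record this case and return False."""
        fp = self.fingerprint(case)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

The trade-off is that a digest-only history can tell you *that* a case was seen but can't retrieve the earlier row, so merging the two events strings (keeping the richer one) would need the stored rows or a second lookup elsewhere.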

AnyaLindstromBattle commented 3 years ago

@outbreakprepared, thought it may be useful to bring you into this conversation. We've had a few discussions on Slack, so just to update you:

Zoe brought to my attention some issues regarding data duplicates in the India dataset. Briefly, it seems there are duplicates where all fields are identical apart from the events, where one of the duplicates provides a date of hospitalization and the other doesn't. The question is how to deal with these; manual curation is an option but not super ideal, as there are quite a few duplicate cases. The other option is to implement a step somewhere in the parsing/ingestion pipeline where, for example, a case is only ingested if its duplicate (bar the date of hospitalization) isn't already present. We would also, I assume, want the case with the most information actually ingested, i.e. the one with the date of hospitalization present. However, I am unclear on (a) whether this is feasible and (b) if so, where in the pipeline it would be implemented, i.e. in the actual parser or somewhere else?

@calremmel pointed out that he was trying to find a solution to this, as it would be useful in more cases than just this one. In particular, he is trying to think through where in the process cases marked by curators for exclusion should be filtered out.

@iamleeg noted that it'd be interesting to look at some demographic info about affected regions to work out the probability that two rows which 'look the same' really are duplicates rather than distinct people.
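As a back-of-the-envelope version of that idea: with N cases drawn from K equally likely field combinations, the expected number of *coincidental* identical pairs is roughly N(N-1)/2K (a birthday-problem estimate). The numbers below are purely illustrative, not real Indian demographic data:

```python
def expected_coincidental_pairs(n_cases: int, n_combinations: int) -> float:
    """Birthday-problem estimate: expected number of pairs of *distinct*
    people whose compared fields happen to match by chance, assuming the
    n_combinations field values are equally likely (a simplification)."""
    return n_cases * (n_cases - 1) / (2 * n_combinations)

# e.g. 10,000 cases compared only on (age, gender, district), with
# 100 ages x 2 genders x ~700 districts = 140,000 combinations:
coarse = expected_coincidental_pairs(10_000, 140_000)   # hundreds of chance matches
```

The point this makes is that matching only on coarse fields would produce many false "duplicates", whereas including a high-cardinality field like the free-text notes makes K enormous and chance collisions vanishingly rare, which is why the identical-notes check above is reassuring.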

Do you have any thoughts about this?