Sherman Bernard is working on a multi-PI project concerning structural modification to houses in Africa as a means of reducing mosquito populations and reducing the incidence of malaria.
The data come from several sources, including local census data, the project's own census and survey, architectural plans, genetic sequences, and other biological samples. Different people enter the data, and each tends to have "their own way of doing it". Some of the data contain personal information, and going forward, several workers will handle the data who perhaps should not have full access to that personal information.
He has struggled several times with handling data from different sources. In the most recent instance, he has census data pertaining to several levels of organization, from health district through village and household to individual. The data are collected by others, and he has some, but limited, influence over how they are collected. In addition, he is collecting other forms of data, such as disease prevalence and mosquito density. Typically, he keeps all the data in one or more Excel spreadsheets and uses a variety of methods to enter, edit, manage, and merge the data.
Because the people who designed the worksheets often did so before the project began, the worksheets are not optimized for merging the data. There are many redundancies and discrepancies between the worksheets. Because the data are entered by hand, errors are frequent (~10%). Also, as the data are corrected, it is impossible to track the changes or the party responsible for them. As a result, a great deal of time is spent merging, shaping, and cleaning the data, and he still isn't confident that it is all correct.
In this and previous projects he has encountered several recurring problems.
Repetitive data in worksheets
    Results in redundancies
    Requires editing multiple entries for one fix
Multiple people enter and edit the data
    Difficult to track who does what
    Difficult to determine where and why errors arise
    Loss in validity and trust due to lack of provenance
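To make the two problems above concrete, here is a minimal sketch, using Python's built-in sqlite3 module, of one way such data could be organized so that a correction is made once and the change is attributed to a person. It is an illustration under stated assumptions, not a description of what the project does: the table names, column names, the `rename_village` helper, the editor label, and the village name are all hypothetical placeholders.

```python
import sqlite3

# In-memory database; all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE village (
    village_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE household (
    household_id INTEGER PRIMARY KEY,
    village_id   INTEGER NOT NULL REFERENCES village(village_id)
);
CREATE TABLE change_log (
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    changed_by TEXT NOT NULL,
    table_name TEXT NOT NULL,
    row_id     INTEGER NOT NULL,
    old_value  TEXT,
    new_value  TEXT
);
""")

# The village name is stored once (misspelled on purpose); households
# refer to it by key instead of repeating the text in every row.
conn.execute("INSERT INTO village VALUES (1, 'Vilage A')")
conn.executemany("INSERT INTO household VALUES (?, ?)", [(10, 1), (11, 1), (12, 1)])


def rename_village(conn, village_id, new_name, editor):
    """Correct a village name in one place and record who changed it."""
    (old_name,) = conn.execute(
        "SELECT name FROM village WHERE village_id = ?", (village_id,)
    ).fetchone()
    conn.execute(
        "UPDATE village SET name = ? WHERE village_id = ?", (new_name, village_id)
    )
    conn.execute(
        "INSERT INTO change_log (changed_by, table_name, row_id, old_value, new_value) "
        "VALUES (?, 'village', ?, ?, ?)",
        (editor, village_id, old_name, new_name),
    )
    conn.commit()


rename_village(conn, 1, "Village A", "editor_1")

# Every household now shows the corrected name after a single edit.
rows = conn.execute(
    "SELECT h.household_id, v.name FROM household h "
    "JOIN village v USING (village_id) ORDER BY h.household_id"
).fetchall()

# The log records who made the change and what the value was before.
log = conn.execute(
    "SELECT changed_by, old_value, new_value FROM change_log"
).fetchall()
```

In a worksheet that repeats the village name on every household row, the same fix would require editing three cells and would leave no record of who made the change.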
Other Relevant Personas
Clare Franecki, Tao Nguyen, Pearl Daugherty IV