psu-libraries / library_data_services


Difficulties with organizing and modeling data. #17

Open olendorf opened 8 years ago

olendorf commented 8 years ago

Sherman Bernard is working on a multi-PI project concerning structural modifications to houses in Africa as a means of reducing mosquito populations and the incidence of malaria.

The data come from several sources, including local census data, the project's own census and survey, architectural plans, genetic sequences, and other biological samples. Different people enter the data, and each tends to have "their own way of doing it". Some of the data contain personal information and, going forward, several workers will handle the data who perhaps should not have full access to that personal information.
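One way the access concern above might be handled is to split each record into a restricted part (identifiers and personal fields) and a working part that carries only a pseudonymous key. The sketch below is purely illustrative: the field names (`person_id`, `name`, `phone`), the `split_record` helper, and the use of a keyed hash are all assumptions, not something the project is known to use.

```python
import hashlib
import hmac
from typing import Dict, Tuple

# Hypothetical personal fields; the project's real column names are not given.
PERSONAL_FIELDS = {"name", "phone"}


def pseudonym(person_id: str, secret: bytes) -> str:
    """Derive a stable key from an identifier using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret, person_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def split_record(record: Dict, secret: bytes) -> Tuple[Dict, Dict]:
    """Return (restricted, working) views of one survey row.

    The restricted view keeps the identifier and personal fields; the
    working view keeps everything else plus the pseudonymous key.
    """
    key = pseudonym(record["person_id"], secret)
    restricted = {"key": key, "person_id": record["person_id"]}
    restricted.update({f: record[f] for f in PERSONAL_FIELDS if f in record})
    working = {"key": key}
    working.update(
        {f: v for f, v in record.items()
         if f not in PERSONAL_FIELDS and f != "person_id"}
    )
    return restricted, working
```

Workers who only need the survey values could then be given the working view, while the restricted view and the secret stay with whoever is authorized to re-link records.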

He has struggled several times with handling data from different sources. In the most recent instance, he has census data pertaining to several levels of organization, from health district through village and household to individual. The data are collected by others, and he has some, but limited, influence over how they collect them. In addition, he is collecting other forms of data, such as disease prevalence and mosquito density. Typically, he keeps all the data in one or more Excel spreadsheets and uses a variety of methods to enter, edit, manage, and merge the data.
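The district, village, household, and individual levels described above map naturally onto linked tables. As a minimal sketch, assuming SQLite and hypothetical table and column names (none of which come from the project), each level could reference its parent by key, so a village name is stored once and joined in when needed:

```python
import sqlite3

# Hypothetical schema: one table per level, each row pointing at its parent.
SCHEMA = """
PRAGMA foreign_keys = ON;
CREATE TABLE health_district (
    district_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE village (
    village_id  INTEGER PRIMARY KEY,
    district_id INTEGER NOT NULL REFERENCES health_district(district_id),
    name        TEXT NOT NULL
);
CREATE TABLE household (
    household_id INTEGER PRIMARY KEY,
    village_id   INTEGER NOT NULL REFERENCES village(village_id)
);
CREATE TABLE individual (
    individual_id INTEGER PRIMARY KEY,
    household_id  INTEGER NOT NULL REFERENCES household(household_id),
    age           INTEGER
);
"""


def build(conn):
    """Create the four linked tables on an open connection."""
    conn.executescript(SCHEMA)


def individuals_per_village(conn):
    """Count individuals per village by joining down the hierarchy."""
    return conn.execute("""
        SELECT v.name, COUNT(i.individual_id)
        FROM village v
        JOIN household h ON h.village_id = v.village_id
        JOIN individual i ON i.household_id = h.household_id
        GROUP BY v.village_id
        ORDER BY v.name
    """).fetchall()
```

With this layout, correcting a village name is a single `UPDATE` on the `village` table, and every household and individual picks up the change through the join.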

Because the people who designed the worksheets often did so before the project began, the worksheets are not optimized for merging. There are many redundancies and discrepancies between the worksheets. Because the data are entered by hand, errors are frequent (~10%). Also, as the data are corrected, it is impossible to track the changes or the party responsible for them. As a result, a great deal of time is spent merging, shaping, and cleaning the data, and he still isn't confident that it is all correct.
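The tracking problem described above could, for example, be addressed by recording every correction in a change log alongside the data. The following is a rough sketch only, assuming SQLite and a hypothetical `household` table with an `n_residents` column; the function and column names are invented for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open a database with a data table and a change log (hypothetical names)."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS household (
            household_id TEXT PRIMARY KEY,
            n_residents  INTEGER
        );
        CREATE TABLE IF NOT EXISTS change_log (
            changed_at   TEXT,
            changed_by   TEXT,
            household_id TEXT,
            field        TEXT,
            old_value    TEXT,
            new_value    TEXT,
            reason       TEXT
        );
    """)
    return conn


def correct_residents(conn, household_id, new_value, changed_by, reason):
    """Update n_residents and record who changed it, when, and why."""
    row = conn.execute(
        "SELECT n_residents FROM household WHERE household_id = ?",
        (household_id,),
    ).fetchone()
    if row is None:
        raise KeyError(household_id)
    with conn:  # commit both statements together, or roll both back
        conn.execute(
            "UPDATE household SET n_residents = ? WHERE household_id = ?",
            (new_value, household_id),
        )
        conn.execute(
            "INSERT INTO change_log VALUES (?, ?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), changed_by, household_id,
             "n_residents", str(row[0]), str(new_value), reason),
        )
```

Because the old value, new value, editor, and reason are stored together, it becomes possible to ask later where a given value came from and to undo a correction that turns out to be wrong.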

In this and previous projects, he has encountered several recurring problems:

  1. Repetitive data in worksheets
    1. Results in redundancies
    2. Requires editing multiple entries for one fix
  2. Multiple people enter and edit the data
    1. Difficult to track who does what
    2. Difficult to determine where and why errors arise
    3. Loss in validity and trust due to lack of provenance
  3. Similar data entered into multiple worksheets
    1. Results in redundancies
    2. Requires editing multiple entries for one fix
    3. Difficult to merge tables together for analysis
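When similar data live in more than one worksheet, a first step before merging might be to list where the sheets disagree. The sketch below assumes each worksheet has been loaded as a list of row dictionaries sharing a key column; `compare_sheets` and the column names in the usage note are hypothetical.

```python
def compare_sheets(sheet_a, sheet_b, key):
    """Compare two lists of row dicts that share a key column.

    Returns (keys only in A, keys only in B, conflicts), where each conflict
    is a tuple (key value, field, value in A, value in B).
    """
    a = {row[key]: row for row in sheet_a}
    b = {row[key]: row for row in sheet_b}
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    conflicts = []
    for k in sorted(a.keys() & b.keys()):
        # Only compare fields that both sheets record for this row.
        shared = (a[k].keys() & b[k].keys()) - {key}
        for field in sorted(shared):
            if a[k][field] != b[k][field]:
                conflicts.append((k, field, a[k][field], b[k][field]))
    return only_a, only_b, conflicts
```

For instance, calling `compare_sheets(census_rows, survey_rows, "household_id")` would report households missing from either sheet and every shared field whose values differ, which gives a concrete list to resolve rather than an ad hoc visual comparison.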

Other Relevant Personas

Clare Franecki, Tao Nguyen, Pearl Daugherty IV