ropensci-archive / cleanEHR

:warning: ARCHIVED :warning: Essential tools and utility functions to facilitate the data processing pipeline, data cleaning and data analysing of clinical data from CC-HIC
GNU General Public License v3.0

Issue with table.to.ccdata #154

Closed: DocEd closed this issue 2 years ago

DocEd commented 6 years ago

I am unable to convert PostgreSQL data into ccdata for further work with cleanEHR. The conversion fails with: "Error in mutate_impl(.data, dots) : found duplicated column name : meta"

dpshelio commented 6 years ago

This depends on #148

DocEd commented 6 years ago

I suspect the issue was with not collecting the metadata properly. I'll leave this here though, as this is as yet untested in the safe-haven.

anoopshah commented 6 years ago

This error occurs for data fields with multiple columns. All items which are not the primary_column or time are given the column name 'meta'. In the case of NIHR_HIC_ICU_0187 (organism) there are 3 text fields, of which one is given data type 'item2d' and the other two columns are named 'meta'. What is the standard for naming the additional metadata in the original ccRecord?
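The naming rule described above can be reproduced in a few lines. This is only a minimal Python sketch of the behaviour as reported in this comment, not cleanEHR's actual R code: `assign_column_names` and `duplicated` are hypothetical helpers, and the column names `organism`, `sample_site` and `sensitivity` are invented for illustration.

```python
from collections import Counter


def assign_column_names(columns, primary_column, time_column="time"):
    """Mimic the reported rule: primary column -> 'item2d', time -> 'time',
    everything else -> 'meta'. Hypothetical helper, not cleanEHR code."""
    names = []
    for col in columns:
        if col == primary_column:
            names.append("item2d")
        elif col == time_column:
            names.append("time")
        else:
            names.append("meta")
    return names


def duplicated(names):
    """Return the sorted list of names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Invented column layout for a field with three text columns plus a time column.
organism_columns = ["time", "organism", "sample_site", "sensitivity"]
names = assign_column_names(organism_columns, primary_column="organism")

print(names)              # assigned names, in column order
print(duplicated(names))  # names that clash
```

With two non-primary text columns, both receive the name 'meta', which is the clash the error message reports.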

anoopshah commented 6 years ago

I will discuss this with @klapaukh

DocEd commented 6 years ago

This remains a problem. I've removed the microbiology fields for the sake of producing a new extract. @klapaukh I suspect it would be more worthwhile working on porting the anonymiser to output an SQLite file, rather than trying to modify the current ccanonym to work with a "legacy" data structure. What do you think?

anoopshah commented 6 years ago

I discussed this with @klapaukh. He has combined the microbiology data items into one row with 3 columns to keep information about a sample together (and avoid the need for a sample ID to link this information). However, table.to.ccdata does not work because the additional two columns become metadata with the same column name 'meta', and the ccRecord specification has no documented way of naming these columns differently. I tried to correct this for the sepsis3 by pre-processing the extract from PostgreSQL, splitting the 3 variables so that they each had only one data column.
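The pre-processing workaround can be sketched roughly as follows. This is an assumption-laden Python illustration of the idea (one data column per variable), not the script actually used: `split_multi_column_rows`, the key columns, the data column names and the example row are all hypothetical.

```python
def split_multi_column_rows(rows, key_columns, data_columns):
    """Turn each row holding several data columns into one record per data
    column, so that no record carries more than one data value."""
    out = []
    for row in rows:
        for col in data_columns:
            record = {key: row[key] for key in key_columns}
            record["variable"] = col    # which original column this came from
            record["value"] = row[col]  # the single data value
            out.append(record)
    return out


# Invented example: one combined microbiology row with three text columns.
combined = [
    {
        "episode_id": 1,
        "time": "2018-01-01 10:00",
        "organism": "example organism",
        "sample_site": "example site",
        "sensitivity": "example result",
    }
]

split = split_multi_column_rows(
    combined,
    key_columns=["episode_id", "time"],
    data_columns=["organism", "sample_site", "sensitivity"],
)

for record in split:
    print(record)
```

After the split, every record has a single data column, so no two columns compete for the name 'meta'. The trade-off is that the three values are no longer held in one row, so they can only be linked back to each other through the shared key columns.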