ropensci-archive / cleanEHR

:warning: ARCHIVED :warning: Essential tools and utility functions to facilitate the data processing pipeline, data cleaning and data analysing of clinical data from CC-HIC
GNU General Public License v3.0

Number of episodes not matching #60

Closed sinanshi closed 7 years ago

sinanshi commented 8 years ago

IDHS = 15444(3), Local = 15677

dpshelio commented 8 years ago

On IDHS: The total number of files there is 11 so far.

| from | Nfiles | file name | subjects |
|---|---|---|---|
| Oxford | 1 | Oxford_02042016 | 798 |
| Cambridge | 2 | CUH_07042016 | 147 |
| | | CUHJF02062016 | 174 |
| GSTT | 1 | GSTT_31032016 | 8815 |
| UCLH | 1 | 20160217a_ReId_CC_UCLH | 4405 |
| Imperial | 6 | NIHRHIC_CC_8.3.2_042014-082014 | 170 |
| | | 092014-122014 | 224 |
| | | 012015-042015 | 244 |
| | | 052015-082015 | 224 |
| | | 092015-102015 | 126 |
| | | 112015-122015 | 116 |
| Total | | | 15443 |

These numbers were obtained from the XML files with the following commands:

# Numbers of `subjects` per file
find ./ -iname "*.xml" -exec grep -cH "</d:subject>\|</subject>" {} \; 
# Total number of `subjects` (sum of all)
find ./ -iname "*.xml" -exec grep -c "</d:subject>\|</subject>" {} \; | paste -sd+ - | bc

The all_patients.RData file contains one extra subject. This is due to an additional file, apparently empty, that was created when the data was broken into bits.
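
A variation of the same search could be used to spot such empty files before they reach the parser. This is only a sketch: `list_empty_xml` is a hypothetical helper name, and it assumes the same two closing tags used in the commands above.

```shell
# Hypothetical helper: list XML files that contain no closing subject tag.
# Assumes the tag names used in the counting commands above.
list_empty_xml() {
    dir=${1:-.}
    # grep -L prints the names of files in which no line matches.
    find "$dir" -iname "*.xml" \
        -exec grep -L -e "</d:subject>" -e "</subject>" {} \;
}
```

Running `list_empty_xml ./` from the data directory would print one path per file without subjects. Note that `grep -c` in the original commands counts matching lines rather than tags, so the totals are only exact if each closing tag sits on its own line.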

I'm going through the files available in group_data/Live data to see whether we are missing something, and I'm very confused. For almost every site we have multiple sets, and I don't know whether the older ones are superseded by the newer ones. The names are also confusing; maybe @NicolaCooper could help me here.

Cambridge

  1. CUH20160407 with a single file
  2. CUH20160407_JohnFarman with a single file
  3. ICU16102015XML with 8 files named as: ICU_x.xml

We were processing 1 and 2; I'm now adding the ones in 3, which contain 392 subjects.

Imperial

  1. Imperial NIHR HIC ICM Test Data (I assume this is not needed)
  2. NIHRHIC_CC_8.3.2_ICHNT_15022016; with files like: NIHRHIC_CC_8.3.2_mmYYYY-mmYYYY.xml and range: Apr 2014 - Dec 2015
  3. NIHRHIC_CC_8.3.2_ICHNT_23112015; with files like: ICHNT_ICM_mmYYYY-mmYYYY.xml and range: Apr 2014 - Dec 2015
  4. NIHRHIC_CC_8.3.2_ICHNT_15022016; with files like: NIHRHIC_CC_8.3.2_mmYYYY-mmYYYY.xml and range: Apr 2014 - Dec 2015
  5. NIHRHIC_CC_8.3.2_RYJ_19042015_Feb2014_TO_Jan2015; with files like NIHRHIC_CC_8.3.2_RYJ_19042015-mmYYYY.xml and range: Feb 2014 - Jan 2015

In this case I'm running the ones from 4, as they look like the most recent. However, judging by the file names, 5 contains data from Feb 2014, whereas 4 starts in April 2014.

GSTT

There are three directories, whose names differ in the part after CC:

  1. NIHRHIC_CC_1.1_GSTT_13032015.xml
  2. NIHRHIC_CC_1.3_GSTT_31122015.xml
  3. NIHRHIC_CC_1.1_GSTT_31032016.xml

We've been using 3. I assume 1 and 2 are superseded by 3.

Oxford

This contains four sets:

  1. 20150428 - NIHRHIC-CC_8.3.2_Oxford 28042015.xml
  2. 20151105 - NIHRHIC-CC_8.3.2_Oxford 05112015.xml
  3. 20151112 - NIHRHIC-CC_8.3.2_Oxford 12112015.xml
  4. 20160402 - NIHRHIC-CC_8.3.2_Oxford 02042016.xml

We've been using number 4, but maybe we should use them all?

UCLH

Here I've got the same problem as above: do new extracts supersede the previous ones, or do they have to be added to the previous ones?

  1. UCLH_Extract1_150216 with a file CC (extract 1).xml
  2. UCLH_Extract2_150812 with:
    • part1_2610_NHIC_ICU_v.8.3.2.xml and part1_2610_anon_NHIC_ICU_v.8.3.2.xml
    • part2_2610_NHIC_ICU_v.8.3.2.xml and part2_2610_anon_NHIC_ICU_v.8.3.2.xml
  3. UCLH_Extract3_160217 with 20160217a_ReId_CC.xml
  4. UCLH_Extract4_160524 with 20160524_anon_ReId.xml

Which one should we be using? We were using number 3, but maybe we should be using all of them.
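
One cheap check that could be run before any parsing is whether some of these sets are byte-identical copies of each other. The sketch below is an assumption about how that might be done, not something that has been run on the data: `find_identical_files` is a hypothetical helper that groups XML files by their `cksum` checksum and size.

```shell
# Hypothetical helper: print XML files that share the same checksum and size.
# Each output line is "<crc> <size> <path>"; identical files appear together.
find_identical_files() {
    dir=${1:-.}
    find "$dir" -type f -iname "*.xml" -exec cksum {} + |
        sort -k1,1n -k2,2n |
        awk '{ key = $1 " " $2 }
             key == prev { if (!shown) print prevline; print; shown = 1; next }
             { prev = key; prevline = $0; shown = 0 }'
}
```

This only detects exact copies. Two extracts that overlap in content but differ in any byte would not be reported, so it cannot answer whether a newer extract supersedes an older one.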

Suggestion

I would like to come up with a naming standard to be used across all the files. At the moment dates are in a mix of formats (e.g. Feb2014, 201402, 022014, 0214, etc.), and the file names do not make it easy to automate sorting or to understand their content. @jamespjh @docsteveharris @sinanshi @NicolaCooper is this something we could aim to sort out? Does what I'm saying make sense? (I've just spent more than 2 hours trying to find some missing episodes, and I still don't know whether we are missing episodes or have duplicates.) Since parsing the files takes quite a while and de-duplication cannot be done until everything is parsed, parsing extra files could be avoided if the naming convention were kept consistent across sites. I'm happy to come up with a standard rule myself, but there are quite a lot of things that I don't understand (differences in versions, differences in extracts, differences in names, ...).
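
As an illustration of what such a rule might involve, the sketch below converts a single date token from a file name into `YYYY-MM`, which sorts chronologically. `normalise_month` is a hypothetical helper; it assumes English three-letter month names and years in the 2000s, and it deliberately rejects four-digit tokens such as 0214 because they cannot be told apart (mmYY or YYmm) without knowing the site's convention.

```shell
# Hypothetical helper: normalise one date token to YYYY-MM.
# Assumes years of the form 20xx and English month abbreviations.
normalise_month() {
    token=$1
    case "$token" in
        [A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9])    # e.g. Feb2014
            year=${token#???}
            name=$(printf '%s' "${token%????}" | tr '[:upper:]' '[:lower:]')
            case "$name" in
                jan) month=01 ;; feb) month=02 ;; mar) month=03 ;;
                apr) month=04 ;; may) month=05 ;; jun) month=06 ;;
                jul) month=07 ;; aug) month=08 ;; sep) month=09 ;;
                oct) month=10 ;; nov) month=11 ;; dec) month=12 ;;
                *) return 1 ;;
            esac ;;
        20[0-9][0-9][01][0-9])                           # e.g. 201402 (YYYYmm)
            year=${token%??}
            month=${token#????} ;;
        [01][0-9]20[0-9][0-9])                           # e.g. 022014 (mmYYYY)
            month=${token%????}
            year=${token#??} ;;
        *)
            echo "ambiguous or unknown date token: $token" >&2
            return 1 ;;
    esac
    # Reject impossible months such as 00 or 13.
    case "$month" in
        0[1-9]|1[0-2]) printf '%s-%s\n' "$year" "$month" ;;
        *) return 1 ;;
    esac
}
```

For example, `normalise_month Feb2014`, `normalise_month 201402` and `normalise_month 022014` would all print `2014-02`, while `normalise_month 0214` returns an error. Day-level dates such as 02042016 are out of scope for this sketch.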