Number of episodes not matching

On IDHS: there are 11 files in total so far.

| From | N files | File name | Subjects |
|---|---|---|---|
| Oxford | 1 | `Oxford_02042016` | 798 |
| Cambridge | 2 | `CUH_07042016` | 147 |
| | | `CUHJF02062016` | 174 |
| GSTT | 1 | `GSTT_31032016` | 8815 |
| UCLH | 1 | `20160217a_ReId_CC_UCLH` | 4405 |
| Imperial | 6 | `NIHRHIC_CC_8.3.2_042014-082014` | 170 |
| | | `092014-122014` | 224 |
| | | `012015-042015` | 244 |
| | | `052015-082015` | 224 |
| | | `092015-102015` | 126 |
| | | `112015-122015` | 116 |
| | | Total | 15443 |

These numbers were obtained from the XML files with the following commands:

# Number of `subjects` per file
find ./ -iname "*.xml" -exec grep -cH "</d:subject>\|</subject>" {} \;
# Total number of `subjects` (sum over all files)
find ./ -iname "*.xml" -exec grep -c "</d:subject>\|</subject>" {} \; | paste -sd+ - | bc
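
One caveat with the commands above: `grep -c` counts matching lines, not matches, so a file with several closing tags on one line would be under-counted. As a cross-check, here is a minimal Python sketch that counts every closing tag instead. The function name `count_subjects` is my own, and it assumes the files are small enough to read into memory one at a time.

```python
import re
from pathlib import Path

# Same two closing tags that the grep pattern looks for.
CLOSING_SUBJECT = re.compile(rb"</(?:d:)?subject>")


def count_subjects(root):
    """Return {file path: number of closing subject tags} for every *.xml under root."""
    counts = {}
    for path in sorted(Path(root).rglob("*")):
        # Case-insensitive extension match, like `find -iname "*.xml"`.
        if path.is_file() and path.suffix.lower() == ".xml":
            counts[str(path)] = len(CLOSING_SUBJECT.findall(path.read_bytes()))
    return counts
```

Calling `count_subjects(".")` and summing the values should reproduce the per-file and total figures, provided each closing tag sits on its own line in the original files.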

`all_patients.RData` contains one more than this total. This seems to be due to an additional file, apparently empty, that was created when the data was broken into chunks.

I'm going through the files available in group_data/Live data to see whether we are missing something, and I'm very confused. For almost all sites we have multiple sets, and I don't know whether the older ones are superseded by the newer ones. The names are also confusing; maybe @NicolaCooper could help me here.

Cambridge

1. CUH20160407 with a single file
2. CUH20160407_JohnFarman with a single file
3. ICU16102015XML with 8 files named ICU_x.xml

We were processing 1 and 2; I'm now adding the files in 3, which contain 392 subjects.

Imperial

1. Imperial NIHR HIC ICM Test Data (I assume this is not needed)
2. NIHRHIC_CC_8.3.2_ICHNT_15022016; with files like: NIHRHIC_CC_8.3.2_mmYYYY-mmYYYY.xml and range: Apr 2014 - Dec 2015
3. NIHRHIC_CC_8.3.2_ICHNT_23112015; with files like: ICHNT_ICM_mmYYYY-mmYYYY.xml and range: Apr 2014 - Dec 2015
4. NIHRHIC_CC_8.3.2_ICHNT_15022016; with files like: NIHRHIC_CC_8.3.2_mmYYYY-mmYYYY.xml and range: Apr 2014 - Dec 2015
5. NIHRHIC_CC_8.3.2_RYJ_19042015_Feb2014_TO_Jan2015; with files like NIHRHIC_CC_8.3.2_RYJ_19042015-mmYYYY.xml and range: Feb 2014 - Jan 2015

In this case I'm running the ones from 4, as they look like the most recent. However, 5 contains data from Feb 2014 (at least, that is what I infer from the filenames), whereas 4 starts in April 2014.
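
To check coverage from the filenames alone, something like the following could work. It is only a sketch: it assumes the `mmYYYY-mmYYYY` pattern seen in the Imperial file names, and the helper names (`month_range`, `find_gaps`) are made up for illustration. Names that do not follow the pattern, such as the `RYJ_19042015-mmYYYY` ones, are skipped rather than guessed at.

```python
import re

# mmYYYY-mmYYYY, not embedded in a longer run of digits.
RANGE = re.compile(r"(?<!\d)(\d{2})(\d{4})-(\d{2})(\d{4})(?!\d)")


def month_range(name):
    """Return ((start_year, start_month), (end_year, end_month)) or None."""
    m = RANGE.search(name)
    if m is None:
        return None
    m1, y1, m2, y2 = map(int, m.groups())
    if not (1 <= m1 <= 12 and 1 <= m2 <= 12):
        return None
    return (y1, m1), (y2, m2)


def next_month(ym):
    year, month = ym
    return (year + 1, 1) if month == 12 else (year, month + 1)


def find_gaps(names):
    """List (first uncovered month, start of next file) for each gap between files."""
    ranges = sorted(r for r in map(month_range, names) if r is not None)
    gaps = []
    for (_, prev_end), (start, _) in zip(ranges, ranges[1:]):
        expected = next_month(prev_end)
        if start > expected:
            gaps.append((expected, start))
    return gaps
```

An empty list from `find_gaps` only says the names are contiguous; it says nothing about what the files actually contain.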

GSTT

There are three directories, whose names differ in the part after CC:

1. NIHRHIC_CC_1.1_GSTT_13032015.xml
2. NIHRHIC_CC_1.3_GSTT_31122015.xml
3. NIHRHIC_CC_1.1_GSTT_31032016.xml

We've been using 3. I assume 1 and 2 are superseded by 3.
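
Note that sorting these names as plain strings does not give chronological order, because the date is written as ddmmyyyy. A minimal sketch for picking the most recent one by its embedded date, assuming every name carries exactly that format (the helper names are hypothetical; names using another order, such as CUH20160407, are not handled):

```python
import re
from datetime import date

# Eight digits read as ddmmyyyy, not embedded in a longer run of digits.
DDMMYYYY = re.compile(r"(?<!\d)(\d{2})(\d{2})(\d{4})(?!\d)")


def extract_date(name):
    """Return the ddmmyyyy date embedded in a name, or None if there is none."""
    m = DDMMYYYY.search(name)
    if m is None:
        return None
    day, month, year = map(int, m.groups())
    return date(year, month, day)  # raises ValueError for impossible dates


def latest(names):
    """Return the name with the most recent embedded date, or None."""
    dated = [(extract_date(n), n) for n in names]
    dated = [(d, n) for d, n in dated if d is not None]
    return max(dated)[1] if dated else None
```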

Oxford

This contains four sets:

1. 20150428 - NIHRHIC-CC_8.3.2_Oxford 28042015.xml
2. 20151105 - NIHRHIC-CC_8.3.2_Oxford 05112015.xml
3. 20151112 - NIHRHIC-CC_8.3.2_Oxford 12112015.xml
4. 20160402 - NIHRHIC-CC_8.3.2_Oxford 02042016.xml

We've been using number 4, but maybe we should use them all?

UCLH

Here I've got the same problem as above: do new extracts supersede the previous ones, or do they have to be added to them?

1. UCLH_Extract1_150216 with a file CC (extract 1).xml
2. UCLH_Extract2_150812 with:
   - part1_2610_NHIC_ICU_v.8.3.2.xml and part1_2610_anon_NHIC_ICU_v.8.3.2.xml
   - part2_2610_NHIC_ICU_v.8.3.2.xml and part2_2610_anon_NHIC_ICU_v.8.3.2.xml
3. UCLH_Extract3_160217 with 20160217a_ReId_CC.xml
4. UCLH_Extract4_160524 with 20160524_anon_ReId.xml

Which one should we be using? We were using number 3, but maybe we should be using all of them.
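
One way to test whether a newer extract supersedes an older one, without waiting for the full parse, might be to fingerprint each subject element and compare the sets. This is a rough sketch under strong assumptions: the only thing taken from the data is that subjects are `subject` elements (with or without the `d:` prefix); the function names are mine; and it only detects subjects that are identical after re-serialisation, so a re-anonymised or reformatted record would count as different.

```python
import hashlib
import xml.etree.ElementTree as ET


def subject_fingerprints(xml_path):
    """Return a set of SHA-256 digests, one per subject element in the file."""
    fingerprints = set()
    for elem in ET.parse(xml_path).iter():
        # Strip any "{namespace}" part so d:subject and subject both match.
        if elem.tag.rsplit("}", 1)[-1] == "subject":
            elem.tail = None  # ignore whitespace that follows the element
            fingerprints.add(hashlib.sha256(ET.tostring(elem)).hexdigest())
    return fingerprints


def compare_extracts(older_path, newer_path):
    """Count subjects unique to each extract and subjects shared by both."""
    older = subject_fingerprints(older_path)
    newer = subject_fingerprints(newer_path)
    return {
        "only_in_older": len(older - newer),
        "only_in_newer": len(newer - older),
        "in_both": len(older & newer),
    }
```

If `only_in_older` comes back as 0, the newer extract contains everything the older one had, and the older one could be skipped. A non-zero value would need a closer look before deciding either way.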

Suggestion

I would like to come up with some sort of naming standard to be applied across all the files. At the moment dates are in a complete mix of formats (e.g. Feb2014, 201402, 022014, 0214, etc.) and the file names do not follow a convention that makes it easy to automate sorting or to understand their content. @jamespjh @docsteveharris @sinanshi @NicolaCooper is this something we could aim to sort out? Does what I'm saying make sense? (I've just spent more than 2 hours trying to find some missing episodes, and I still don't know whether we are missing episodes or have duplicates.)

Since parsing the files takes quite a while, de-duplication cannot be done until everything is parsed, and parsing extra files could be avoided if the naming convention were kept consistent across the different sites. I'm happy to come up with a standard rule myself, but there are quite a lot of things that I don't understand (differences in versions, differences in extracts, differences in names, ...).
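
As a starting point for discussion, here is a sketch of how the month formats listed above could be normalised to `YYYY-MM`, which sorts chronologically as a plain string. The function name is hypothetical and the rules are assumptions on my part: six-digit tokens are treated as `YYYYMM` only when they start with 19 or 20, otherwise as `MMYYYY`, and four-digit tokens are read as `MMYY` in the 2000s.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def normalise_month(token):
    """Convert Feb2014 / 201402 / 022014 / 0214 style tokens to 'YYYY-MM'."""
    t = token.strip().lower()

    # Feb2014: three-letter English month name followed by a four-digit year.
    m = re.fullmatch(r"([a-z]{3})(\d{4})", t)
    if m and m.group(1) in MONTHS:
        return f"{m.group(2)}-{MONTHS[m.group(1)]:02d}"

    if re.fullmatch(r"\d{6}", t):
        # 201402: YYYYMM (assumed only for years 19xx and 20xx).
        if t[:2] in ("19", "20") and 1 <= int(t[4:]) <= 12:
            return f"{t[:4]}-{t[4:]}"
        # 022014: MMYYYY.
        if 1 <= int(t[:2]) <= 12:
            return f"{t[2:]}-{t[:2]}"

    # 0214: MMYY, assumed to be in the 2000s.
    if re.fullmatch(r"\d{4}", t) and 1 <= int(t[:2]) <= 12:
        return f"20{t[2:]}-{t[:2]}"

    raise ValueError(f"unrecognised month token: {token!r}")
```

Anything the rules cannot classify raises an error instead of being guessed, which seems the safer default while we are still unsure about duplicates.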
