Closed sinanshi closed 7 years ago
On IDHS: there are 11 files in total so far.

| from | Nfiles | file name | subjects |
|---|---|---|---|
| Oxford | 1 | Oxford_02042016 | 798 |
| Cambridge | 2 | CUH_07042016 | 147 |
| | | CUHJF02062016 | 174 |
| GSTT | 1 | GSTT_31032016 | 8815 |
| UCLH | 1 | 20160217a_ReId_CC_UCLH | 4405 |
| Imperial | 6 | NIHRHIC_CC_8.3.2_042014-082014 | 170 |
| | | 092014-122014 | 224 |
| | | 012015-042015 | 244 |
| | | 052015-082015 | 224 |
| | | 092015-102015 | 126 |
| | | 112015-122015 | 116 |
| Total | | | 15443 |

These numbers were obtained from the XML files with the following commands:

    # Number of `subjects` per file
    find ./ -iname "*.xml" -exec grep -cH "</d:subject>\|</subject>" {} \;
    # Total number of `subjects` (sum of all)
    find ./ -iname "*.xml" -exec grep -c "</d:subject>\|</subject>" {} \; | paste -sd+ - | bc
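One caveat worth checking: `grep -c` counts matching lines, not matches, so the figures are exact only if every closing subject tag sits on its own line. A minimal sketch that counts every occurrence instead is below. The helper name `count_subjects` is hypothetical, and the sketch assumes a `grep` that supports `-o` (GNU and BSD grep both do).

```shell
# count_subjects is a hypothetical helper, not part of the original workflow.
# It counts every closing subject tag in the given files, even when several
# tags share one line.
count_subjects() {
    # -o prints each match on its own line, -h drops the file-name prefix,
    # -E lets us write the optional "d:" namespace prefix as (d:)?.
    # "|| true" absorbs grep's exit status 1 when a file has no match.
    { grep -ohE '</(d:)?subject>' "$@" || true; } | wc -l | tr -d ' '
}

# Possible usage, one count per file in the current directory:
#   for f in ./*.xml; do printf '%s:%s\n' "$f" "$(count_subjects "$f")"; done
```

If the per-file counts from this helper and from `grep -c` agree, the totals in the table stand as they are.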
`all_patients.RData` contains one extra. This is due to an additional file, created when the data were broken into bits, that seems to have been empty.
I'm going through the files available in `group_data/Live data` to see whether we are missing something, and I'm very confused. For almost all the sites we have multiple sets; whether they are superseded by the new ones I don't know. Also, the names are confusing - maybe @NicolaCooper could help me here.
For Cambridge there are three directories:
1. `CUH20160407`, with a single file
2. `CUH20160407_JohnFarman`, with a single file
3. `ICU16102015XML`, with 8 files named `ICU_x.xml`

We were processing 1 and 2; I'm now adding the ones in 3, which contain 392 subjects.
For Imperial there are five directories:
1. `Imperial NIHR HIC ICM Test Data` (I assume this is not needed)
2. `NIHRHIC_CC_8.3.2_ICHNT_15022016`; with files like `NIHRHIC_CC_8.3.2_mmYYYY-mmYYYY.xml` and range: Apr 2014 - Dec 2015
3. `NIHRHIC_CC_8.3.2_ICHNT_23112015`; with files like `ICHNT_ICM_mmYYYY-mmYYYY.xml` and range: Apr 2014 - Dec 2015
4. `NIHRHIC_CC_8.3.2_ICHNT_15022016`; with files like `NIHRHIC_CC_8.3.2_mmYYYY-mmYYYY.xml` and range: Apr 2014 - Dec 2015
5. `NIHRHIC_CC_8.3.2_RYJ_19042015_Feb2014_TO_Jan2015`; with files like `NIHRHIC_CC_8.3.2_RYJ_19042015-mmYYYY.xml` and range: Feb 2014 - Jan 2015

In this case I'm running the ones from 4, as they look like the most recent. However, 5 contains data from Feb 2014 (at least I'm inferring so from the filenames), whereas 4 starts in April 2014.
For GSTT there are three directories, whose names differ in the bit after `CC`:
1. `NIHRHIC_CC_1.1_GSTT_13032015.xml`
2. `NIHRHIC_CC_1.3_GSTT_31122015.xml`
3. `NIHRHIC_CC_1.1_GSTT_31032016.xml`

We've been using 3. I assume 1 and 2 are superseded by 3.
Oxford contains four sets:
1. `20150428` - `NIHRHIC-CC_8.3.2_Oxford 28042015.xml`
2. `20151105` - `NIHRHIC-CC_8.3.2_Oxford 05112015.xml`
3. `20151112` - `NIHRHIC-CC_8.3.2_Oxford 12112015.xml`
4. `20160402` - `NIHRHIC-CC_8.3.2_Oxford 02042016.xml`

We've been using number 4, but maybe we should use them all?
For UCLH I've got the same problem as above: do new extracts supersede the previous ones, or do they have to be added to the previous ones?
1. `UCLH_Extract1_150216`, with a file `CC (extract 1).xml`
2. `UCLH_Extract2_150812`, with:
   - `part1_2610_NHIC_ICU_v.8.3.2.xml` and `part1_2610_anon_NHIC_ICU_v.8.3.2.xml`
   - `part2_2610_NHIC_ICU_v.8.3.2.xml` and `part2_2610_anon_NHIC_ICU_v.8.3.2.xml`
3. `UCLH_Extract3_160217`, with `20160217a_ReId_CC.xml`
4. `UCLH_Extract4_160524`, with `20160524_anon_ReId.xml`

Which one should we be using? We were using number 3, but maybe we should be using all of them.
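One cheap check that can be run before parsing anything is whether any of the XML files in these directories are byte-for-byte identical, since an identical file can be skipped. A minimal sketch is below. The helper name `list_identical_xml` is hypothetical, and it relies only on `find`, `cksum`, `sort` and `awk`. It will not detect extracts that overlap in subjects but are packaged differently; that still needs the parsed data.

```shell
# list_identical_xml is a hypothetical helper: it prints the paths of XML
# files (under the given directories) that share both checksum and size
# with at least one other file.
list_identical_xml() {
    # cksum prints "<checksum> <size> <path>" for every file.
    find "$@" -type f -iname '*.xml' -exec cksum {} + |
        sort -k1,1n -k2,2n |
        awk '{
            key = $1 " " $2                      # checksum plus size
            name = $0
            sub(/^[0-9]+ [0-9]+ /, "", name)     # keep the path, spaces included
            if (key == prev) {
                if (!shown) print prev_name      # first member of the group
                print name
                shown = 1
            } else {
                shown = 0
            }
            prev = key
            prev_name = name
        }'
}

# Possible usage (directory names taken from the list above):
#   list_identical_xml UCLH_Extract3_160217 UCLH_Extract4_160524
```

An empty result means no two files are identical, which says nothing either way about whether a newer extract supersedes an older one.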
I would like to come up with some sort of naming standard to be used across all the files. At the moment dates are in a complete mix of formats (e.g. Feb2014, 201402, 022014, 0214, etc.) and the file names do not follow a convention that makes it easy to automate sorting and to understand their content. @jamespjh @docsteveharris @sinanshi @NicolaCooper is this something we could aim to sort out? Does what I'm saying make sense? (I've just spent more than 2 hours trying to find some missing episodes, and I'm in a situation where I don't know whether we are missing episodes or have duplicates.)
Since parsing the files takes quite a while, the de-duplication cannot be done until everything is parsed, and parsing extra files could be avoided if the naming convention were kept consistent between the different sites. I'm happy to come up with a standard rule myself, but there are quite a lot of things that I don't understand (differences in versions, differences in extracts, differences in names, ...).
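As a possible starting point for such a convention, here is a minimal sketch that rewrites a `ddmmYYYY` date token in a file name as `YYYYMMDD`, so that names sort chronologically. The helper name `iso_date_name` and the choice of `YYYYMMDD` as the target format are assumptions for illustration only. The other formats seen above (`Feb2014`, `150216`, `mmYYYY-mmYYYY`) would each need their own rule.

```shell
# iso_date_name is a hypothetical helper: it rewrites the first 8-digit
# ddmmYYYY token (years 2000-2099 only) in a name as YYYYMMDD.
# Names whose date token is already YYYYMMDD are left unchanged, because
# their last four digits do not form a 20xx year.
iso_date_name() {
    printf '%s\n' "$1" |
        sed -E 's/(^|[^0-9])([0-3][0-9])([01][0-9])(20[0-9]{2})([^0-9]|$)/\1\4\3\2\5/'
}

iso_date_name "NIHRHIC_CC_1.1_GSTT_31032016.xml"      # NIHRHIC_CC_1.1_GSTT_20160331.xml
iso_date_name "NIHRHIC-CC_8.3.2_Oxford 02042016.xml"  # NIHRHIC-CC_8.3.2_Oxford 20160402.xml
iso_date_name "20160217a_ReId_CC.xml"                 # unchanged
```

The pattern only checks the rough shape of a date (day starting 0-3, month starting 0-1), so renamed files would still be worth a manual look.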
IDSH = 15444(3) Local = 15677