This repository contains the code to replicate the data processing, modeling and reporting of our Holistic AI in Medicine (HAIM) Publication in Nature Machine Intelligence (Soenksen LR, Ma Y, Zeng C et al. 2022).
This might not be the right place for this issue, as it is about the data that you published on PhysioNet rather than the code you published here, so I would like to apologize in advance for misusing GitHub to bring this up:
I am having trouble loading the `cxr_ic_fusion_1103.csv` file, i.e. the extracted HAIM embeddings, from your PhysioNet repository (https://doi.org/10.13026/3f8d-qe93), in particular with the last two lines:
Both of the last two lines hold 7173 entries, while all others hold 6405 entries. In other words, there are 768 entries more in the last two lines than in all others.
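The per-line entry counts can be checked with a short script along these lines. This is only a sketch: it assumes the file is plain comma-separated text that Python's standard `csv` module can parse, and `field_counts` is a helper name made up for illustration. The demo runs on a tiny synthetic stand-in, not on the real file.

```python
import csv
import io
from collections import Counter


def field_counts(lines):
    """Map each observed field count to the number of rows that have it.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    counts = Counter()
    for row in csv.reader(lines):
        counts[len(row)] += 1
    return counts


# Synthetic stand-in for the real file: two rows with 3 fields, one with 5.
sample = io.StringIO("a,b,c\n1,2,3\n1,2,3,4,5\n")
print(field_counts(sample))
```

For the real file, the same function can be called on `open("cxr_ic_fusion_1103.csv", newline="")`; more than one key in the result means the rows do not all have the same length.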
Moreover, both of the last two lines hold three consecutive runs of exactly repeating elements, starting from index 13 (zero-based) and having a length of 768 entries each, with no gaps (so the starting indices of the repetitions are 781 and 1549, respectively).
My first guess would have been that one embedding vector has been repeated accidentally, but this does not make sense as (1) there are three repetitions of 768 elements in each of the two lines, while the lines in total are only 768 elements longer and (2) the starting position at index 13 does not make any sense semantically if one looks at the header (line 0).
So my questions are: (1) Is this a known problem? (2) Is there anything I can do to reconstruct the last two lines if I want to use all embeddings, or should I just ignore them? By the way, I checked the SHA256 hash of the file, so the download should not have caused the problem.
Update: Just to clarify, by "exactly repeating elements" I do not mean that the entries at indices 13, 14, 15, … all have the same value, but that the entry at index 13 has the same value as the entries at index 781 and 1549, the entry at index 14 has the same value as the entries at index 782 and 1550, and so on.
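The kind of repetition described here can be tested with a check like the following. It is a sketch under assumptions: `has_repeated_blocks` is a hypothetical helper, the row is taken to be a list of already-split fields compared as strings, and the defaults (start 13, length 768, 3 repeats) simply mirror the numbers reported above. The demo uses a small synthetic row.

```python
def has_repeated_blocks(row, start=13, length=768, repeats=3):
    """Return True if row[start:start+length] occurs `repeats` times back to back.

    That is, row[start + i] == row[start + k*length + i] for every offset i
    in the block and every repetition k = 1 .. repeats - 1.
    """
    end = start + repeats * length
    if len(row) < end:
        return False  # row is too short to contain that many blocks
    first = row[start:start + length]
    return all(
        row[start + k * length:start + (k + 1) * length] == first
        for k in range(1, repeats)
    )


# Synthetic row: 2 leading fields, a block of 4 values repeated 3 times, a tail.
block = ["0.1", "0.2", "0.3", "0.4"]
row = ["id", "x"] + block * 3 + ["tail"]
print(has_repeated_blocks(row, start=2, length=4, repeats=3))
```

Note that the check compares block against block, so a row whose values at indices 13, 14, 15, … are all different from each other can still pass it.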