lrsoenksen / HAIM

This repository contains the code to replicate the data processing, modeling and reporting of our Holistic AI in Medicine (HAIM) Publication in Nature Machine Intelligence (Soenksen LR, Ma Y, Zeng C et al. 2022).
Apache License 2.0

Broken embeddings file on PhysioNet? #15

Open spezold opened 5 months ago

spezold commented 5 months ago

This might not be the right place for this issue, as it concerns the data that you published on PhysioNet rather than the code that you published here, so I apologize in advance for misusing GitHub to bring this up:

I am having trouble with loading the cxr_ic_fusion_1103.csv file, i.e. the extracted HAIM embeddings, from your PhysioNet repository (https://doi.org/10.13026/3f8d-qe93), in particular with the last two lines:

My first guess would have been that one embedding vector has been repeated accidentally, but this does not make sense as (1) there are three repetitions of 768 elements in each of the two lines, while the lines in total are only 768 elements longer and (2) the starting position at index 13 does not make any sense semantically if one looks at the header (line 0).

So my questions are: (1) Is this a known problem? (2) Is there anything that I can do to reconstruct the last two lines if I want to use all embeddings, or should I just ignore the last two lines? I also checked the SHA256 hash of the file, so the download should not have caused the problem.
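For anyone who wants to locate the affected rows before deciding whether to ignore them, here is a minimal sketch that flags every row whose field count differs from the header's. The function name `find_malformed_rows` is hypothetical, and the sketch assumes a plain comma-separated file with a single header line; it runs on a tiny in-memory sample rather than on the actual PhysioNet file.

```python
import csv
import io


def find_malformed_rows(lines, delimiter=","):
    """Return (row_index, field_count) for each row whose field count
    differs from the header's. The header counts as row 0."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    return [
        (index, len(row))
        for index, row in enumerate(reader, start=1)
        if len(row) != expected
    ]


# Tiny stand-in for the real file: the last row has two extra fields.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n")
print(find_malformed_rows(sample))  # [(2, 5)]
```

For the real file, an open file handle (for example from `open(path, newline="")`) can be passed in place of the `io.StringIO` sample.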

Update: Just to clarify, by "exactly repeating elements" I do not mean that the entries at indices 13, 14, 15, … all have the same value, but that the entry at index 13 has the same value as the entries at index 781 and 1549, the entry at index 14 has the same value as the entries at index 782 and 1550, and so on.
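The repetition pattern described above can be checked with a short sketch like the following. It assumes a row has already been split into a list of field strings; the function name `has_repeated_block` is hypothetical, and the defaults (start index 13, block length 768, three copies) are taken from the description above, not from any knowledge of how the file was produced.

```python
def has_repeated_block(values, start=13, period=768, copies=3):
    """Return True if the `period` values beginning at `start` are
    repeated back to back `copies` times, i.e. values[start + k] equals
    values[start + period + k], values[start + 2 * period + k], and so on."""
    end = start + period * copies
    if len(values) < end:
        return False  # row is too short to hold all copies
    first = values[start:start + period]
    return all(
        values[start + c * period:start + (c + 1) * period] == first
        for c in range(1, copies)
    )


# Small demonstration with a block of 4 values repeated 3 times from index 5.
block = ["a", "b", "c", "d"]
row = ["x"] * 5 + block * 3
print(has_repeated_block(row, start=5, period=4, copies=3))  # True

altered = list(row)
altered[10] = "z"  # break the second copy
print(has_repeated_block(altered, start=5, period=4, copies=3))  # False
```

The comparison is on the raw strings, so two fields count as equal only if they are written identically in the file.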