The reasons for the performance gap in the calibration-free approach make sense to me. However, from a computer scientist's perspective, the biggest reason is simply the low number of subjects. With few distinct subjects, a deep learning model tends to overfit to the subjects it has seen, and that is exactly what happens on the calibration-based testing set.
Having many sequences per subject is helpful, but a few sequences per subject should be enough. They would be especially useful if those sequences were recorded in different activity states and therefore exhibit strong blood pressure variability [Mukkamala et al., "Step 2" at page 20 and Figure 8].
Referring to Figure 2 in your paper, where you compare #usable subjects to #sequences per subject in MIMIC-III and VitalDB:
For this reason, I would rather have 3500 subjects with 200 sequences each than 2700 subjects with 400 sequences each, even though the resulting total number of sequences is smaller.
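To make the arithmetic behind this preference explicit, here is a minimal sketch; the subject and sequence counts are the illustrative numbers from the example above, not exact counts from the paper:

```python
# Tradeoff: more subjects with fewer sequences each vs. the reverse.
more_subjects = 3500 * 200   # 3500 subjects x 200 sequences each
more_sequences = 2700 * 400  # 2700 subjects x 400 sequences each

print(more_subjects)   # 700000 total sequences
print(more_sequences)  # 1080000 total sequences

# The subject-heavy option has ~35% fewer total sequences,
# but ~30% more distinct subjects, which is what matters most
# for reducing subject-level overfitting.
ratio = more_subjects / more_sequences
print(round(ratio, 2))  # 0.65
```

So the subject-heavy split sacrifices raw sequence count in exchange for diversity across subjects, which is the quantity the calibration-free setting actually tests.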
Measuring calibration-free performance across different numbers of subjects, using your extraction approach, could be a valuable experiment.
Since I have not seen other papers examine the calibration-based/calibration-free difference, I would be glad to hear your thoughts on this.