hipercog / ctap

Computational Testing for Automated Preprocessing - a Matlab toolbox extending EEGLAB functionality for batch processing of EEG

Re-writing export_features_CTAP to more complete database structure? #7

Open janbrogger opened 6 years ago

janbrogger commented 6 years ago

I am wondering whether it might be helpful to rewrite export_features_CTAP. There are no tables for segments, patients, or studies, which means there is a lot of duplication and the results table becomes one "megatable". What do you think?

Feature request to export_features_CTAP:

  1. Add table creation statements for the new tables (patients, studies, segments, measurements) to the Sqlite database.
  2. For each exported feature/value:
     2a. Check whether a patient, study, or segment corresponding to this value already exists; store the patientId, studyId, and segmentId for use in the next step.
     2b. Create a new row in the measurements table with foreign keys to patientId, studyId, and segmentId.
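The steps above could be sketched, for example, in Python with the standard sqlite3 module. All table and column names here are hypothetical illustrations of the request, not CTAP's actual schema:

```python
import sqlite3

# Hypothetical normalized schema for the proposed tables.
SCHEMA = """
CREATE TABLE patients (
  patientId INTEGER PRIMARY KEY AUTOINCREMENT,
  patientstr TEXT UNIQUE
);
CREATE TABLE studies (
  studyId INTEGER PRIMARY KEY AUTOINCREMENT,
  studystr TEXT UNIQUE
);
CREATE TABLE segments (
  segmentId INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp TEXT, duration REAL, latency REAL,
  UNIQUE (timestamp, duration, latency)
);
CREATE TABLE measurements (
  measurementId INTEGER PRIMARY KEY AUTOINCREMENT,
  patientId INTEGER REFERENCES patients(patientId),
  studyId INTEGER REFERENCES studies(studyId),
  segmentId INTEGER REFERENCES segments(segmentId),
  variable TEXT, value REAL
);
"""

def get_or_create(cur, table, id_col, cols, vals):
    """Step 2a: return the existing row's id, or insert a new row and return its id."""
    where = " AND ".join(f"{c} = ?" for c in cols)
    row = cur.execute(f"SELECT {id_col} FROM {table} WHERE {where}", vals).fetchone()
    if row:
        return row[0]
    placeholders = ", ".join("?" for _ in cols)
    cur.execute(f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})", vals)
    return cur.lastrowid

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript(SCHEMA)

# Step 2b: one measurement row pointing at patient/study/segment ids.
pid = get_or_create(cur, "patients", "patientId", ("patientstr",), ("P001",))
sid = get_or_create(cur, "studies", "studyId", ("studystr",), ("pilot",))
gid = get_or_create(cur, "segments", "segmentId",
                    ("timestamp", "duration", "latency"),
                    ("2020-01-01T00:00:00", 2.0, 0.5))
cur.execute("INSERT INTO measurements (patientId, studyId, segmentId, variable, value) "
            "VALUES (?, ?, ?, ?, ?)", (pid, sid, gid, "delta_power", 1.23))

# Re-exporting the same patient reuses the existing row instead of duplicating it.
assert get_or_create(cur, "patients", "patientId", ("patientstr",), ("P001",)) == pid
```

This way each exported value stores three small integer keys instead of repeating the patient, study, and segment descriptions on every row.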
jutako commented 6 years ago

The current implementation of database export has identical tables in ctap/master and ctap/dev. The ctap/dev branch produces a separate database file for each feature group, whereas ctap/master produces just a single file.

The structure is:

```sql
sqlite> .schema
CREATE TABLE subject (
  subjectnr INTEGER PRIMARY KEY,
  subjectstr TEXT,
  sex TEXT,
  age REAL
);
CREATE TABLE measurement (
  measurement TEXT PRIMARY KEY,
  session TEXT,
  description TEXT
);
CREATE TABLE results (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  subjectnr INTEGER,
  measurement TEXT,
  channel TEXT,
  variable TEXT,
  value REAL,
  timestamp TEXT,
  duration REAL,
  latency REAL
);
```

which already contains fields for subjects, measurements and results (feature values). Possible linking of these tables is left to the user.
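Such user-side linking amounts to an ordinary join on subjectnr and measurement. A minimal sketch with Python's built-in sqlite3 module, using the schema quoted above (the data values are made up for illustration):

```python
import sqlite3

# Build the current CTAP export schema and join the three tables by hand.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE subject (subjectnr INTEGER PRIMARY KEY, subjectstr TEXT, sex TEXT, age REAL);
CREATE TABLE measurement (measurement TEXT PRIMARY KEY, session TEXT, description TEXT);
CREATE TABLE results (id INTEGER PRIMARY KEY AUTOINCREMENT, subjectnr INTEGER,
                      measurement TEXT, channel TEXT, variable TEXT, value REAL,
                      timestamp TEXT, duration REAL, latency REAL);
""")
cur.execute("INSERT INTO subject VALUES (1, 'S001', 'F', 30.0)")
cur.execute("INSERT INTO measurement VALUES ('S001_meas1', 'ses1', 'eyes closed')")
cur.execute("INSERT INTO results (subjectnr, measurement, channel, variable, value) "
            "VALUES (1, 'S001_meas1', 'Cz', 'alpha_power', 4.2)")

# Link results to subject and measurement metadata.
rows = cur.execute("""
    SELECT s.subjectstr, m.session, r.channel, r.variable, r.value
    FROM results r
    JOIN subject s ON s.subjectnr = r.subjectnr
    JOIN measurement m ON m.measurement = r.measurement
""").fetchall()
```

The join works only because results happens to carry subjectnr and measurement as plain columns; nothing in the schema enforces the link.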

To me, the main issues relate to the 'results' table:

  1. the results table does not contain a unique measurement id (e.g., casename) for each row; "session" information is also missing.
  2. the calculation segment is documented rather verbosely ("timestamp", "duration", "latency"); it might be possible to store just an id here and keep the segments in a separate table. Each unique measurement needs its own segments, so linking would require both segment_id and casename.
  3. the results table is in long format, which is convenient to work with (e.g., in R) but wastes storage space.

Issue 1 needs to be resolved; the other two seem more like fine-tuning to me.
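One way issue 1 (and the separate segment table from issue 2) could look, sketched again with sqlite3; the column names are hypothetical, not an agreed design:

```python
import sqlite3

# Possible fix: results rows carry casename + session, and segments live in
# their own table keyed by (casename, segmentId), since each unique
# measurement needs its own segments.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE segment (
  casename TEXT,
  segmentId INTEGER,
  timestamp TEXT, duration REAL, latency REAL,
  PRIMARY KEY (casename, segmentId)
);
CREATE TABLE results (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  subjectnr INTEGER,
  casename TEXT,          -- unique measurement id (issue 1)
  session TEXT,           -- previously missing (issue 1)
  segmentId INTEGER,      -- verbose segment fields replaced by an id (issue 2)
  channel TEXT, variable TEXT, value REAL
);
""")
con.execute("INSERT INTO segment VALUES ('S001_meas1', 1, '2020-01-01T00:00:00', 2.0, 0.5)")
con.execute("INSERT INTO results (subjectnr, casename, session, segmentId, channel, variable, value) "
            "VALUES (1, 'S001_meas1', 'ses1', 1, 'Cz', 'alpha_power', 4.2)")

# Linking uses both casename and segmentId, as noted in issue 2.
n = con.execute("""
    SELECT COUNT(*) FROM results r
    JOIN segment g ON g.casename = r.casename AND g.segmentId = r.segmentId
""").fetchone()[0]
```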

Can anyone give examples of how large the sqlite files produced by the current setup get?

Note also that base Matlab lacks tools for working with databases. To do anything fancier, we would need decent database tooling.