To get rid of the interviewer's voice, keep only the audio segments that belong to the subject. The transcripts have timestamps that indicate which speaker was talking at which time. You can use a tool like ffmpeg on the Linux command line to cut the audio at the timestamps you want, or you can load the audio and the transcript timestamps in Python and cut it programmatically.
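
As a rough illustration of the Python route, here is a minimal sketch. It assumes the audio is already loaded as a NumPy array with a known sample rate, and that the transcript has been parsed into `(start_seconds, stop_seconds, speaker)` tuples. The helper name `keep_speaker` and the speaker labels `"Interviewer"` and `"Participant"` are made up for this example, so adjust them to whatever your transcripts actually contain.

```python
import numpy as np

def keep_speaker(samples, sample_rate, segments, speaker="Participant"):
    """Concatenate the samples of every segment spoken by `speaker`.

    `segments` is a list of (start_seconds, stop_seconds, speaker) tuples.
    This layout is an assumption, not a fixed transcript format.
    """
    pieces = []
    for start_s, stop_s, who in segments:
        if who != speaker:
            continue
        # Convert timestamps in seconds to sample indices.
        start = int(round(start_s * sample_rate))
        stop = int(round(stop_s * sample_rate))
        pieces.append(samples[start:stop])
    if not pieces:
        return samples[:0]  # empty array with the same dtype
    return np.concatenate(pieces)

# Toy data: 4 seconds of fake "audio" at 10 samples per second.
sample_rate = 10
samples = np.arange(40)
segments = [
    (0.0, 1.0, "Interviewer"),
    (1.0, 2.5, "Participant"),
    (2.5, 3.0, "Interviewer"),
    (3.0, 4.0, "Participant"),
]
subject_audio = keep_speaker(samples, sample_rate, segments)
```

With real recordings you would load `samples` and `sample_rate` from the audio file first and write `subject_audio` back out afterwards; the slicing logic stays the same.
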
For higher-order statistics of the COVAREP features, compute statistics such as the mean, max, min, median, kurtosis, and skew over the array of frame-level values for each feature. Each statistic reduces the array to a single scalar value.
```
import numpy as np

# frame-level statistic of a feature
covarep_feature_1 = [0.1, 0.1, 0.3, 1.3, 0.5]

# higher-order statistic
mean = np.mean(covarep_feature_1)
```

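If you want all of the statistics at once, something like the sketch below might help. The helper name `summarize` is made up for this example. Skew and kurtosis are computed here from standardized moments using the population standard deviation, and the kurtosis is the excess kurtosis (a normal distribution gives 0). `scipy.stats.skew` and `scipy.stats.kurtosis` use the same convention by default, if you prefer to use those.

```python
import numpy as np

def summarize(values):
    """Reduce a 1-D array of frame-level values to scalar statistics."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std()  # population standard deviation (ddof=0)
    # Standardized values; this divides by zero if all values are equal.
    z = (x - mean) / std
    return {
        "mean": mean,
        "max": x.max(),
        "min": x.min(),
        "median": np.median(x),
        "skew": np.mean(z ** 3),
        "kurtosis": np.mean(z ** 4) - 3.0,  # excess kurtosis
    }

covarep_feature_1 = [0.1, 0.1, 0.3, 1.3, 0.5]
stats = summarize(covarep_feature_1)
```

If your features are stored as a 2-D array of shape (frames, features), the NumPy reductions accept `axis=0`, so `np.mean(features, axis=0)` gives one value per feature.
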


I hope that clarifies it.