Preprocessing: Calculate Euclidean Distance of rows

Lumik7 commented 6 years ago

@rmitsch @MoBran I'm trying to implement the Euclidean distance calculation, but I'm not quite sure what the right solution is. For now I'm only focusing on the "n2" case; see assignment p. 8, the bullet point: "On the one hand you calculate the difference of the norm of the 3dim signal ".

I understand the task like this: Assume we have a matrix with n rows and m columns. For each row we calculate the distance to the other rows. This means we get n-1 new rows of distances for just one row, so n*(n-1) in total, which results in a massive table. Is this the right approach?

rmitsch commented 6 years ago

I assume in your example one row corresponds to one segment of 30s and the columns are the accelerometer values for each time interval?

To my understanding we have to calculate the Euclidean distance between each time interval i in the 30s timeframe. That results in a distance vector and thus in a matrix of individual distances as described by you.

We still lack a distance measure between two segments as a whole though, so we have to transform the distance vector into one scalar. The task description doesn't seem to say how to do that. My guess is that we are supposed to calculate the L2 norm, since other feature representations (median, mean, ...) are mentioned under "Optionally".

If it's up to us to decide, we could do something as simple as min/max/avg/median/L2 norm or something more sophisticated like Hellinger distance (or any other distance measure between sets of values).
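
To make the simple options concrete, here is a minimal sketch of collapsing one distance vector into a scalar. The function name `aggregate` and the sample values are made up for illustration and are not taken from the assignment.

```python
import math
import statistics


def aggregate(distances, how="l2"):
    """Collapse a vector of pointwise distances into a single scalar."""
    if how == "l2":
        return math.sqrt(sum(d * d for d in distances))
    if how == "mean":
        return statistics.mean(distances)
    if how == "median":
        return statistics.median(distances)
    if how == "min":
        return min(distances)
    if how == "max":
        return max(distances)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical pointwise distances between two segments.
d = [3.0, 4.0, 0.0]
print(aggregate(d, "l2"))      # 5.0
print(aggregate(d, "median"))  # 3.0
```

Which aggregation works best is something we would have to try out; the L2 norm is just the default here because it is my guess for what the task expects.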

Lumik7 commented 6 years ago

Yes, your assumption is right, but each column is the n2 value of the accelerometer time series.

> To my understanding we have to calculate the Euclidean distance between each time interval i in the 30s timeframe. That results in a distance vector and thus in a matrix of individual distances as described by you.

My understanding was that we have to calculate the distance of each 30 second snippet to each other.

> We still lack a distance measure between two segments as a whole though, so we have to transform the distance vector into one scalar.

Could you elaborate on why we have to transform it to one scalar?

> If it's up to us to decide, we could do something as simple as min/max/avg/median/L2 norm or something more sophisticated like Hellinger distance (or any other distance measure between sets of values).

When implementing it I would start with the simplest case, which can then be extended if we have time.

rmitsch commented 6 years ago

> My understanding was that we have to calculate the distance of each 30 second snippet to each other.

Yes. The paragraph you quoted says to calculate a "time series of Euclidean distances" though, so we have to start with distances between individual values in the time series (i.e. the distance between the value in snippet A at time = x and the value in snippet B at time = x) before calculating the overall distance between A and B. That's my take on this at least.
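
Roughly what I have in mind, as a sketch. I'm assuming here that the n2 value is the Euclidean norm of one (x, y, z) reading and that both snippets have the same number of samples; `n2` and `pointwise_distances` are hypothetical names, and the readings are made up.

```python
import math


def n2(sample):
    """Euclidean norm of one (x, y, z) accelerometer reading."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)


def pointwise_distances(snippet_a, snippet_b):
    """Distance between the n2 values of A and B at each shared time index."""
    if len(snippet_a) != len(snippet_b):
        raise ValueError("snippets must have the same length")
    # For scalar values the Euclidean distance is the absolute difference.
    return [abs(a - b) for a, b in zip(snippet_a, snippet_b)]


# Made-up raw readings, two time steps per snippet.
raw_a = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
raw_b = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0)]
a = [n2(s) for s in raw_a]  # [5.0, 1.0]
b = [n2(s) for s in raw_b]  # [2.0, 1.0]
print(pointwise_distances(a, b))  # [3.0, 0.0]
```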

> Could you elaborate on why we have to transform it to one scalar?

Because we need one single scalar to act as our distance criterion during clustering.

> When implementing it I would start with the simplest case, which can then be extended if we have time.

Agreed.

Lumik7 commented 6 years ago

> Because we need one single scalar to act as our distance criterion during clustering.

Does this mean that we end up with a data frame of the same dimensionality? Sorry, maybe I'm just "standing on the line" and am overcomplicating things ;)

rmitsch commented 6 years ago

If we start with an n x m matrix (n 30s-segments with m = 30 * 20 datapoints each), we calculate the distance between each segment pairing A, B. To do that, we calculate the distance between the datapoints with the same time value in A and in B. That results in one vector of length m (30 * 20) for each pairing. We don't need to store all of those values though: since ultimately we want one distance (as a scalar) for each pairing A, B, we aggregate that distance vector (e.g. with the L2 norm) to a single scalar.

That means that we end up with one scalar per segment pair. The total number of unordered segment pairings is n * (n - 1) / 2. We can store the distance values in a matrix of size n x n (not memory-efficient, since only about half of all cells are needed, but convenient).
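
As a sketch of that procedure, assuming each segment is already a list of n2 values of equal length and that we aggregate with the L2 norm (`segment_distance` and `distance_matrix` are hypothetical names, and the sample data is made up):

```python
import math
from itertools import combinations


def segment_distance(a, b):
    """L2 norm of the pointwise differences between two equal-length segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(segments):
    """Symmetric n x n matrix with one scalar distance per segment pair."""
    n = len(segments)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # visit each unordered pair once
        d = segment_distance(segments[i], segments[j])
        matrix[i][j] = d
        matrix[j][i] = d  # mirror, since the distance is symmetric
    return matrix


# Three made-up segments with two n2 values each.
segments = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in distance_matrix(segments):
    print(row)
```

The pointwise vector is never stored here; it is consumed directly inside `segment_distance`. For our real data something like `scipy.spatial.distance.pdist` would probably be the faster route, but the plain loop shows the idea.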

Is it clearer now? Sorry if I misunderstood you. It's actually quite straightforward, we're probably just having communication trouble here :-D

Lumik7 commented 6 years ago

Thanks for the detailed explanation. I think I will be able to finish this by tonight. It would be great if you could check my code then.

rmitsch commented 6 years ago

Looks good! Suggestions for next steps:

- Distance measures:
  - Implementation of Euclidean distance for sum of x, y, z norms calculated individually.
  - Implementation of dynamic time warping.
  - Evaluate other distance measures (Hellinger, cosine, L-x norms...).
- I'd like to do a grid search to see if we can get better results out of t-SNE.
- Feature engineering (e.g. max. speed/acceleration, median, ...).
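
For the dynamic time warping item, a minimal sketch of the textbook recurrence on two sequences of scalar values. It has no window constraint and is not tuned for speed; `dtw_distance` is a hypothetical name and the inputs are made up.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 0.0
print(dtw_distance([0.0, 0.0], [1.0, 1.0]))                  # 2.0
```

Unlike the pointwise approach, this also works for sequences of different lengths, which is the main reason to consider it.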

If you agree (or have suggestions), I'll open up issues for those next steps.

Lumik7 commented 6 years ago

Yeah, agree. If you want you can open new issues for each suggestion.

rmitsch commented 6 years ago

Think we can close this issue?

Lumik7 commented 6 years ago

yes

univie-datamining-team3 / assignment2

Preprocessing: Calculate Euclidean Distance of rows #18