senchromatic / topological-data-analysis

Applied topology using abstract simplicial complexes

Kl divergence #35

Closed: mfleduc closed this 3 years ago

mfleduc commented 3 years ago

Updated the KL divergence: it needs a derivative, and the points are not evenly spaced, so I changed the code to compute that derivative accurately.
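For readers following along, here is a minimal sketch of one way to do what this comment describes: differentiate CDFs sampled on an unevenly spaced grid and integrate a symmetric KL divergence over that grid. It is not the code from this PR. The names `pdf_from_cdf`, `trapezoid`, `sym_kl_uneven`, the explicit grid argument `x`, and the `eps` floor are all assumptions made for illustration, and the symmetrization used here (the sum of both directed divergences) is one common choice rather than necessarily the one in metrics.py.

```python
import numpy as np

def pdf_from_cdf(cdf, x):
    """Differentiate a CDF sampled at strictly increasing, possibly uneven points x."""
    # np.gradient accepts the sample coordinates, so uneven spacing is handled.
    pdf = np.gradient(cdf, x)
    # Finite differences can dip slightly below zero; a density cannot.
    return np.clip(pdf, 0.0, None)

def trapezoid(y, x):
    """Trapezoidal rule on a non-uniform grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def sym_kl_uneven(cdf1, cdf2, x, eps=1e-12):
    """Symmetric KL: KL(p||q) + KL(q||p), with p and q derived from the CDFs."""
    p = pdf_from_cdf(cdf1, x) + eps  # eps (an assumption) avoids log(0) and 0/0
    q = pdf_from_cdf(cdf2, x) + eps
    # p*log(p/q) + q*log(q/p) simplifies to (p - q)*log(p/q), which is never negative.
    return trapezoid((p - q) * np.log(p / q), x)
```

Passing the grid explicitly replaces a single `dx` argument, which only makes sense when the points are evenly spaced.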

mfleduc commented 3 years ago

project_v2 is old code that I never updated; I just started using project_kuroshio instead. The print statement was there because at one point I was getting weird errors that made it seem like numpy wasn't loading.

As for your suggestion with sym_kl, we probably should do that but I just wanted to get something that worked. -Matt

On Sun, Apr 25, 2021 at 9:36 PM senchromatic @.***> wrote:

@.**** approved this pull request.

Thank you for sharing the new version. Is project_v2.py based on project_kuroshio.py?

In metrics.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919259 :

@@ -21,13 +21,18 @@ def close_enough(a, b, metric):

 # Kullback-Leibler divergence, modified to be symmetric
 # Input: two CDFs of identical shape, e.g. generated from the ecdf function
 # Note: This is a semi-metric as it doesn't satisfy the triangle inequality
-def sym_kl(cdf1 , cdf2, dx):
+def sym_kl(cdf1 , cdf2):

Thanks for patching up this function!

Here's an optimization suggestion: It looks like we're reading in from file each time this function is called. Could we store a local variable (initialized to None), and check whether it's already initialized before loading the data?
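A minimal sketch of the caching pattern suggested here, assuming the file in question is a CSV read with `np.genfromtxt`. The names `_depths` and `get_depths` and the default path are hypothetical, and which file metrics.py actually reads is not shown in this thread.

```python
import numpy as np

_depths = None  # hypothetical module-level cache, filled on first use

def get_depths(path='depths.csv'):
    """Load the grid from disk once and reuse the array on later calls."""
    global _depths
    if _depths is None:
        # Only the first call touches the file system.
        _depths = np.genfromtxt(path)
    return _depths
```

One caveat of this design: after the first call the `path` argument is ignored, so it only suits a single fixed input file.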

In project_first_hack.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919453 :

 # TODO: investigate why masked_cdfs returned by compute_boxed_cdfs has 1 extra dimension compared to local variable in function
 masked_cdfs = masked_cdf[:, 0, :]
+
+ depths = np.genfromtxt('depths.csv')
+ coordvals = np.genfromtxt( 'C:/Users/Matt/Desktop/Masters coursework/topology/project/results/sca depth/2000 pts/'
+ +'boxcoords.csv', delimiter = ',')
+ #
+ masked_latitudes = coordvals[0,:]
+ masked_longitudes = coordvals[1,:]
+ print('There are ' + str(len(masked_latitudes)) + ' lat/lon boxes')

+1 Very useful... I had thought of printing this earlier but forgot haha

In project_v2.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919601 :

@@ -0,0 +1,244 @@
+## LeDuc, Pereira, Zhang
+# This is a first hack at working with the project data using the KL divergence to measure the distance
+# between two probability distributions of the depth of minimum sound speed.
+import numpy as np
+import pandas as pd
+import pylab as pl # This gets used a lot I promise
+from abstract_simplicial_complex import Point, Simplex, vietoris_rips
+from metrics import ks_test
+from random import sample, seed
+from scipy.interpolate import interp1d
+from statfuncs import ecdf
+print('AAAAAAAAAA')

?

In project_v2.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919855 :

@@ -0,0 +1,244 @@
+## LeDuc, Pereira, Zhang

If this file contains the same functions as the other one, could we import them instead, to minimize code duplication (for the sake of maintenance)?

In project_v2.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619920562 :

+ geographic_names = generate_geographic_names(masked_latitudes, masked_longitudes)
+ point_cloud = create_point_cloud(geographic_names, masked_cdfs, ks_test)
+ a = MIN_SIGNIFICANCE_LEVEL
+ c_a = np.sqrt(-np.log(a/2)*0.5)
+ # Value the metric needs to exceed to reject the ks test null hypothesis at the given significance level
+ critical_value = c_a * np.sqrt(2 / len(depths))
+ dr = 0.01
+ radii = np.arange(minRadius, 0.46, dr)
+ # After ~ r = 0.29 the KS test rejects F_1=F_2 at the .05 level. Do we care much about what happens past there?
+ # At that point the dists are statistically disimilar so grouping them may not be meaningful.
+ # First pass it appears that all the cool stuff happens around there
+ # so maybe we do
+ homologies = np.zeros( [2, len(radii)] )
+ for rndx in range(len(radii)):#want the index so we can store the dims of homologies

Why not use filtration instead of vietoris_rips?
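On the `critical_value` lines in the hunk above: the expression matches the usual large-sample two-sample KS threshold, specialized to two samples of equal size (`len(depths)`). A small sketch of the general form, for reference only; the function name `ks_critical_value` and its arguments are assumptions for illustration and do not appear in the repository.

```python
import numpy as np

def ks_critical_value(alpha, n, m):
    """Large-sample threshold the two-sample KS statistic must exceed to reject at level alpha."""
    # c(alpha) = sqrt(-ln(alpha / 2) / 2)
    c_alpha = np.sqrt(-0.5 * np.log(alpha / 2.0))
    # For n == m the factor sqrt((n + m) / (n * m)) reduces to sqrt(2 / n), as in the hunk.
    return float(c_alpha * np.sqrt((n + m) / (n * m)))
```

For example, with `alpha = 0.05` and 50 points in each sample, `c_alpha` is about 1.358 and the threshold is about 0.272.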
