Kl divergence - Githubissues

Project_v2 is old code that I never updated; I just started using project_kuroshio instead. And the print statement was there because at one point I was getting weird errors that made it seem like numpy wasn't loading.

As for your suggestion with sym_kl, we probably should do that but I just wanted to get something that worked. -Matt

On Sun, Apr 25, 2021 at 9:36 PM senchromatic @.***> wrote:

@.**** approved this pull request.

Thank you for sharing the new version. Is project_v2.py based on project_kuroshio.py?

In metrics.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919259 :

@@ -21,13 +21,18 @@ def close_enough(a, b, metric):

Kullback-Leibler divergence, modified to be symmetric

Input: two CDFs of identical shape, e.g. generated from the ecdf function

Note: This is a semi-metric as it doesn't satisfy the triangle inequality

-def sym_kl(cdf1 , cdf2, dx):
+def sym_kl(cdf1 , cdf2):
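For readers without the repository at hand, here is a minimal sketch of what a symmetrised KL divergence on two CDFs could look like. It is not the code in metrics.py: the name `sym_kl_sketch`, the `eps` floor, and the choice to recover bin probabilities with `np.diff` are all assumptions made for illustration.

```python
import numpy as np

def sym_kl_sketch(cdf1, cdf2, eps=1e-12):
    """Symmetrised KL divergence between two CDFs sampled on the same grid.

    Hypothetical illustration only; the real sym_kl in metrics.py may differ.
    """
    cdf1 = np.asarray(cdf1, dtype=float)
    cdf2 = np.asarray(cdf2, dtype=float)
    if cdf1.shape != cdf2.shape:
        raise ValueError("CDFs must have identical shape")

    # Recover per-bin probabilities; prepend=0 keeps the mass of the first bin.
    p = np.diff(cdf1, prepend=0.0)
    q = np.diff(cdf2, prepend=0.0)

    # Assumption: floor empty bins at eps and renormalise to avoid log(0).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p /= p.sum()
    q /= q.sum()

    # KL(p||q) + KL(q||p) collapses to sum((p - q) * log(p / q)).
    return float(np.sum((p - q) * np.log(p / q)))
```

Swapping the two arguments gives the same value and identical inputs give zero, but as the docstring in the hunk notes, the triangle inequality is not guaranteed.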

Thanks for patching up this function!

Here's an optimization suggestion: It looks like we're reading in from file each time this function is called. Could we store a local variable (initialized to None), and check whether it's already initialized before loading the data?
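A rough sketch of the caching idea, in case it helps. The names `_depths_cache` and `get_depths` are hypothetical, and the default file name 'depths.csv' is only a guess at which file the function reads; the `loader` parameter is there so the behaviour can be checked without touching disk.

```python
import numpy as np

# Hypothetical module-level cache; None means "not loaded yet".
_depths_cache = None

def get_depths(path="depths.csv", loader=np.genfromtxt):
    """Return the depth grid, reading the file only on the first call."""
    global _depths_cache
    if _depths_cache is None:
        # First call: hit the disk once and remember the result.
        _depths_cache = loader(path)
    return _depths_cache
```

Note that this simple version ignores `path` after the first call, so it only suits a single fixed input file; a dict keyed by path would be the next step if several files were involved.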

In project_first_hack.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919453 :
 # TODO: investigate why masked_cdfs returned by compute_boxed_cdfs has 1 extra dimension compared to local variable in function
masked_cdfs = masked_cdf[:, 0, :]

+

depths = np.genfromtxt('depths.csv')

coordvals = np.genfromtxt( 'C:/Users/Matt/Desktop/Masters coursework/topology/project/results/sca depth/2000 pts/'

+'boxcoords.csv', delimiter = ',')

#

masked_latitudes = coordvals[0,:]

masked_longitudes = coordvals[1,:]

print('There are ' + str(len(masked_latitudes)) + ' lat/lon boxes')
+1 Very useful... I had thought of printing this earlier but forgot haha

In project_v2.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919601 :

@@ -0,0 +1,244 @@
+## LeDuc, Pereira, Zhang
+# This is a first hack at working with the project data using the KL divergence to measure the distance
+# between two probability distributions of the depth of minimum sound speed.
+import numpy as np
+import pandas as pd
+import pylab as pl # This gets used a lot I promise
+from abstract_simplicial_complex import Point, Simplex, vietoris_rips
+from metrics import ks_test
+from random import sample, seed
+from scipy.interpolate import interp1d
+from statfuncs import ecdf
+print('AAAAAAAAAA')

?

In project_v2.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619919855 :

@@ -0,0 +1,244 @@
+## LeDuc, Pereira, Zhang

If this file contains the same functions as the other one, could we import the functions to minimize code duplication (for sake of maintenance)?

In project_v2.py https://github.com/senchromatic/topological-data-analysis/pull/35#discussion_r619920562 :

geographic_names = generate_geographic_names(masked_latitudes, masked_longitudes)

point_cloud = create_point_cloud(geographic_names, masked_cdfs, ks_test)

a = MIN_SIGNIFICANCE_LEVEL

c_a = np.sqrt(-np.log(a/2)*0.5)

Value the metric needs to exceed to reject the ks test null hypothesis at the given significance level

critical_value = c_a * np.sqrt(2 / len(depths))

dr = 0.01

radii = np.arange(minRadius, 0.46, dr)

After ~ r = 0.29 the KS test rejects F_1=F_2 at the .05 level. Do we care much about what happens past there?

At that point the distributions are statistically dissimilar so grouping them may not be meaningful.

First pass it appears that all the cool stuff happens around there

so maybe we do

homologies = np.zeros( [2, len(radii)] )

for rndx in range(len(radii)):#want the index so we can store the dims of homologies

Why not use filtration instead of vietoris_rips?
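For reference, the `c_a` and `critical_value` lines in the hunk above match the usual large-sample approximation for the two-sample KS test, where the null hypothesis is rejected when the statistic exceeds c(alpha) * sqrt((n + m) / (n * m)). A small sketch follows; the function name `ks_critical_value` is made up for illustration and is not in the repository.

```python
import numpy as np

def ks_critical_value(alpha, n, m=None):
    """Approximate two-sample KS rejection threshold at significance alpha.

    n and m are the two sample sizes; m defaults to n (equal-sized samples).
    """
    if m is None:
        m = n
    # c(alpha) = sqrt(-ln(alpha / 2) / 2), about 1.358 for alpha = 0.05.
    c_alpha = np.sqrt(-0.5 * np.log(alpha / 2.0))
    return c_alpha * np.sqrt((n + m) / (n * m))
```

With equal sample sizes the second factor reduces to sqrt(2 / n), which is the form used in the hunk, assuming `len(depths)` is the effective sample size behind each CDF.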

