tiwarylab / amino

Automatic Mutual Information Noise Omission
MIT License

Usage with large datasets #6

Open moldyn-nagel opened 3 years ago

moldyn-nagel commented 3 years ago

Hello,

Your method sounds very promising. I have a data set of 750 input coordinates, each with approx. 2 million samples. With amino_fast.py I run out of memory (64 GB RAM), and using the KDE with amino.py would take almost a year on my machine. Hence, I thought I could simply use a histogram instead of the KDE. So I replaced

    # Binning two OP's in 2D space
    def d2_bin(self, x, y):
        """ Calculate a joint probability distribution for two trajectories.

        Parameters
        ----------
        x : np.array
            Trajectory of first OP.

        y : np.array
            Trajectory of second OP.

        Returns
        -------
        p : np.array
            self.bins by self.bins array of joint probabilities from KDE.

        """

        KD = KernelDensity(bandwidth=self.bandwidth,kernel=self.kernel)
        KD.fit(np.column_stack((x,y)), sample_weight=self.weights)
        grid1 = np.linspace(np.min(x),np.max(x),self.bins)
        grid2 = np.linspace(np.min(y),np.max(y),self.bins)
        mesh = np.meshgrid(grid1,grid2)
        data = np.column_stack((mesh[0].reshape(-1,1),mesh[1].reshape(-1,1)))
        samp = KD.score_samples(data)
        samp = samp.reshape(self.bins,self.bins)
        p = np.exp(samp)/np.sum(np.exp(samp))

        return p

simply with

    # Binning two OP's in 2D space
    def d2_bin(self, x, y):
        """ Calculate a joint probability distribution for two trajectories.

        Parameters
        ----------
        x : np.array
            Trajectory of first OP.

        y : np.array
            Trajectory of second OP.

        Returns
        -------
        p : np.array
            self.bins by self.bins array of joint probabilities from a 2D histogram.

        """
        # Pass the sample weights, as the KDE version does via sample_weight
        hist, _, _ = np.histogram2d(x, y, bins=self.bins, weights=self.weights)
        hist = hist.T

        return hist / np.sum(hist)

Is this reasonable for bins=200, or is the KDE needed for a sensitive distance measure? Thank you for your advice.
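
One way to probe this question empirically is to compute the histogram-based joint probabilities on a subsample for several bin counts and see how much the resulting distance changes. The sketch below is standalone and only illustrative: `joint_prob_hist` mirrors the histogram version of `d2_bin` above, while `info_distance` is a hypothetical helper that assumes a distance of the form 1 - I(X;Y)/H(X,Y), which may not match what amino.py computes internally. The synthetic Gaussian data is likewise an assumption, standing in for real order parameters.

```python
import numpy as np


def joint_prob_hist(x, y, bins=200, weights=None):
    """Joint probabilities of two trajectories from a 2D histogram."""
    hist, _, _ = np.histogram2d(x, y, bins=bins, weights=weights)
    # Transpose so rows follow y and columns follow x, as in the KDE version.
    hist = hist.T
    return hist / np.sum(hist)


def info_distance(p):
    """Hypothetical helper: 1 - I(X;Y) / H(X,Y) for a joint probability table.

    This form of the distance is an assumption made for illustration.
    """
    px = p.sum(axis=0, keepdims=True)  # marginal of x, shape (1, bins)
    py = p.sum(axis=1, keepdims=True)  # marginal of y, shape (bins, 1)
    mask = p > 0                       # skip empty bins to avoid log(0)
    joint_entropy = -np.sum(p[mask] * np.log(p[mask]))
    if joint_entropy == 0:
        # All probability mass in a single bin: treat as zero distance.
        return 0.0
    mutual_info = np.sum(p[mask] * np.log(p[mask] / (py @ px)[mask]))
    return 1.0 - mutual_info / joint_entropy


if __name__ == "__main__":
    # Synthetic stand-in data (an assumption, not real order parameters).
    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)
    y_corr = x + 0.5 * rng.normal(size=x.size)  # correlated with x
    y_ind = rng.normal(size=x.size)             # independent of x

    for bins in (50, 100, 200):
        d_corr = info_distance(joint_prob_hist(x, y_corr, bins=bins))
        d_ind = info_distance(joint_prob_hist(x, y_ind, bins=bins))
        print(f"bins={bins:4d}  correlated: {d_corr:.4f}  independent: {d_ind:.4f}")
```

Histogram estimates of mutual information carry a finite-sample bias that grows with the number of bins, so if the distances between pairs keep the same ordering across bin counts on a subsample, that is some evidence the histogram is adequate for the data at hand.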