jp43 / LSDMap

Package to perform Locally-Scaled Diffusion Map

Saving and communicating large arrays #1

Closed ajkluber closed 9 years ago

ajkluber commented 9 years ago

I want to propose a couple improvements that might help in applications to larger matrices.

  1. Use numpy's save/load instead of savetxt/loadtxt; reading and writing the binary .npy format is much faster than text.
  2. Use mpi4py's uppercase methods comm.Scatter, comm.Gather, etc. (versus their lowercase counterparts). The lowercase methods use pickle to serialize their inputs for communication, which limits them to objects <= 2 GB; the uppercase methods are designed for communicating numpy arrays directly.
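To illustrate the first point, here is a minimal sketch of the text-vs-binary round trip (file names and the small random matrix are placeholders, not the actual LSDMap files):

```python
import os
import tempfile

import numpy as np

# Hypothetical small matrix standing in for a block of the distance matrix.
dm = np.random.rand(100, 100)

tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "example.dm")
npy_path = os.path.join(tmpdir, "example_dm.npy")

# Text format: human-readable, but slow to parse and larger on disk.
np.savetxt(txt_path, dm)
dm_txt = np.loadtxt(txt_path)

# Binary .npy format: much faster, and the round trip is bit-exact.
np.save(npy_path, dm)
dm_npy = np.load(npy_path)

assert np.array_equal(dm, dm_npy)   # binary round trip is exact
assert np.allclose(dm, dm_txt)      # text round trip is only approximate
assert os.path.getsize(npy_path) < os.path.getsize(txt_path)
```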

I'm willing to work on making these changes, at least in the lsdmap subpackage.

The reason I am proposing these changes is that I am trying to compute an LSDMap on a trajectory of 1.5e6 frames by a combination of downsampling, LSDMap, and embedding using RBFs.
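For the second point, a minimal sketch of the uppercase-Scatterv pattern for distributing rows of a coordinate array. `split_counts` is a helper introduced here for illustration (not part of LSDMap), and the MPI section only runs when mpi4py is importable:

```python
import numpy as np


def split_counts(n_rows, n_procs):
    """Rows assigned to each rank when scattering n_rows rows over n_procs."""
    counts = np.full(n_procs, n_rows // n_procs, dtype=int)
    counts[: n_rows % n_procs] += 1  # spread the remainder over early ranks
    return counts


if __name__ == "__main__":
    try:
        from mpi4py import MPI
    except ImportError:
        MPI = None

    if MPI is not None:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        n_rows, n_cols = 1000, 3
        counts = split_counts(n_rows, size)
        displs = np.insert(np.cumsum(counts), 0, 0)[:-1]

        coords = None
        if rank == 0:
            coords = np.random.rand(n_rows, n_cols)

        # Uppercase Scatterv sends the raw buffer directly (no pickle,
        # no 2 GB cap); counts/displacements are in elements, hence * n_cols.
        local = np.empty((counts[rank], n_cols))
        comm.Scatterv([coords, counts * n_cols, displs * n_cols, MPI.DOUBLE],
                      local, root=0)
```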

jp43 commented 9 years ago

Hi Alex,

Please go ahead, these look like great improvements. Let me know if you have any questions while making the changes.

Best, Jordane


Jordane PRETO

Rice University, Anderson Biological Lab, Room 319, 6100 Main Street, Houston, Texas 77005-1892

ajkluber commented 9 years ago

Well, one awkward thing about np.save/np.load is that if you append to a file (e.g. when saving distance_matrix), you then have to call np.load the same number of times when reading it back: each np.load call returns one chunk, in the order it was appended.

This would require using the same number of processors to load the distance matrix later as were used to save it. Is this worth it?
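The append/read-back behavior can be demonstrated with an in-memory file; the three chunks below stand in for the pieces each processor would save. Note that after loading, np.concatenate reassembles the full matrix, so downstream code need not depend on the chunk count:

```python
import io

import numpy as np

# Simulate three "processors" each appending its chunk of a
# hypothetical distance matrix to a single open file.
chunks = [np.random.rand(4, 4) for _ in range(3)]

buf = io.BytesIO()
for chunk in chunks:
    np.save(buf, chunk)  # each np.save call appends one .npy record

# Reading requires one np.load per chunk that was saved, in order.
buf.seek(0)
loaded = [np.load(buf) for _ in range(3)]

assert all(np.array_equal(a, b) for a, b in zip(chunks, loaded))

# Reassembling into one array removes the chunk-count dependence.
full = np.concatenate(loaded)
```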

ajkluber commented 9 years ago

As a crude comparison, for a trajectory of 50,000 frames:

  * np.loadtxt("example.dm") took 70 min; file size 59 GB
  * np.save("example_dm.npy", dm) took 1 min; file size 19 GB