jp43 / LSDMap

Package to perform Locally-Scaled Diffusion Map

LSDMap fails for >64K configurations #3

Open vivek-bala opened 9 years ago

vivek-bala commented 9 years ago

I am using the LSDMap installed on Stampede.

I ran lsdmap for a file with 64K configurations. Script: https://gist.github.com/vivek-bala/954b24a694b52d79350e It failed with the following error: https://gist.github.com/vivek-bala/312e87d00e1e5273e79d

It is successful, however, for <=32K configurations.

Am I making a mistake somewhere? Maybe the LSDMap install is several versions old; the package seems to have been updated in January of this year.

jp43 commented 9 years ago

Hi Vivek, it seems to be an internal MPI-related error. I remember we had an MPI issue some time ago on Stampede, or was it on Archer? For some reason, it was running OK with a smaller number of CPUs. Could you try running your command with more or fewer CPUs (e.g. 16, 32 and 128) to see if you still get the same issue?

TensorDuck commented 9 years ago

Going from 32K to 64K configurations requires a roughly 4-fold increase in memory, since the distance matrix grows as N^2. I recall seeing segmentation faults when LSDMap attempts to use too much memory while constructing the distance matrix.
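For a rough sense of scale, here is a back-of-the-envelope estimate of the dense pairwise distance matrix size (a sketch only; it assumes double-precision storage on a single node, which is not necessarily how LSDMap distributes the matrix across MPI ranks):

```python
def distance_matrix_bytes(n_frames, bytes_per_entry=8):
    """Memory needed to hold a dense n x n distance matrix."""
    return n_frames * n_frames * bytes_per_entry

for n in (32_000, 64_000):
    gb = distance_matrix_bytes(n) / 1e9
    print(f"{n} frames: ~{gb:.1f} GB")  # ~8.2 GB at 32K, ~32.8 GB at 64K
```

Doubling the frame count exactly quadruples the matrix, which is consistent with 64K runs hitting memory limits that 32K runs do not.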

You should try running with more CPUs to see if it's okay.

vivek-bala commented 9 years ago

I ran lsdmap for a file with 64K configurations with 32 cores and ran into the same error.

For 64K configurations with 128 cores, it reports the following error: https://gist.github.com/vivek-bala/da7423cd35a2ea80ad00. In this case, however, it has also produced the .eg, .ev and nearest-neighbor files, and the eigenvector file has 64000 lines.

vivek-bala commented 9 years ago

It was successful with no errors when I used 256 cores, but it took close to an hour to complete. Is that expected?

TensorDuck commented 9 years ago

Hi Vivek,

How large of a protein are you using? How many atoms are you computing RMSD for?

How long did the 32K configurations take?

The scaling for LSDMap is O(N^2) in both time and memory, where N is the number of frames. Most of the time and memory is spent computing the distance and kernel matrices. Lorenzo and Cecilia are working on speeding it up, but that is not ready yet.
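A minimal sketch of where the O(N^2) cost comes from: every pair of frames contributes one entry to the distance matrix and one to the kernel. (Illustrative only, not the LSDMap code: LSDMap uses pairwise RMSD after optimal alignment and locally-scaled kernel widths, while this toy uses plain Euclidean distance and a single epsilon.)

```python
import numpy as np

def distance_and_kernel(frames, epsilon=1.0):
    """Dense pairwise distances plus a Gaussian kernel: both N x N."""
    n = len(frames)
    dist = np.zeros((n, n))                      # O(N^2) memory
    for i in range(n):                           # O(N^2) time
        for j in range(n):
            dist[i, j] = np.linalg.norm(frames[i] - frames[j])
    kernel = np.exp(-dist**2 / (2 * epsilon**2))
    return dist, kernel
```

Doubling N quadruples both loops' work and both matrices' storage, which is why frame counts that fit comfortably at 32K can fail at 64K.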

vivek-bala commented 9 years ago

I am using a single alanine amino acid. It has 22 atoms (https://raw.githubusercontent.com/radical-cybertools/radical.ensemblemd/master/usecases/extasy_gromacs_lsdmap/inp_files/input.gro).

For 32K configurations on 64 cores, the time taken was 1215 seconds. For 64K configurations on 256 cores, the time taken was 3637 seconds. (Below 256 cores, I am consistently running into the same error as before)

I doubled the number of configurations and increased the resources 4-fold (to account for the O(N^2) behaviour). The time taken nevertheless increased by about 3 times.
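To make the discrepancy concrete, here is the ideal-scaling arithmetic under the O(N^2) model, using the timings reported above (a sketch; it assumes perfect parallel efficiency, which real MPI runs never achieve):

```python
def ideal_time(t_ref, n_ref, p_ref, n, p):
    """Scale a reference runtime by work (proportional to N^2) and core count P."""
    return t_ref * (n / n_ref) ** 2 * (p_ref / p)

# 32K frames on 64 cores took 1215 s; doubling N quadruples the work,
# and quadrupling the cores should cancel it out exactly.
expected = ideal_time(1215, 32_000, 64, 64_000, 256)  # -> 1215.0 s
print(f"expected: {expected:.0f} s, observed: 3637 s "
      f"({3637 / expected:.1f}x slower than ideal)")
```

Under ideal scaling the 64K/256-core run should take the same ~1215 s as the 32K/64-core run, so the observed 3637 s is about a 3x slowdown that the O(N^2) model alone does not explain.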

TensorDuck commented 9 years ago

Hi Vivek,

That doesn't sound right, actually. I'll double-check on our cluster here at Rice to make sure.

The fact that the runtime does not scale with the number of processors is the most baffling part to me.

jp43 commented 9 years ago

Vivek, could you check the lsdmap.log file to see how much time each step is taking? The O(N^2) behaviour only applies to the computation of the distance matrix; the 3-fold difference may be due to the other steps involved in the overall time.

vivek-bala commented 9 years ago

For 64K configs:

INFO:root:14:31:50: intializing LSDMap...
INFO:root:14:31:52: input coordinates loaded
INFO:root:14:32:01: LSDMap initialized
INFO:root:14:34:39: distance matrix computed
INFO:root:14:34:42: kernel diagonalized
INFO:root:14:34:50: Eigenvalues/eigenvectors saved (.eg/.ev files)
INFO:root:15:32:10: LSDMap computation done

For 32K configs:

INFO:root:17:40:16: intializing LSDMap...
INFO:root:17:40:16: input coordinates loaded
INFO:root:17:40:17: LSDMap initialized
INFO:root:17:42:56: distance matrix computed
INFO:root:17:42:58: kernel diagonalized
INFO:root:17:42:59: Eigenvalues/eigenvectors saved (.eg/.ev files)
INFO:root:17:59:18: LSDMap computation done
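
Differencing the consecutive timestamps in the 64K log above (a quick standalone sketch, not part of LSDMap) shows that nearly all of the runtime falls after the eigenvectors are saved:

```python
from datetime import datetime

# Timestamps copied from the 64K-configuration log above.
steps = [
    ("init start",          "14:31:50"),
    ("coords loaded",       "14:31:52"),
    ("LSDMap initialized",  "14:32:01"),
    ("distance matrix",     "14:34:39"),
    ("kernel diagonalized", "14:34:42"),
    ("eigenvectors saved",  "14:34:50"),
    ("computation done",    "15:32:10"),
]
times = [datetime.strptime(t, "%H:%M:%S") for _, t in steps]
for (name, _), t0, t1 in zip(steps[1:], times, times[1:]):
    print(f"{name:20s} {(t1 - t0).total_seconds():6.0f} s")
```

The O(N^2) distance matrix takes only 158 s; the final "saved -> done" step takes 3440 s of the ~3620 s total, which matches the diagnosis in the next comment.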

jp43 commented 9 years ago

Apparently, most of the time is spent between the last two lines of the log file. At that point, we are basically saving the distance matrix and/or the nearest neighbors, and only if the -d or -n flags were specified. Did you specify either of these options? If not, I have already run into this strange problem when running LSDMap tests on Archer: the code was doing basically nothing between the last two statements of the log file, but it was still taking a lot of time. I concluded that somehow the logging module was not working well with many CPUs. If you are not using the -n or -d flags, try commenting out the last lines of lsdm.py that print the last two statements of the log file, to see if there is any difference.

vivek-bala commented 9 years ago

I use '-n' to name the neighbour file. I thought that was required. I'll try it with the changes you suggested.

vivek-bala commented 9 years ago

Could you tell me the exact lines to comment out please.

jp43 commented 9 years ago

I think the '-n' flag is only needed when running DM-d-MD, because you need to save the nearest neighbors to be able to reweight correctly after selecting the new walkers. If you only want to test LSDMap, this option is not mandatory. The lines to comment out are lines 412 to 423 (inclusive) in https://github.com/jp43/lsdmap/blob/master/lsdmap/lsdm.py. However, if you use DM-d-MD, try commenting out only lines 412 and 423. In that case, if the time between the last two statements of the log file is still long, it would mean that the function "save_nneighbors" takes most of the time.