usnistgov / fipy

FiPy is a Finite Volume PDE solver written in Python
http://pages.nist.gov/fipy/en/latest

Unexpected parallel results #1017

Open Edward-Rawson opened 7 months ago

Edward-Rawson commented 7 months ago

I originally posted about this on the mailing list but, in hindsight, here seems more appropriate. I'm not getting the expected results when running in parallel across multiple cores on one cluster node. The problem persists with both Trilinos with Gmsh 4.11.1 and PETSc with Gmsh 3.0.6. The script below (run with Trilinos --no-pysparse) demonstrates this:

import sys

import numpy as np
from fipy import *
from PyTrilinos import Epetra

np.set_printoptions(threshold=sys.maxsize)

# 10 x 10 x 10 grid of 1 mm cells, built through Gmsh
mesh = GmshGrid3D(0.001, 0.001, 0.001, 10, 10, 10)
X = mesh.cellCenters[0]
T = CellVariable(mesh=mesh, value=20.)
alpha = CellVariable(mesh=mesh, value=2.88e-6)
# step change: 100 in the upper half of the x range, 20 below
T.setValue(100, where=(X >= 0.5*max(X)))
eq = TransientTerm() == DiffusionTerm(coeff=alpha)

# report how the cells are distributed across processors
sys.stdout.write(
    ("\n PID = "+str(Epetra.PyComm().MyPID())+"; "
     +str(mesh.numberOfCells)+
     " elements on processor "+str(parallelComm.procID)+
     " of "+str(parallelComm.Nproc)))
parallelComm.Barrier()

for step in range(3):
    eq.solve(T, dt=0.75)

# gather the full solution on processor 0 and print one z-slice
parallelComm.Barrier()
Tg = T.globalValue
if parallelComm.procID == 0:
    Tr = np.reshape(Tg, (10, 10, 10))
    sys.stdout.write(("\n"+str(np.round(Tr[:, :, 9], 0))+"\n"))
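For reference, a script like this is launched through MPI; something along the lines of the following (the script name is a placeholder; 10 ranks matches the output below):

mpirun -np 10 python parallel_test.py --trilinos --no-pysparse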

Output from the serial case:

 PID = 0; 1000 elements on processor 0 of 1
[[32. 32. 32. 32. 32. 32. 32. 32. 32. 32.]
 [35. 35. 35. 35. 35. 35. 35. 35. 35. 35.]
 [39. 39. 39. 39. 39. 39. 39. 39. 39. 39.]
 [46. 46. 46. 46. 46. 46. 46. 46. 46. 46.]
 [55. 55. 55. 55. 55. 55. 55. 55. 55. 55.]
 [65. 65. 65. 65. 65. 65. 65. 65. 65. 65.]
 [74. 74. 74. 74. 74. 74. 74. 74. 74. 74.]
 [81. 81. 81. 81. 81. 81. 81. 81. 81. 81.]
 [85. 85. 85. 85. 85. 85. 85. 85. 85. 85.]
 [88. 88. 88. 88. 88. 88. 88. 88. 88. 88.]]

Output from the parallel case with Trilinos --no-pysparse, Gmsh 4.11.1 (environment built from "conda-trilinos-lock.yml"):

Warning : Saving a partitioned mesh in a format older than 4.0 may cause information loss

 PID = 1; 197 elements on processor 1 of 10
 PID = 2; 241 elements on processor 2 of 10
 PID = 6; 185 elements on processor 6 of 10
 PID = 7; 194 elements on processor 7 of 10
 PID = 3; 207 elements on processor 3 of 10
 PID = 4; 194 elements on processor 4 of 10
 PID = 5; 199 elements on processor 5 of 10
 PID = 8; 256 elements on processor 8 of 10
 PID = 9; 196 elements on processor 9 of 10
 PID = 0; 223 elements on processor 0 of 10
[[43. 48. 46. 55. 52. 62. 72. 70. 79. 44.]
 [42. 47. 45. 52. 62. 61. 70. 69. 77. 44.]
 [42. 47. 55. 52. 64. 60. 72. 79. 79. 40.]
 [37. 41. 39. 47. 58. 55. 67. 76. 81. 40.]
 [39. 42. 42. 48. 48. 54. 67. 76. 83. 41.]
 [44. 51. 49. 57. 69. 66. 76. 83. 83. 64.]
 [60. 72. 80. 80. 87. 84. 90. 88. 91. 79.]
 [85. 84. 89. 89. 92. 92. 91. 93. 93. 79.]
 [86. 84. 90. 89. 92. 92. 92. 94. 93. 83.]
 [88. 86. 88. 92. 90. 91. 93. 91. 92. 93.]]

With regards to that warning: Gmsh 4.11.1 was documented as FiPy-compatible (>=4.5.2) in 2021, but I tried downgrading to below 4.0 because of an older discussion here. Furthermore, the FiPy documentation notes that "PyTrilinos on conda-forge presently only provides 12.10, which is limited to Python 2.x.", whilst the above environment ran Python 3.7.12 with PyTrilinos 12.18.1, installed via conda-forge. Although this may contribute, it wasn't relevant to the following case:

Output from the parallel case with PETSc, Gmsh 3.0.6, and without Epetra.PyComm().MyPID() or from PyTrilinos import Epetra:

 196 elements on processor 5 of 10
 215 elements on processor 7 of 10
 199 elements on processor 8 of 10
 192 elements on processor 1 of 10
 239 elements on processor 3 of 10
 186 elements on processor 2 of 10
 192 elements on processor 4 of 10
 227 elements on processor 6 of 10
 210 elements on processor 9 of 10
 180 elements on processor 0 of 10
[[44. 44. 44. 44. 44. 44. 44. 44. 43. 43.]
 [47. 47. 47. 47. 47. 47. 47. 47. 46. 46.]
 [53. 53. 53. 53. 53. 53. 53. 52. 52. 52.]
 [61. 60. 61. 61. 61. 61. 60. 60. 60. 59.]
 [68. 68. 68. 68. 69. 69. 68. 68. 67. 67.]
 [76. 76. 76. 76. 76. 76. 76. 76. 75. 75.]
 [83. 83. 83. 83. 83. 83. 83. 82. 82. 82.]
 [88. 88. 88. 88. 88. 87. 87. 87. 87. 87.]
 [91. 91. 91. 91. 91. 91. 91. 90. 90. 90.]
 [92. 92. 92. 92. 92. 92. 92. 92. 92. 92.]]

Closer, and the warning message is gone, but the results are still not correct; I suspect this could be a separate problem.

In both cases, I would appreciate any suggestions – if there are questions or if more information is needed, just let me know and I'll respond as soon as possible.

Kind regards, Ed

Edward-Rawson commented 7 months ago

Quick update: for the same Gmsh version (3.0.6), I get identical results with the Trilinos --no-pysparse solver as with the PETSc (3.20.4) case above. Given this, I tried to replicate this method, which mentioned Gmsh not always producing a correctly sized grid, by replacing the second chunk of code in the above example with the following:

mesh = GmshGrid3D(0.001, 0.001, 0.001, 10, 10, 10)
#X = mesh.cellCenters[0]
T = CellVariable(mesh=mesh)
# build the initial condition globally, then pad/truncate it to match
# however many values globalValue actually holds
T0 = np.ones((10, 10, 10)) * 20.
T0[5:, :, :] = 100
T0 = np.resize(T0.flatten(), len(T.globalValue))
T.setValue(T0.copy())
alpha = CellVariable(mesh=mesh, value=2.88e-6)
#T.setValue(100, where=(X >= 0.5*max(X)))
eq = TransientTerm() == DiffusionTerm(coeff=alpha)
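For context, np.resize pads by repetition or truncates to the requested length, so the reassignment above only alters T0 if len(T.globalValue) differs from the 1000 cells of the grid:

import numpy as np

np.resize(np.arange(4), 6)   # -> array([0, 1, 2, 3, 0, 1])  (padded by repetition)
np.resize(np.arange(4), 3)   # -> array([0, 1, 2])           (truncated)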

This returns, for Gmsh 3.0.6:

 PID = 1; 192 elements on processor 1 of 10
 PID = 2; 186 elements on processor 2 of 10
 PID = 3; 239 elements on processor 3 of 10
 PID = 4; 192 elements on processor 4 of 10
 PID = 5; 196 elements on processor 5 of 10
 PID = 6; 227 elements on processor 6 of 10
 PID = 7; 215 elements on processor 7 of 10
 PID = 8; 199 elements on processor 8 of 10
 PID = 9; 210 elements on processor 9 of 10
 PID = 0; 180 elements on processor 0 of 10
[[32. 32. 32. 32. 32. 32. 32. 32. 32. 32.]
 [35. 35. 35. 35. 35. 35. 35. 35. 35. 35.]
 [39. 39. 39. 39. 39. 39. 39. 39. 39. 39.]
 [46. 46. 46. 46. 46. 46. 46. 46. 46. 46.]
 [55. 55. 55. 55. 55. 55. 55. 55. 55. 55.]
 [65. 65. 65. 65. 65. 65. 65. 65. 65. 65.]
 [74. 74. 74. 74. 74. 74. 74. 74. 74. 74.]
 [81. 81. 81. 81. 81. 81. 81. 81. 81. 81.]
 [85. 85. 85. 85. 85. 85. 85. 85. 85. 85.]
 [88. 88. 88. 88. 88. 88. 88. 88. 88. 88.]]

For Gmsh 4.11.1:

Warning : Saving a partitioned mesh in a format older than 4.0 may cause information loss

 PID = 1; 197 elements on processor 1 of 10
 PID = 2; 241 elements on processor 2 of 10
 PID = 3; 207 elements on processor 3 of 10
 PID = 4; 194 elements on processor 4 of 10
 PID = 5; 199 elements on processor 5 of 10
 PID = 6; 185 elements on processor 6 of 10
 PID = 7; 194 elements on processor 7 of 10
 PID = 8; 256 elements on processor 8 of 10
 PID = 9; 196 elements on processor 9 of 10
 PID = 0; 223 elements on processor 0 of 10
[[30. 32. 32. 35. 36. 41. 48. 48. 56. 31.]
 [32. 32. 37. 37. 41. 44. 49. 53. 61. 33.]
 [44. 37. 36. 43. 42. 52. 50. 58. 60. 33.]
 [36. 36. 40. 41. 42. 47. 49. 58. 67. 38.]
 [50. 40. 52. 43. 55. 50. 54. 61. 69. 57.]
 [54. 55. 68. 65. 66. 71. 75. 75. 82. 55.]
 [60. 65. 70. 74. 76. 81. 81. 84. 85. 57.]
 [66. 65. 74. 73. 79. 79. 79. 82. 82. 58.]
 [67. 69. 74. 75. 80. 80. 81. 82. 83. 71.]
 [77. 78. 85. 83. 83. 88. 84. 85. 87. 90.]]
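As a diagnostic sketch (using mesh and parallelComm from the script above), one way to check whether the scrambling comes from the gather order rather than from the solve itself is to print the gathered x coordinates and see whether they follow the serial sequence:

# if np.reshape is a valid way to unflatten T.globalValue, the gathered
# x coordinates should step through the same pattern as in the serial run
xg = mesh.cellCenters.globalValue[0]
if parallelComm.procID == 0:
    print(np.round(xg[:20] / 0.001, 2))   # first 20 x positions, in cell widths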

guyer commented 7 months ago

Thanks for the detailed report.

There are two issues going on here:

Using either Trilinos or PETSc, making these two changes gives

 PID = 1; 239 elements on processor 1 of 8
 PID = 2; 226 elements on processor 2 of 8
 PID = 3; 232 elements on processor 3 of 8
 PID = 4; 225 elements on processor 4 of 8
 PID = 5; 229 elements on processor 5 of 8
 PID = 7; 226 elements on processor 7 of 8
 PID = 6; 238 elements on processor 6 of 8
 PID = 0; 239 elements on processor 0 of 8
[[32. 32. 32. 32. 32. 32. 32. 32. 32. 32.]
 [35. 35. 35. 35. 35. 35. 35. 35. 35. 35.]
 [39. 39. 39. 39. 39. 39. 39. 39. 39. 39.]
 [46. 46. 46. 46. 46. 46. 46. 46. 46. 46.]
 [55. 55. 55. 55. 55. 55. 55. 55. 55. 55.]
 [65. 65. 65. 65. 65. 65. 65. 65. 65. 65.]
 [74. 74. 74. 74. 74. 74. 74. 74. 74. 74.]
 [81. 81. 81. 81. 81. 81. 81. 81. 81. 81.]
 [85. 85. 85. 85. 85. 85. 85. 85. 85. 85.]
 [88. 88. 88. 88. 88. 88. 88. 88. 88. 88.]]
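For illustration only, here is a minimal sketch of two such changes, reusing the names from the script at the top of the thread; it is a hypothetical reconstruction, not the verbatim fix, and it assumes the culprits are the processor-local builtin max() and the gather order of T.globalValue, and that CellVariable.max() performs a global reduction:

# change 1: use CellVariable.max(), which reduces across processors,
# instead of the builtin max(X), which only sees the local cells
T.setValue(100, where=(X >= 0.5 * X.max()))

# change 2: reorder the gathered values by the gathered cell centers
# rather than assuming T.globalValue is in lexical grid order
Tg = T.globalValue
xg, yg, zg = mesh.cellCenters.globalValue
if parallelComm.procID == 0:
    order = np.lexsort((zg, yg, xg))   # x varying slowest, as in the serial output
    Tr = Tg[order].reshape((10, 10, 10))
    sys.stdout.write("\n" + str(np.round(Tr[:, :, 9], 0)) + "\n")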
guyer commented 7 months ago

Furthermore, the FiPy documentation notes that "PyTrilinos on conda-forge presently only provides 12.10, which is limited to Python 2.x.", whilst the above environment was run with Python 3.7.12, PyTrilinos 12.18.1 and was installed via conda-forge.

Thank you for pointing this out. The PyTrilinos feedstock on conda-forge still lags well behind current Python versions, but you're correct that it installs under Python 3.7 (and even 3.8 on some platforms). I'll fix the documentation (#1018).

Edward-Rawson commented 7 months ago

Well, this turned out to be a much simpler problem: me! I clearly have some more learning to do. Thanks for taking the time to explain; this would've taken me a while to find 👍. Good to know about the in-built slicing as well.

guyer commented 7 months ago

I would classify both of these issues as subtle. I appreciate you asking about them.