idaholab / raven

RAVEN is a flexible and multi-purpose probabilistic risk analysis, validation and uncertainty quantification, parameter optimization, model reduction and data knowledge-discovering framework.
https://raven.inl.gov/
Apache License 2.0

[TASK] Issue finding tensorflow during Install RAVEN libraries for Mac M2 #2158

Closed yoshiurr-INL closed 7 months ago

yoshiurr-INL commented 1 year ago

Under Discussion Topic

Machine Specification
Equipment: MacBook Pro
OS: Ventura 13.5
Processor: Apple M2 Max

Summary of the topic to be discussed with the development team

While installing the RAVEN libraries using "--install", pip cannot find a tensorflow version that satisfies the requirement tensorflow==2.10.*

When trying to use "--mamba" instead, the installation process does not start.

Describe the solution you'd like to be implemented

Identify whether this issue is common for Mac systems. Identify whether this issue is common for M1 and M2 chips.

Describe alternatives you've considered

Maybe conda installing tensorflow?


For Change Control Board: Issue Review

This review should occur before any development is performed as a response to this issue.


For Change Control Board: Issue Closure

This review should occur when the issue is imminently going to be closed.

joshua-cogliati-inl commented 1 year ago

Hm, if you change the line in the dependencies.xml from: <tensorflow source="pip" os='mac,linux'>2.10</tensorflow> to <tensorflow os='mac,linux'>2.10</tensorflow> does it install?

(Note that we do not currently have automated testing on arm64)
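As a rough illustration of the edit suggested above, here is a minimal Python sketch that drops the source="pip" attribute from the mac/linux tensorflow entry. The sample XML is a hypothetical excerpt (the `<dependencies>` root element name in particular is an assumption), and the function name `drop_pip_source` is made up for this example. It is not part of RAVEN's scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt mirroring the line quoted above; the real
# dependencies.xml has many more entries and the root tag is assumed.
SAMPLE = """<dependencies>
  <main>
    <numpy>1.22</numpy>
    <tensorflow source="pip" os='mac,linux'>2.10</tensorflow>
    <tensorflow source="pip" os='windows'>2.10</tensorflow>
  </main>
</dependencies>"""


def drop_pip_source(xml_text, package="tensorflow", os_filter="mac,linux"):
    """Remove source="pip" from entries of `package` whose os attribute matches."""
    root = ET.fromstring(xml_text)
    for node in root.iter(package):
        if node.get("os") == os_filter and node.get("source") == "pip":
            del node.attrib["source"]
    # encoding="unicode" makes tostring return a str instead of bytes
    return ET.tostring(root, encoding="unicode")


edited = drop_pip_source(SAMPLE)
print(edited)
```

Only the mac/linux entry is touched; the windows entry keeps its pip source.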

wanghy-anl commented 1 year ago

@joshua-cogliati-inl Joshua, I found the identical issue on my M1 MacBook Pro 13 inch (OS: Ventura 13.5; Processor: Apple M1), just like Ramon experienced.

I tried to edit the dependencies.xml as you suggested, and the conda environment can be established by ./scripts/establish_conda_env.sh --install.

However, after ./build_raven and ./run_tests -j4, 23 tests are marked as "Diff" or "Failed". See the attached log file.

Haoyu

log_run_test_j4_20230802.log

joshua-cogliati-inl commented 1 year ago

Okay, so we can install it if we switch tensorflow back to conda-forge, but it fails some tests. I think the correct solution for this is probably to switch to a newer version of tensorflow.

wanghy-anl commented 1 year ago

Thanks Joshua. Let me know if you have any candidate versions in mind. I can test on my M1 machine (it has been idle recently).

joshua-cogliati-inl commented 1 year ago

Tensorflow 2.12 and 2.13 might be worth trying.

joshua-cogliati-inl commented 1 year ago

I started testing tensorflow 2.12 in https://github.com/idaholab/raven/pull/2138 but we need a few updates for it.

wanghy-anl commented 1 year ago

@joshua-cogliati-inl, here are the results:

Using 2.12 (I modified Line 49 of dependencies.xml to <tensorflow os='mac,linux'>2.12</tensorflow>): the conda environment can be established, but there are 14 Failed tests and 16 Diff tests; see the log below. log_run_test_j4_tensorflow_2_12_2023AUG03.log

Using 2.13 (only available through the pip channel; I modified Line 49 of dependencies.xml to <tensorflow source="pip" os='mac,linux'>2.13</tensorflow>): the conda environment can be established, but there are 673 Failed tests; see the log below. log_run_test_j4_tensorflow_2_13_2023AUG03.log

joshua-cogliati-inl commented 1 year ago

Hm, for 2.13, something is being done incorrectly:

ImportError: Failed to import grpc on Apple Silicon. On Apple Silicon machines, try `pip uninstall grpcio; conda install grpcio`. Check out https://docs.ray.io/en/master/ray-overview/installation.html#m1-mac-apple-silicon-support for more details.

wanghy-anl commented 1 year ago

Is there anything we can do within raven's establish_conda_env.sh script?

joshua-cogliati-inl commented 1 year ago

It might be worth adding 'grpcio' as a conda dependency and see if that solves it.

joshua-cogliati-inl commented 1 year ago

Otherwise, yes, we might need to modify establish_conda_env.sh

wanghy-anl commented 1 year ago

I added <grpcio/> to dependencies.xml, and the conda environment can be established, but there are 14 failed and 16 diff tests. See the dependencies.xml and log attached. dependencies_and_log_2023AUG04.zip

joshua-cogliati-inl commented 1 year ago

It looks like a bunch of the diffs and failures are caused by the tensorflow update. So that is probably the first thing that we need to fix.

wanghy-anl commented 1 year ago

Joshua, let me know when you need to test the fix. I can do the test on M1 chip.

joshua-cogliati-inl commented 1 year ago

For future reference, these are the changes made to dependencies.xml compared to current devel (scipy is actually updated by a devel change, so we probably do not need to downgrade scipy, also smt was added in devel as well):

--- dependencies.xml    2023-08-28 10:20:41.567497521 -0600
+++ /tmp/.fr-NTKHA2/dependencies.xml    2023-08-04 08:39:21.000000000 -0600
@@ -37,7 +37,7 @@
   <main>
     <h5py/>
     <numpy>1.22</numpy>
-    <scipy>1.9</scipy>
+    <scipy>1.7</scipy>
     <scikit-learn>1.0</scikit-learn>
     <pandas/>
     <!-- Note most versions of xarray work, but some (such as 0.20) don't -->
@@ -46,8 +46,9 @@
     <matplotlib>3.5</matplotlib>
     <statsmodels>0.13</statsmodels>
     <cloudpickle>2.2</cloudpickle>
-    <tensorflow source="pip" os='mac,linux'>2.10</tensorflow>
-    <tensorflow source="pip" os='windows'>2.10</tensorflow>
+    <tensorflow source="pip" os='mac,linux'>2.13</tensorflow>
+    <tensorflow source="pip" os='windows'>2.13</tensorflow>
+    <grpcio/>
     <!-- conda is really slow on windows if the version is not specified.-->
     <python skip_check='True' os='windows'>3.8</python>
     <python skip_check='True' os='mac,linux'>3</python>
@@ -70,7 +71,6 @@
     <!-- redis is needed by ray, but on windows, this seems to need to be explicitly stated -->
     <redis source="pip" os='windows'/>
     <imageio source="pip">2.22</imageio>
-    <smt/>
     <line_profiler optional='True'/>
     <!-- <ete3 optional='True'/> -->
     <pywavelets optional='True'>1.1</pywavelets>

wanghy-anl commented 1 year ago

Joshua, is this dependencies.xml in any branch? I can give it a try if you can point me to the correct branch.

joshua-cogliati-inl commented 1 year ago

I just used the dependencies.xml file you included in your zip file, and I also just updated the https://github.com/idaholab/raven/pull/2138 with 2.13 instead of 2.12

wanghy-anl commented 1 year ago

Thanks, I will wait until #2138 gets merged and then test it on M1 chip.

joshua-cogliati-inl commented 1 year ago

It is on my joshua-cogliati-inl:tensorflow_212 branch that #2138 uses, it would be useful to know if it fixes things on the M1 chip.

wanghy-anl commented 1 year ago

Thanks Joshua, Let me give it a try on M1 chip tonight or tomorrow. I will attach the log file here.

joshua-cogliati-inl commented 1 year ago

FYI: If anyone uses the diff for the dependencies.xml, do not remove smt since that will cause newer versions of RAVEN to fail.

joshua-cogliati-inl commented 1 year ago

On further investigation, smt does not seem to be available for macos amd64: https://pypi.org/project/smt/#files so we probably do need to change <smt/> to <smt optional='True'/> and put the imports that use smt into try/except blocks.
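A minimal sketch of what guarding an optional import could look like in Python. The helper `optional_import` and the function `build_surrogate` are hypothetical names for illustration only; they are not taken from the RAVEN code base, and the actual change in RAVEN may be structured differently.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here as well.
        return None


# 'smt' is the optional dependency discussed above; it may or may not be
# installed on a given machine.
smt = optional_import("smt")


def build_surrogate():
    """Hypothetical feature that needs smt; it fails only when actually used."""
    if smt is None:
        raise RuntimeError("smt is not installed; this feature is unavailable")
    return smt
```

With this pattern the rest of the framework can still be imported on platforms where smt has no package, and only the features that need it raise an error.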

wanghy-anl commented 1 year ago

Josh, you were correct. I deleted <smt/> in the attached dependencies_a.xml and 694 tests failed on M1 chip. See attached Log_Sep05_2023_a.log. So I re-added <smt source='pip'/> in the attached dependencies_b.xml and it runs better. 19 tests failed. See attached Log_Sep05_2023_b.log. Sep_5_2022_Trials.zip

joshua-cogliati-inl commented 1 year ago

Some errors I saw:

File ".../raven/ravenframework/Optimizers/acquisitionFunctions/AcquisitionFunction.py", line 138, in conductAcquisition
    res = sciopt.differential_evolution(optFunc, bounds=self._bounds, polish=self._polish, maxiter=self._maxiter, tol=self._tol,
TypeError: differential_evolution() got an unexpected keyword argument 'vectorized'

File ".../python3.10/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import
ImportError: dlopen(.../python3.10/site-packages/netCDF4/_netCDF4.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_nc_close'

libc++abi: terminating due to uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::overflow_error>>: Error in function ibeta_derivative<e>(e,e,e): Overflow Error

Also, a bunch of diffs.

I think it is worth trying netcdf 1.6 to see if that fixes the netcdf errors. I think the floating-point hardware must be a bit different, which is causing the overflow error and some of the diffs.
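The 'vectorized' TypeError looks like a version mismatch: as far as I know, SciPy added the `vectorized` argument of differential_evolution in 1.9, while the dependencies.xml diff earlier in this thread pins scipy to 1.7. One way to diagnose or work around this kind of mismatch is to check a function's signature before passing a newer keyword. The sketch below uses stand-in functions (`old_optimizer`, `new_optimizer`) rather than SciPy itself, and `supported_kwargs` is a hypothetical helper, not RAVEN code.

```python
import inspect


def supported_kwargs(func, **kwargs):
    """Keep only the keyword arguments that func's signature accepts."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # func takes **kwargs, so pass everything through
    return {k: v for k, v in kwargs.items() if k in params}


# Stand-ins for an older and a newer optimizer signature (hypothetical).
def old_optimizer(func, bounds, tol=0.01):
    return "old"


def new_optimizer(func, bounds, tol=0.01, vectorized=False):
    return "new"


opts = {"tol": 1e-3, "vectorized": True}
old_opts = supported_kwargs(old_optimizer, **opts)  # 'vectorized' is dropped
new_opts = supported_kwargs(new_optimizer, **opts)  # both keywords survive
print(old_opts, new_opts)
```

Silently dropping a keyword changes behaviour, so in practice pinning a new enough scipy is probably the cleaner fix; the check is mainly useful for producing a clear error message.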

wangcj05 commented 1 year ago

[like] Congjian Wang reacted to your message:



joshua-cogliati-inl commented 1 year ago

So apparently the remaining errors are:

FAILED:
Diff tests/framework/redundantInputs
Diff tests/framework/NDGridProbabilityWeightValue
Diff tests/framework/CodeInterfaceTests/CobraTF/test3
Diff tests/framework/pca_sparseGridCollocation/polyCorrelation
Diff tests/framework/PostProcessors/LimitSurface/testLimitSurfaceIntegralPPWithBoundingError
Diff tests/framework/Optimizers/GeneticAlgorithms/simionescuConstrainedInvLin
Diff tests/framework/Samplers/SparseGrid/normal
Failed tests/framework/Samplers/SparseGrid/betanorm
Failed tests/framework/Samplers/SparseGrid/beta
Diff tests/framework/Samplers/SparseGrid/triangular
Diff tests/framework/pca_adaptive_sgc/test_adaptive_sgc_poly_pca_analytic

PASSED: 778
SKIPPED: 93
FAILED: 11

I think a lot of those come from differences in how arm64 and amd64 handle floating-point numbers. (From what I have seen online, I think basic arithmetic (+, -, *, /) gives the same results, but conversions from floating point to integer and back are different, as are functions like sin, which will eventually produce differences.)
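To illustrate why last-digit differences between platforms show up as "Diff" results when outputs are compared exactly, here is a small Python sketch of a tolerance-based comparison. The function `values_match` and the tolerances are assumptions chosen for the example; this is not how RAVEN's test differ is implemented.

```python
import math


def values_match(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two floats with a tolerance instead of exact equality."""
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)


# A reference value as it might have been produced on one platform.
gold = math.sin(1.0e6)

# Mimic another platform's math library differing in the last few digits.
shifted = gold * (1.0 + 1e-13)

print(gold == shifted)              # exact comparison
print(values_match(gold, shifted))  # tolerance-based comparison
```

A relative tolerance handles values of any magnitude, while the small absolute tolerance covers results that are very close to zero.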

wangcj05 commented 1 year ago

[like] Congjian Wang reacted to your message:



alfoa commented 1 year ago

Just FYI: (on M2, I had to download and "pip install" smt directly from https://github.com/SMTorg/SMT)

joshua-cogliati-inl commented 1 year ago

@alfoa Yes, we are discussing smt at: https://github.com/idaholab/raven/pull/2138#discussion_r1337680697

wangcj05 commented 1 year ago

This issue is partly addressed by PR #2138

joshua-cogliati-inl commented 11 months ago

It looks like #2201 fixed the beta Sampler problems:

(49/69) Success(  2.87sec)tests/framework/Samplers/SparseGrid/beta
(50/69) Success(  2.90sec)tests/framework/Samplers/SparseGrid/betanorm

Update: And for that matter all the RAVEN tests currently pass on Mac OS amd64:

PASSED: 794
SKIPPED: 95
FAILED: 0
 ... RAVEN tests passed successfully.

wangcj05 commented 11 months ago

[like] Congjian Wang reacted to your message:



wangcj05 commented 7 months ago

It seems this issue has been resolved.