patrickbryant1 / SpeedPPI

Rapid protein-protein interaction network creation from multiple sequence alignments with Deep Learning

Notes regarding setup #10

Closed GaryAitken closed 1 year ago

GaryAitken commented 1 year ago

Given the rate at which Python modules are updated, I understand it's difficult to keep things "current". I recently created a new Google Compute Engine instance with one A100 40GB GPU (machine type a2-highgpu-1g: 12 vCPUs, 85GB). I was told that the OS config did not support GPUs and asked whether I wanted to switch to a supported version. Answering yes, I ended up with what I presume is the "default" config, a Debian 10 image. Subsequent testing revealed that the Debian 10 environment ships ptxas version 11.0.221, which is

 "older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors."

It appeared necessary to select an OS version other than the default, in this case "Debian 11 based Deep Learning VM with M109 and CUDA 11.3" to get an appropriate environment. If this is now a requirement, it would be good to note that and suggest running "ptxas --version" to check. It should look something like this:

$ ptxas --version
  built Mon_May__3_19:14:31_PDT_2021
  release 11.3, V11.3.109
  cuda_11.3.r11.3/compiler.29920130_0
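The manual check above could also be scripted. The following is only a minimal sketch: it assumes ptxas is on the PATH and that its output contains a "release X.Y" line like the one shown, and the helper names (parse_ptxas_release, ptxas_is_new_enough) are made up for illustration, not part of SpeedPPI.

```python
import re
import subprocess

# Threshold taken from the XLA warning quoted above ("older than 11.1").
MIN_PTXAS = (11, 1)


def parse_ptxas_release(output):
    """Return (major, minor) from a 'release X.Y' line, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def ptxas_is_new_enough(output, minimum=MIN_PTXAS):
    """True if the reported release is at least `minimum`."""
    release = parse_ptxas_release(output)
    return release is not None and release >= minimum


if __name__ == "__main__":
    try:
        # Assumption: ptxas is installed and reachable on the PATH.
        result = subprocess.run(
            ["ptxas", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        raise SystemExit(f"could not run ptxas: {exc}")
    if ptxas_is_new_enough(result.stdout):
        print("ptxas release looks new enough:", parse_ptxas_release(result.stdout))
    else:
        print("ptxas is older than 11.1 (or its version could not be parsed)")
```

Run on the Debian 10 image described above, this would be expected to report the "older than 11.1" case, assuming that ptxas prints its release in the same format.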

Subsequently, the supplied "speed_ppi.yml" created a conda environment as follows:

  python          3.9.17
  jax             0.3.25
  ml-collections  0.1.1
  dm-haiku        0.0.9
  pandas          1.4.4
  biopython       1.79
  chex            0.0.7
  dm-tree         0.1.8     (>=0.1.6)
  immutabledict   3.0.0     (>=2.0.0)
  numpy           1.24.3    (>=1.19.3)
  scipy           1.11.1    (>=1.9.0)
  tensorflow-cpu  2.13.0    (>= 2.12.0)

This resulted in errors when running some_vs_some:

AttributeError: module 'numpy' has no attribute 'int' ...
  The aliases was originally deprecated in NumPy 1.20

Using trial and error and seat-of-the-pants guessing, I tweaked speed_ppi.yml to create the following environment, which seemed to work:

  python          3.9.17
  jax             0.3.25
  ml-collections  0.1.1
  dm-haiku        0.0.9
  pandas          1.4.4
  biopython       1.79
  chex            0.0.7
  dm-tree         0.1.8     (>=0.1.6)
  immutabledict   2.0.0     (==2.0.0)
  numpy           1.22.4    (==1.22.4)
  scipy           1.11.1    (>=1.9.0)
  tensorflow-cpu  2.12.0    (== 2.12.0)

It appears that, at least for some modules, ">=" allows incompatible combinations: numpy in particular (the np.int alias that triggers the error above was removed in NumPy 1.24), and possibly immutabledict and tensorflow-cpu. (I did not try all combinations of numpy, immutabledict and tensorflow-cpu; I stopped when I found the above combination that worked.)
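To see quickly whether an existing environment has drifted from the combination that worked, something like the following could be run inside the conda environment. It is a rough sketch, not part of SpeedPPI: the KNOWN_GOOD table simply copies the three pinned versions from the list above, and find_mismatches is a hypothetical helper name.

```python
from importlib import metadata

# Pinned versions copied from the working environment listed above.
KNOWN_GOOD = {
    "numpy": "1.22.4",
    "immutabledict": "2.0.0",
    "tensorflow-cpu": "2.12.0",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(KNOWN_GOOD)
    for name, (wanted, installed) in problems.items():
        print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
    if not problems:
        print("all pinned packages match")
```

The get_version parameter exists only so the comparison can be exercised without touching the real environment; by default it queries the installed distributions.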

patrickbryant1 commented 1 year ago

Hi,

This seems to be related to the OS on your Google Compute Engine instance. Note that the instructions here will not work with all OSes. There are many more 'errors' that can occur when setting up GPU instructions for your machine, and we can't possibly list all of them. It is up to the user to choose a suitable installation option from the ones we supply (which is why we offer different ways of installation).

This is a part of learning to code/do bioinformatics and we hope that you will find this lesson useful for the future.