plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Segmentation fault using tensorflow or pytorch #67

Closed guanh01 closed 4 years ago

guanh01 commented 4 years ago

Just out of curiosity, I tried to profile a TensorFlow script and a PyTorch script using Scalene, but got segmentation faults for both. The Python scripts come from the TensorFlow and PyTorch tutorials.

Environment:

- python: 3.7
- tensorflow: 2.2
- pytorch: 1.5
- scalene: installed using homebrew
- System: macOS Catalina version 10.15.5

Below are the details to reproduce the error:

tensorflow

    import tensorflow as tf

    mnist = tf.keras.datasets.mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10)
    ])

    predictions = model(x_train[:1]).numpy()
    print("predictions", predictions)

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test, y_test, verbose=2)


- I can successfully execute the script:

```
(base) ➜  scalene git:(master) ✗ python ./test/tf-keras.py
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
2020-07-11 22:48:31.990954: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-11 22:48:32.014027: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fd41b7a03b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-11 22:48:32.014048: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
predictions [[-0.27964824  0.78479844 -0.39851144  0.14115062  0.09268872 -0.1322664
   0.04173797 -0.04924813 -0.10641377  0.1781306 ]]
Epoch 1/5
1875/1875 [==============================] - 1s 717us/step - loss: 0.3024 - accuracy: 0.9126
Epoch 2/5
1875/1875 [==============================] - 1s 714us/step - loss: 0.1416 - accuracy: 0.9576
Epoch 3/5
1875/1875 [==============================] - 1s 708us/step - loss: 0.1070 - accuracy: 0.9674
Epoch 4/5
1875/1875 [==============================] - 1s 699us/step - loss: 0.0881 - accuracy: 0.9731
Epoch 5/5
1875/1875 [==============================] - 1s 708us/step - loss: 0.0749 - accuracy: 0.9766
313/313 - 0s - loss: 0.0748 - accuracy: 0.9766
```

Similarly for the PyTorch script:

    import random
    import torch


    class DynamicNet(torch.nn.Module):
        def __init__(self, D_in, H, D_out):
            """
            In the constructor we construct three nn.Linear instances that we will
            use in the forward pass.
            """
            super(DynamicNet, self).__init__()
            self.input_linear = torch.nn.Linear(D_in, H)
            self.middle_linear = torch.nn.Linear(H, H)
            self.output_linear = torch.nn.Linear(H, D_out)

        def forward(self, x):
            """
            For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
            and reuse the middle_linear Module that many times to compute hidden layer
            representations.

            Since each forward pass builds a dynamic computation graph, we can use normal
            Python control-flow operators like loops or conditional statements when
            defining the forward pass of the model.

            Here we also see that it is perfectly safe to reuse the same Module many
            times when defining a computational graph. This is a big improvement from Lua
            Torch, where each Module could be used only once.
            """
            h_relu = self.input_linear(x).clamp(min=0)
            for _ in range(random.randint(0, 3)):
                h_relu = self.middle_linear(h_relu).clamp(min=0)
            y_pred = self.output_linear(h_relu)
            return y_pred


    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Create random Tensors to hold inputs and outputs
    x = torch.randn(N, D_in)
    y = torch.randn(N, D_out)

    # Construct our model by instantiating the class defined above
    model = DynamicNet(D_in, H, D_out)

    # Construct our loss function and an Optimizer. Training this strange model with
    # vanilla stochastic gradient descent is tough, so we use momentum
    criterion = torch.nn.MSELoss(reduction='sum')
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

    for t in range(500):
        # Forward pass: Compute predicted y by passing x to the model
        y_pred = model(x)

        # Compute and print loss
        loss = criterion(y_pred, y)
        if t % 100 == 99:
            print(t, loss.item())

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
- The script runs successfully on its own:

    (base) ➜  scalene git:(master) ✗ python ./test/torch-dynamic-model.py
    99 38.210121154785156
    199 0.7706254720687866
    299 2.6024699211120605
    399 0.5532416701316833
    499 0.3656597137451172


- But I got a segmentation fault when profiling the script:

    (base) ➜  scalene git:(master) ✗ scalene test/torch-dynamic-model.py
    /usr/local/bin/scalene: line 3: 23709 Segmentation fault: 11  DYLD_INSERT_LIBRARIES=/usr/local/Cellar/libscalene/HEAD-a49f5ca/lib/libscalene.dylib PYTHONMALLOC=malloc python3 -m scalene "$@"
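
As a sanity check (a sketch based on the wrapper command above, not part of the original report), the crash can be narrowed down by skipping the `DYLD_INSERT_LIBRARIES` injection and running the Scalene module directly; if this succeeds, the fault lies in libscalene.dylib:

```
# Same invocation the wrapper performs, minus the injected libscalene.dylib.
PYTHONMALLOC=malloc python3 -m scalene test/torch-dynamic-model.py
```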

GammaPi commented 4 years ago

The TensorFlow script seems to work on Linux.

Environment:

- python: Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19)
- tensorflow: 1.7.8 (CPU)
- scalene: installed by pip
- System: Ubuntu 18.04 (memory profiling not enabled)

[Screenshot from 2020-07-12 16-58-42]

GammaPi commented 4 years ago

The PyTorch script fails, though:

[Screenshot from 2020-07-12 17-07-14]

    Fatal Python error: GC object already tracked

    Thread 0x00007f6d7e613700 (most recent call first):

    Thread 0x00007f6d7ee14700 (most recent call first):

    Current thread 0x00007f6ea0975240 (most recent call first):
      File "test.py", line 31 in forward
      File "/usr/local/share/anaconda3/envs/Tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550 in __call__
      File "test.py", line 55 in <module>
      File "/usr/local/share/anaconda3/envs/Tensorflow/lib/python3.6/site-packages/scalene/scalene.py", line 1394 in main
      File "/usr/local/share/anaconda3/envs/Tensorflow/lib/python3.6/site-packages/scalene/__main__.py", line 4 in main
      File "/usr/local/share/anaconda3/envs/Tensorflow/lib/python3.6/site-packages/scalene/__main__.py", line 7 in <module>
      File "/usr/local/share/anaconda3/envs/Tensorflow/lib/python3.6/runpy.py", line 85 in _run_code
      File "/usr/local/share/anaconda3/envs/Tensorflow/lib/python3.6/runpy.py", line 193 in _run_module_as_main

emeryberger commented 4 years ago

Thanks for the report, @guanh01! :)

I tracked down the issue(s) and have successfully run both programs on Mac OS and Linux. You can get the latest version via `pip install -U scalene`. (A note to you and @GammaPi: by default, Scalene now runs with memory profiling enabled; you have to explicitly disable it with `--cpu-only`. CPU-only mode is faster, but of course does not give you memory usage info.)
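
For reference, the commands described above (script path taken from this issue's reproduction steps):

```
# Upgrade to the latest Scalene release.
pip install -U scalene

# Profile with memory profiling (now enabled by default).
scalene test/torch-dynamic-model.py

# Faster CPU-only profiling; no memory usage info.
scalene --cpu-only test/torch-dynamic-model.py
```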

The issues were as follows:

  1. There was a race in the heap, so I have locked the entire heap for now. I have plans to make the sampling mechanism scale, which is actually somewhat tricky, so I have not yet implemented that. For now, unfortunately, this will degrade the performance of third-party libraries that run with multiple threads.
  2. TensorFlow, at least, appears to be freeing memory that was never actually allocated via the memory-profiling library; I added checks to discard such free calls. Unfortunately, this may leak memory (though possibly avoidably).
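
A minimal Python sketch of the guard described in point 2 (purely illustrative; the real logic lives in the C++ libscalene allocator, and the names here are made up): track which pointers the profiling allocator handed out, and silently discard frees of anything else.

```
class GuardedAllocator:
    """Illustrative sketch only: ignore frees of pointers we never allocated,
    at the cost of possibly leaking that memory."""

    def __init__(self, real_malloc, real_free):
        self._malloc = real_malloc   # underlying allocation function
        self._free = real_free       # underlying free function
        self._live = set()           # addresses handed out through us

    def malloc(self, size):
        ptr = self._malloc(size)
        self._live.add(ptr)
        return ptr

    def free(self, ptr):
        if ptr not in self._live:
            # Not allocated through us (e.g. allocated before interposition,
            # as TensorFlow appears to do): drop the free instead of
            # corrupting the heap. This can leak memory.
            return
        self._live.remove(ptr)
        self._free(ptr)
```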