sbrunk / storch

GPU accelerated deep learning and numeric computing for Scala 3.
https://storch.dev
Apache License 2.0

Weird tensor behavior #25

Closed davoclavo closed 1 year ago

davoclavo commented 1 year ago

Hello! I have been enjoying implementing a couple of different models using storch. However, after getting into more complicated models, I stumbled upon a weird behavior where Tensors end up containing corrupted data. It apparently happens more often when I create a large number of Tensors from Scala values, but I am still not sure why or how it is happening. While I investigate, I wanted to share a simple scenario to replicate the bug:

package torch

case object WeirdTest {
  val tensors = 1.to(100).map { _ =>
    Tensor(1L.to(10000L).toArray)
  }
  def main(args: Array[String]): Unit = {
    println(tensors)
  }
}
sbt 'runMain torch.WeirdTest'

[info] Vector(tensor dtype=int64, shape=[10000], device=CPU
[info] [1, 2, 3, ..., 9998, 9999, 10000], tensor dtype=int64, shape=[10000], device=CPU
[info] [154000347366236, 140, 68719477348, ..., 1753600, 2297117284935663616, 0], tensor dtype=int64, shape=[10000], device=CPU
[info] [140432769199616, 32768, 0, ..., 0, 0, 0], tensor dtype=int64, shape=[10000], device=CPU
[info] [0, 0, 0, ..., 140432789977264, 140432789977368, 140432789977472], tensor dtype=int64, shape=[10000], device=CPU
[info] [0, 0, 0, ..., 140432769394640, 8589934592, 140432769394672], tensor dtype=int64, shape=[10000], device=CPU
[info] [0, 0, 0, ..., 140432789289736, 140432789288520, 140432789291272], tensor dtype=int64, shape=[10000], device=CPU
[info] [140432295821312, 67568, 140432767411688, ..., 9998, 9999, 10000], tensor dtype=int64, shape=[10000], device=CPU
[info] [140432790755296, 140432790755296, 140432790755296, ..., 9998, 9999, 10000], tensor dtype=int64, shape=[10000], device=CPU
[info] [140432767168000, 35536, 140432767168416, ..., 140432767174032, 140432767174232, 140432767174432], tensor dtype=int64, shape=[10000], device=CPU

NOTE: every tensor should contain the numbers 1 to 10000 as int64, but some of them contain really large garbage values instead.

If I reduce the element count to 1000, the Tensors are generated properly:

[info] Vector(tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU
[info] [1, 2, 3, ..., 998, 999, 1000], tensor dtype=int64, shape=[1000], device=CPU

Extra information:

sbt -v
[sbt_options] declare -a sbt_options=()
[process_args] java_version = '19'
[copyRt] java9_rt = '/Users/david/.sbt/1.0/java9-rt-ext-azul_systems__inc__19_0_1/rt.jar'
# Executing command line:
/nix/store/29jh3mjhc8r3rx5hq1cn8nhl77p9z4qc-zulu19.30.11-ca-jdk-19.0.1/bin/java
-Dfile.encoding=UTF-8
-Xms1024m
-Xmx1024m
-Xss4M
-XX:ReservedCodeCacheSize=128m
-Dsbt.script=/nix/store/1j4jqczlm8w31rnq52kpjynf9kb7f4y0-devenv-profile/bin/sbt
-Dscala.ext.dirs=/Users/david/.sbt/1.0/java9-rt-ext-azul_systems__inc__19_0_1
-jar
/nix/store/in00n8p773wp36r52j7a9kk7gx1pnjvm-sbt-1.8.2/share/sbt/bin/sbt-launch.jar

sbt:root> eval System.out.println("OS architecture: " + System.getProperty("os.arch"));
OS architecture: x86_64
davoclavo commented 1 year ago

As a side note, it would be great to be able to pass multi-dimensional arrays to the Tensor constructor :D I might give it a shot soon.

Perhaps the creation of many Tensors (needed to work around the missing multi-dimensional Array support) is uncovering some kind of race condition under the hood in javacpp-presets + libtorch.
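For reference, the workaround is to flatten the nested arrays and reshape afterwards, roughly like this (just a sketch; it assumes Tensor#reshape forwards to libtorch's reshape, and the nested-array input is a made-up example):

package torch

case object NestedArrayWorkaround {
  // Hypothetical 2-D input that we'd like to turn into a (2, 3) tensor.
  val rows: Array[Array[Long]] = Array(Array(1L, 2L, 3L), Array(4L, 5L, 6L))

  def main(args: Array[String]): Unit = {
    // Flatten into one contiguous array, build a 1-D tensor, then reshape it.
    val t = Tensor(rows.flatten).reshape(rows.length, rows.head.length)
    println(t) // expected: dtype=int64, shape=[2, 3]
  }
}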

darrenjw commented 1 year ago

Just some additional information: I have also had a lot of trouble with storch tensors containing incorrect values, non-deterministically. I.e. sometimes when I run the code it works correctly, giving output matching PyTorch, and other times it returns garbage. But I hadn't managed to create a nice minimal (semi-)reproducible example.

sbrunk commented 1 year ago

That's obviously quite bad. Thanks for looking into it @davoclavo, I'll try to investigate as well.

I've seen garbage a few times as well when playing with the API interactively in the REPL, but like @darrenjw I haven't been able to reliably reproduce it either.

I think this is also a good reason to improve the test coverage.

sbrunk commented 1 year ago

OK, I think I have a hunch. It could be that from_blob does not copy the buffer, so when the backing array/buffer is freed, we end up with garbage. If we clone the native tensor, it seems to work, at least in my little test:

https://github.com/sbrunk/storch/blob/3fb3c5d5b58ca51bdee2e64f15cb4c0299cbbd0a/core/src/main/scala/torch/Tensor.scala#L793-L799

        Tensor(
          torchNative
            .from_blob(
              pointer,
              Array(data.length.toLong),
              NativeConverters.tensorOptions(inputDType, layout, CPU, requiresGrad)
-            )
+            ).clone()
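And here's the little test I used to verify: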
case object WeirdTest {
  val range = 1L.to(10_000L).toSeq
  val tensors = 1.to(100).map { _ =>
    Tensor(range)
  }
  def main(args: Array[String]): Unit = {
    println(tensors)
    tensors.zipWithIndex.foreach { (tensor, index) =>
      println(s"$index: $tensor")
      assert(tensor.toSeq == range)
    }
  }
}
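If that hunch is right, the mechanism would be something like the following (only a sketch of the suspected lifetime hazard, not the actual storch internals; tensorOptions stands in for the options built via NativeConverters in Tensor.scala):

import org.bytedeco.javacpp.LongPointer
import org.bytedeco.pytorch.global.torch as torchNative

// from_blob wraps the given memory region without copying it:
val pointer = new LongPointer(1L, 2L, 3L) // native buffer managed from the JVM
// tensorOptions: placeholder for NativeConverters.tensorOptions(int64, ...)
val t = torchNative.from_blob(pointer, Array(3L), tensorOptions)
// `t` aliases `pointer`'s memory. Once `pointer` becomes unreachable, JavaCPP
// may deallocate the buffer while `t` still points at it, so later reads
// return whatever happens to reuse that memory. `.clone()` copies the data
// into memory owned by libtorch, which breaks the alias.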
davoclavo commented 1 year ago

@sbrunk @darrenjw Thanks for your input!

I tried adding .clone() and it seems to have fixed the problem! I also wrote a unit test that consistently replicates the issue and validates the fix. I can submit a PR soon with this change.
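The test is roughly along these lines (a sketch; the actual test in the PR may differ, and it assumes munit, which storch already uses for its test suite):

import torch.*

class TensorCorruptionSuite extends munit.FunSuite {
  test("tensors created from Scala values keep their data") {
    val range = 1L.to(10000L).toSeq
    val tensors = 1.to(100).map(_ => Tensor(range))
    System.gc() // encourage collection of the temporary native buffers
    tensors.zipWithIndex.foreach { (tensor, index) =>
      assertEquals(tensor.toSeq, range, s"tensor $index is corrupted")
    }
  }
}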