spark-redshift-community / spark-redshift

Performant Redshift data source for Apache Spark
Apache License 2.0
135 stars 62 forks source link

Thread Safety Issue with makeTempPath in Utils.scala #121

Closed dwilsonGW closed 1 year ago

dwilsonGW commented 1 year ago

In the Utils.scala the makeTempPath method updates a field variable lastTempPathGenerated which only exists for tests. However, in the event that the library is utilized by multiple threads this can cause the makeTempPath to return incorrect results. Tested out a simple fix in a forked repo to store the value to a local variable and return that instead of the field variable.

/**
   * Creates a randomly named temp directory path for intermediate data
   */
  def makeTempPath(tempRoot: String): String = {
    val _lastTempPathGenerated = Utils.joinUrls(tempRoot, UUID.randomUUID().toString)
    lastTempPathGenerated = _lastTempPathGenerated
    _lastTempPathGenerated
  }
ylut commented 1 year ago

Hi @dwilsonGW We are doing multithreaded writing to Redshift (Java pool submitting multiple spark write() commands for different datasets) and the data ends up being written to the wrong tables. For example, one dataset would appear in the temp folder of another paralleled write and in that write's manifest as well (2 files where each file is a different dataset). The weird thing is that it also correctly goes in the right table as well...

So to put it - some datasets end up both in the correct table, and in the wrong table.

Could this be related to your issue?

dwilsonGW commented 1 year ago

@ylut 100% related to this issue. We're doing something very similar and that lead us to finding this issue.

jsleight commented 1 year ago

@dwilsonGW can you take your fix and submit it as a PR? I'm happy to review and get it merged.

dwilsonGW commented 1 year ago

@jsleight sure thing