magicDGS / ReadTools

A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
https://magicdgs.github.io/ReadTools/
MIT License

DownloadDistmapResult fails with many parts #530

Open robmaz opened 5 years ago

robmaz commented 5 years ago

Hi Daniel, we have recently had problems downloading results with many parts. The download fails, claiming that a file does not exist. The part exists and is fine, though: it can be downloaded manually and inspected without problems. The failure also does not necessarily recur on the same part, e.g., four consecutive download attempts failed on parts 179, 165, 167 and 178. TMP_DIR was set to a partition with plenty of space, and we ran with -Xmx8g. We have never seen the problem with sets smaller than about 150 parts. What do you make of that? (Debug output of one download attempt is attached.)

download+merge.out.gz

magicDGS commented 5 years ago

@robmaz - thanks for the report. I think the error has more to do with HDFS, the cluster configuration and network overload than with ReadTools itself. I base this on two observations:

  1. If a part is scheduled for download, it has already been checked to exist. So ReadTools detected the file at startup, even though it could not be read later on.
  2. Re-running does not fail on the same part again. That suggests a problem with the network connection. How many people are using the cluster?

One option might be to decrease the --numberOfParts argument so that fewer batches are accessed. That is only a temporary workaround, though; I recommend looking at the network and HDFS configuration to fix the underlying problem. Let me know if that works.

On the ReadTools side, I will dig into where the problem occurs, and I might add an option to retry the read operation if it fails. However, I am afraid that this won't help if the network failure happens while a single file is being processed (which might sometimes be the case, as the failure looks non-deterministic).
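
To make the retry idea concrete, here is a minimal, language-agnostic sketch in Python. It is not ReadTools code and not its API: `read_with_retry`, `read_part`, `max_attempts` and `base_delay` are hypothetical names chosen for illustration. It assumes the failure surfaces as an `OSError` (which includes "file not found") and that `read_part` re-opens and reads the part from the beginning on every call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_retry(
    read_part: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call read_part(), retrying on OSError with exponential backoff.

    read_part is a hypothetical callable that opens one part and returns
    its content; it must start from scratch on every call.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return read_part()
        except OSError:
            # Give up and re-raise the original error after the last attempt.
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... seconds.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Because each attempt repeats the whole read, a failure in the middle of a part is only recovered if the partial result of the failed attempt is discarded rather than appended to the output. Whether that is feasible inside the merge step is an open question here.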