I'm not familiar with databricks, but if it's just a mount point I'm surprised. Was there an error message or was it silent? Is it reproducible with smaller objects?
There was no error message; it just failed silently. Yes, it happens with all objects, including smaller ones.
Thanks, I'd like to understand more. What's the easiest way to set up a Databricks system?
I believe you can try Databricks Community Edition. (https://databricks.com/product/faq/community-edition)
I signed up for the community edition, but I'm not able to repro the issue (see below). Any ideas?
Since you're working with very large data, it could be an issue if you run out of memory or file space. But since you said it happens on small objects too, I am not sure.
Thanks for exploring this! I tried the exact same commands and still had the same problem. There was no error message from qsave(); it failed silently. Is there anything else I can try to identify the problem? Thanks!
I'm not sure what the issue is; I'm going to need more info.
Did you test this on the community edition? What's the difference between your setup and the community edition?
Could you post `sessionInfo()`?
And can you test out installing the latest update? `devtools::install_github("traversc/qs")`
Thanks for looking into this!
I tested on the community edition and it worked perfectly, just like what you showed. This time I also tried `fst::write_fst()`: it worked on the community edition, but not in my setup (paid version).
In summary:
- `saveRDS(mtcars, "/dbfs/temp.rds")` worked on both the community edition and my setup. Several other functions also worked on my setup when writing to DBFS, such as `data.table::fwrite()` and `arrow::write_arrow()`.
- `qsave(mtcars, "/dbfs/temp.qs")` worked only on the community edition. On my setup it didn't work (no file saved), but no error message was shown.
- `qsave(mtcars, "temp.qs")` worked on my setup, with the file saved on the local driver node.
- `qread("/dbfs/temp.qs")` worked after I manually copied the file from the local driver node to DBFS (see the sketch after this list).
- `write_fst(mtcars, "/dbfs/temp.fst")` worked only on the community edition.
- `write_fst(mtcars, "temp.fst")` worked on my setup, with the file saved on the local driver node. `read_fst("temp.fst")` worked. However, `read_fst("/dbfs/temp.fst")` didn't work even after I manually moved the file from the local driver node to DBFS. Please find the error message below.
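To make the workaround concrete, here is a minimal sketch of what I'm doing now (the file names are just examples; it assumes the driver node's local disk is writable and that a plain sequential copy onto `/dbfs` works, which it does for me):

```r
# Write the .qs file to the driver's local disk first, because qsave()
# fails silently when writing directly to the DBFS FUSE mount.
local_path <- file.path(tempdir(), "temp.qs")
qs::qsave(mtcars, local_path)

# A sequential copy onto DBFS works, and reading back from DBFS works too.
file.copy(local_path, "/dbfs/temp.qs", overwrite = TRUE)
qs::qread("/dbfs/temp.qs")
```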
I noticed this line in the DBFS documentation (https://docs.databricks.com/data/databricks-file-system.html):

> Does not support random writes. For workloads that require random writes, perform the I/O on local disk first and then copy the result to /dbfs.

Do you think this could be relevant? I can definitely try to provide more details.
```
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] fst_0.9.4         arrow_3.0.0       qs_0.23.6         data.table_1.12.6
[5] dplyr_0.8.3

loaded via a namespace (and not attached):
 [1] stringfish_0.15.0   Rcpp_1.0.2          magrittr_1.5
 [4] bit_1.1-14          tidyselect_0.2.5    RApiSerialize_0.1.0
 [7] R6_2.4.0            rlang_0.4.1         hwriter_1.3.2
[10] SparkR_2.4.5        tools_3.6.2         parallel_3.6.2
[13] htmltools_0.4.0     bit64_0.9-7         RcppParallel_5.0.3
[16] assertthat_0.2.1    digest_0.6.22       tibble_2.1.3
[19] Rserve_1.8-6        crayon_1.3.4        purrr_0.3.3
[22] vctrs_0.2.0         hwriterPlus_1.0-3   zeallot_0.1.0
[25] glue_1.3.1          compiler_3.6.2      pillar_1.4.2
[28] backports_1.1.5     TeachingDemos_2.10  pkgconfig_2.0.3
```
Thank you for digging that up. I didn't even realize it was possible for a file system to not support random writes, lol.
I'll try to investigate a bit more later this weekend.
If that is the culprit, there's not much that can be done on my end, and it's on Microsoft.
I think the DBFS FUSE mount version causes the problem. Thanks again for taking the time to look into this!
According to the Databricks documentation:

> Azure Databricks uses a FUSE mount to provide local access to files stored in the cloud. A FUSE mount is a secure, virtual filesystem.
> - FUSE V2 (default for Databricks Runtime 6.x and 7.x). Does not support random writes.
> - FUSE V1 (default for Databricks Runtime 5.5 LTS). If you experience issues with FUSE V1 on 5.5 LTS, Databricks recommends that you use FUSE V2 instead. You can override the default FUSE version in 5.5 LTS by setting the environment variable `DBFS_FUSE_VERSION=2`.
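(In case it helps anyone else reading this: a quick, hedged way to see from R whether that override is in effect on a cluster; the variable itself has to be configured in the cluster settings, not from the R session.)

```r
# Returns "2" if the DBFS_FUSE_VERSION override is set, otherwise the fallback.
Sys.getenv("DBFS_FUSE_VERSION", unset = "not set")
```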
I've tried the following on my setup across the FUSE configurations described above: `qsave(mtcars, "/dbfs/temp.qs")` worked in one configuration, but in the other two it didn't work and gave no error message. The same results applied to `fst::write_fst(mtcars, "/dbfs/temp.fst")` as well.
I think you're right. The community edition worked on all versions (maybe not a real FUSE drive), and I didn't get around to spinning up an Azure system.
The new version of qs (0.24.1 on CRAN) should hopefully at least return an error. Could you check that out if you're using 0.23.6?
I used 0.24.1 to produce the above results today :) No error messages.
Yes, the community edition worked on all versions. I realized the community edition has different disk mappings: in the community edition, `/dbfs/` maps to the local file system, which shows up under `%fs ls file:/dbfs/` but not under `%fs ls dbfs:/`, while in the paid edition `/dbfs/` refers to files on DBFS proper (`%fs ls dbfs:/`).
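(A hedged way to check this from an R notebook on the cluster, in case the `%fs` magic isn't handy; the shell command is just a sketch and assumes a typical Linux driver image.)

```r
# If /dbfs is a real FUSE mount it should appear in the mount table;
# on Community Edition it may just be a plain directory on the driver.
system("mount | grep -i dbfs", intern = TRUE)
list.files("/dbfs")
```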
I loaded up Azure and did some testing finally.
devtools::install_github("traversc/stringfish") # dependancy, necessary to use github version for other reasons
devtools::install_github("traversc/qs")
library(qs)
qsave(1, file="/dbfs/temp.qs")
Should give you an error now:
basic_ios::clear: iostream error
This could be a little more descriptive, but at least it's not silent.
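Until the message is improved, here is a hedged sketch of how a caller could wrap the call to surface a more descriptive error; the FUSE explanation in the message is our conclusion from this thread, not something qs itself reports:

```r
# Re-raise the failure with added context about the DBFS FUSE limitation.
tryCatch(
  qs::qsave(mtcars, "/dbfs/temp.qs"),
  error = function(e) {
    stop("qsave() failed for '/dbfs/temp.qs': ", conditionMessage(e),
         " (DBFS FUSE V2 does not support the random writes qsave needs)",
         call. = FALSE)
  }
)
```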
Btw, I believe fst is already fixed (ver 0.9.4)?
```r
library(fst)
write_fst(mtcars, path="/dbfs/temp.fst")
```

```
Error in write_fst(mtcars, path = "/dbfs/temp.fst") :
  There was an error during the write operation,
  fst file might be corrupted. Please check available disk space and access rights.
```
Sorry for the confusion. Yes, fst v0.9.4 also can't write to DBFS (FUSE V2), but it does provide an error message.
I've tried installing the latest update and now I get an error message: `basic_ios::clear: iostream error`. Thanks for helping with this!
I'm happy to start a new issue since it's only tangentially related, but I was also getting `basic_ios::clear: iostream error`. I was red-lining my memory use, so I thought it could be from that. But when I changed it to `saveRDS` I got `cannot open compressed file '/home/...VERY LONG FILENAME...', probable reason 'File name too long'`, and I didn't realize there was such a limit. I made a shorter filename, reran, and both qs and saveRDS worked.
I think it could be helpful for qs to also give probable reasons for this kind of weirdness if possible, because it's a pretty easy fix once you know what to do.
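In the meantime, this is roughly the pre-flight check I've added on my side; just a sketch with a hypothetical wrapper name, and the 255-byte figure is the usual per-component limit on Linux filesystems rather than anything qs documents:

```r
# Fail with an explicit reason before qsave() hits the filesystem's name limit.
safe_qsave <- function(object, path, ...) {
  if (nchar(basename(path), type = "bytes") > 255) {
    stop("File name is longer than 255 bytes: ", basename(path))
  }
  qs::qsave(object, path, ...)
}
```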
(Side note: qs is amazing and so fast. I use it constantly every day and it's great!)
Also, just for posterity, it's a 255 character limit on my system:

```r
library(qs); library(purrr); library(glue); library(magrittr)

# write an object under progressively longer file names until it errors out
seq(1, 500) %>% map(~{
  print(.)
  FN <- paste0(rep("A", .), collapse = "")
  qsave(1, glue("/tmp/{FN}"))
})
```
This is a wonderful package! `qsave()` is 24x faster than `saveRDS()` in my case when saving a large R object (30-60 GB).
When I was working on Databricks clusters, I was unable to write a .qs file with `qsave(df, "/dbfs/myfile.qs")` (no error message). However, I can successfully do `saveRDS(df, "/dbfs/myfile.rds")`. I figured out a workaround: write to the local driver node first with `qsave(df, "myfile.qs")` and then transfer the .qs file to the DBFS location. I am pretty sure this is not the most efficient way; did I miss anything here? Meanwhile, I had no problem reading from DBFS after I transferred the file: `qread("/dbfs/myfile.qs")`.