microsoft / AzureSMR

AzureSMR is no longer being actively developed. For ongoing support of Azure in R, see: https://github.com/Azure/AzureR

Downloading large files (>2Gb) from Azure blob storage to R #83

Open Dacolo opened 7 years ago

Dacolo commented 7 years ago

The need is to download files from blob storage into an Azure VM. For smaller files, the code below works, and it is also possible to set type to "text" and leave out the rawToChar call. For larger files (>2 GB), downloading with type "text" throws an error, because a single character string in R cannot exceed 2^31 - 1 bytes. Downloading with type set to "raw" works, but then rawToChar fails with the same kind of error (long vectors not supported yet: /builddir/patched_source/src/main/raw.c:68).

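# Import data (encode spaces in the blob name as %20)
filename <- "20150101.csv"

# Download the blob as a raw vector (sc is the Azure context, sKey the storage key)
data <- azureGetBlob(sc,
                     resourceGroup = "CLOGGING-PROD-RG",
                     storageAccount = "cloggingmlsinput2",
                     container = "mlsinput",
                     storageKey = sKey,
                     blob = filename,
                     type = "raw")

# Convert the bytes to one character string; this is the step that fails for >2 GB
data2 <- rawToChar(data, multiple = FALSE)

# Parse the string as CSV
data3 <- read.csv(text = data2)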

Error for large files:
Error in readBin(content, character()) : 
  R character strings are limited to 2^31-1 bytes

Could we fix it? Thanks!

hongooi73 commented 7 years ago

This is a fundamental limit in R. As a workaround, you can use download.file to save the file to disk, rather than trying to create an in-memory object.
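A minimal sketch of this workaround, assuming the blob is readable by plain URL (for example, a public container). The helper name fetch_csv_via_disk is hypothetical and not part of AzureSMR; only base R functions are used.

```r
# Hypothetical helper (not part of AzureSMR): save a URL-readable blob to disk,
# then parse the file instead of building one huge in-memory string.
fetch_csv_via_disk <- function(blob_url, destfile = tempfile(fileext = ".csv")) {
  # mode = "wb" keeps the bytes unchanged on every platform
  status <- download.file(blob_url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  # read.csv() reads from the file, so rawToChar() is never needed
  read.csv(destfile)
}

# Usage (assumption: the container allows anonymous read access):
# df <- fetch_csv_via_disk(
#   "https://<account>.blob.core.windows.net/<container>/20150101.csv")
```

The resulting data frame still has to fit in memory; only the single-string limit is avoided.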

andrie commented 7 years ago

I don't think this is a fundamental limit, actually. I don't have the link ready, but I think the API allows downloading a blob in chunks. In other words, the download turns into a streaming operation where you stream bytes from the blob to local disk.
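The streaming idea can be sketched in base R as a loop that copies fixed-size chunks from an open binary connection to a file. The helper name stream_to_disk and the default chunk size are assumptions for illustration; how the connection to a private blob is authenticated is left out here.

```r
# Hypothetical helper: copy bytes from an open binary connection to a file in
# fixed-size chunks, so roughly chunk_bytes are held in memory at any time.
stream_to_disk <- function(con, destfile, chunk_bytes = 8 * 1024^2) {
  out <- file(destfile, open = "wb")
  on.exit(close(out))
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0) break   # nothing left to read
    writeBin(chunk, out)
    total <- total + length(chunk)
  }
  invisible(total)                  # bytes written, kept as a double
}

# Usage (assumption: blob_url is readable without extra request headers):
# con <- url(blob_url, open = "rb")
# stream_to_disk(con, "20150101.csv")
# close(con)
```

Because each chunk is well below 2^31 - 1 bytes, neither the string limit nor a very large single allocation comes into play.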

hongooi73 commented 7 years ago

Well, I meant a fundamental limit on the length of character strings. Downloading to disk would get around that, sure. Wouldn't it still be simpler to use download.file?

andrie commented 7 years ago

download.file() will only work if the blob is public. If it's private, you must access it via the REST API.

Also, I thought the vector limit is 2^58 - 1?

hongooi73 commented 7 years ago

Ah yes, I forgot about public v private.

The limit on the number of strings in a vector is essentially how much memory you have, but the limit on the size of a single string is 2^31 - 1 bytes; see ?"Memory-limits", which says:

There are also limits on individual objects. The storage space cannot exceed the address limit, and if you try to exceed that limit, the error message begins cannot allocate vector of length. The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.
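Since the limit applies to a single character string and not to the file, one way around it, sketched here under assumptions, is to write the raw bytes to disk in slices and let read.csv parse the file. The helper name raw_to_csv and its chunk size are hypothetical; the sketch assumes the raw vector was already obtained with type = "raw", which the original report says works.

```r
# Hypothetical helper: write a raw vector to disk in slices, then parse the
# file, so no single character string holding the whole payload is created.
raw_to_csv <- function(bytes, chunk_bytes = 64 * 1024^2,
                       path = tempfile(fileext = ".csv")) {
  out <- file(path, open = "wb")
  n <- length(bytes)                # a double if bytes is a long vector
  start <- 1
  tryCatch({
    while (start <= n) {
      end <- min(start + chunk_bytes - 1, n)
      writeBin(bytes[start:end], out)  # each slice is far below 2^31 - 1 bytes
      start <- end + 1
    }
  }, finally = close(out))
  read.csv(path)
}

# The string limit itself is the largest 32-bit signed integer:
# .Machine$integer.max == 2^31 - 1

# Usage with the code from the original post (replaces rawToChar + read.csv):
# data3 <- raw_to_csv(data)
```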