oxinabox / DataDeps.jl

reproducible data setup for reproducible science

Download problem: file name too long #137

Closed greimel closed 3 years ago

greimel commented 3 years ago

Hi,

I would like to save this dataset

url = "https://data.humdata.org/dataset/e9988552-74e4-4ff4-943f-c782ac8bca87/resource/7570bcc3-a208-49c4-8821-17f8df93c0e2/download/gadm1_nuts2_gadm1_nuts2_aug2020.tsv"

f = download(url) works fine: it downloads the 83 MB file to a temporary file. When I define a DataDep, however, I get an error.

sci_dep = DataDep(
    "sci-nuts2",
    "",
    "https://data.humdata.org/dataset/e9988552-74e4-4ff4-943f-c782ac8bca87/resource/7570bcc3-a208-49c4-8821-17f8df93c0e2/download/gadm1_nuts2_gadm1_nuts2_aug2020.tsv"
)

download(sci_dep, mktempdir())

I get

┌ Info: Downloading
│   source = "https://data.humdata.org/dataset/e9988552-74e4-4ff4-943f-c782ac8bca87/resource/7570bcc3-a208-49c4-8821-17f8df93c0e2/download/gadm1_nuts2_gadm1_nuts2_aug2020.tsv"
│   dest = "/var/folders/8w/dmktz6rj1mq_4gzj_j56p9wm0000gn/T/jl_prGmVu/gadm1_nuts2_gadm1_nuts2_aug2020.tsv"
│   progress = 1.0
│   time_taken = "0.02 s"
│   time_remaining = "0.0 s"
│   average_speed = "1.541 MiB/s"
│   downloaded = "28.411 KiB"
│   remaining = "0 bytes"
└   total = "28.411 KiB"

Then it hangs for a while. (Note the size: 28 KiB instead of 83 MB.)

ERROR: SystemError: opening file "/var/folders/8w/dmktz6rj1mq_4gzj_j56p9wm0000gn/T/jl_prGmVu/gadm1_nuts2_gadm1_nuts2_aug2020.tsv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=180&X-Amz-Credential=AKIARZNKTAO7U6UN77MP%2F20210313%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-SignedHeaders=host&X-Amz-Date=20210313T090308Z&X-Amz-Signature=9cb9acea53c1c94268f1d4ea37c4c0cc6150e997407480e36ee170d9476556f3": File name too long
Stacktrace:
 [1] systemerror(::String, ::Int32; extrainfo::Nothing) at ./error.jl:168
 [2] #systemerror#48 at ./error.jl:167 [inlined]
 [3] systemerror at ./error.jl:167 [inlined]
 [4] open(::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Bool, append::Nothing) at ./iostream.jl:284
 [5] open(::String, ::String; lock::Bool) at ./iostream.jl:346
 [6] open(::String, ::String) at ./iostream.jl:346
 [7] open(::HTTP.var"#24#31"{HTTP.Streams.Stream{HTTP.Messages.Response,HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}},HTTP.var"#report_callback#30"{Float64,Dates.DateTime,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},Float32}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:323
 [8] open(::Function, ::String, ::String) at ./io.jl:323
 [9] (::HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}})(::HTTP.Streams.Stream{HTTP.Messages.Response,HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/download.jl:132
 [10] macro expansion at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/StreamRequest.jl:70 [inlined]
 [11] macro expansion at ./task.jl:332 [inlined]
 [12] request(::Type{HTTP.StreamRequest.StreamLayer{Union{}}}, ::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, ::HTTP.Messages.Request, ::Nothing; reached_redirect_limit::Bool, response_stream::Nothing, iofunction::HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}}, verbose::Int64, kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/StreamRequest.jl:57
 [13] request(::Type{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}, ::URIs.URI, ::HTTP.Messages.Request, ::Nothing; proxy::Nothing, socket_type::Type{T} where T, reuse_limit::Int64, kw::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:iofunction, :reached_redirect_limit),Tuple{HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},Bool}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/ConnectionRequest.jl:108
 [14] request(::Type{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}, ::URIs.URI, ::Vararg{Any,N} where N; kw::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:iofunction, :reached_redirect_limit),Tuple{HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},Bool}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/ExceptionRequest.jl:19
 [15] (::Base.var"#56#58"{Base.var"#56#57#59"{ExponentialBackOff,HTTP.RetryRequest.var"#2#3"{Bool,HTTP.Messages.Request},typeof(HTTP.request)}})(::Type{T} where T, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:iofunction, :reached_redirect_limit),Tuple{HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},Bool}}}) at ./error.jl:288
 [16] #request#1 at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/RetryRequest.jl:44 [inlined]
 [17] request(::Type{HTTP.MessageRequest.MessageLayer{HTTP.RetryRequest.RetryLayer{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}}}, ::String, ::URIs.URI, ::Array{Pair{SubString{String},SubString{String}},1}, ::Nothing; http_version::VersionNumber, target::String, parent::HTTP.Messages.Response, iofunction::Function, kw::Base.Iterators.Pairs{Symbol,Bool,Tuple{Symbol},NamedTuple{(:reached_redirect_limit,),Tuple{Bool}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/MessageRequest.jl:66
 [18] request(::Type{HTTP.BasicAuthRequest.BasicAuthLayer{HTTP.MessageRequest.MessageLayer{HTTP.RetryRequest.RetryLayer{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}}}}, ::String, ::URIs.URI, ::Array{Pair{SubString{String},SubString{String}},1}, ::Nothing; kw::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:reached_redirect_limit, :iofunction, :parent),Tuple{Bool,HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},HTTP.Messages.Response}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/BasicAuthRequest.jl:28
 [19] request(::Type{HTTP.RedirectRequest.RedirectLayer{HTTP.BasicAuthRequest.BasicAuthLayer{HTTP.MessageRequest.MessageLayer{HTTP.RetryRequest.RetryLayer{HTTP.ExceptionRequest.ExceptionLayer{HTTP.ConnectionRequest.ConnectionPoolLayer{HTTP.StreamRequest.StreamLayer{Union{}}}}}}}}}, ::String, ::URIs.URI, ::Array{Pair{SubString{String},SubString{String}},1}, ::Nothing; redirect_limit::Int64, forwardheaders::Bool, kw::Base.Iterators.Pairs{Symbol,HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},Tuple{Symbol},NamedTuple{(:iofunction,),Tuple{HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}}}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/RedirectRequest.jl:24
 [20] request(::String, ::String, ::Array{Pair{SubString{String},SubString{String}},1}, ::Nothing; headers::Array{Pair{SubString{String},SubString{String}},1}, body::Nothing, query::Nothing, kw::Base.Iterators.Pairs{Symbol,HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}},Tuple{Symbol},NamedTuple{(:iofunction,),Tuple{HTTP.var"#23#29"{Float32,String,String,HTTP.var"#format_progress#25",HTTP.var"#format_bytes#26",HTTP.var"#format_seconds#27",HTTP.var"#format_bytes_per_second#28"{HTTP.var"#format_bytes#26"}}}}}) at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/HTTP.jl:315
 [21] #open#7 at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/HTTP.jl:349 [inlined]
 [22] open at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/HTTP.jl:349 [inlined]
 [23] #download#22 at /Users/fabiangreimel/.julia/packages/HTTP/cxgat/src/download.jl:101 [inlined]
 [24] #fetch_http#26 at /Users/fabiangreimel/.julia/packages/DataDeps/ooWXe/src/fetch_helpers.jl:80 [inlined]
 [25] fetch_http at /Users/fabiangreimel/.julia/packages/DataDeps/ooWXe/src/fetch_helpers.jl:79 [inlined]
 [26] fetch_default(::String, ::String) at /Users/fabiangreimel/.julia/packages/DataDeps/ooWXe/src/fetch_helpers.jl:33
 [27] run_fetch at /Users/fabiangreimel/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:99 [inlined]
 [28] download(::DataDep{Nothing,String,typeof(DataDeps.fetch_default),typeof(identity)}, ::String; remotepath::String, i_accept_the_terms_of_use::Nothing, skip_checksum::Bool) at /Users/fabiangreimel/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:78
 [29] download(::DataDep{Nothing,String,typeof(DataDeps.fetch_default),typeof(identity)}, ::String) at /Users/fabiangreimel/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:70
 [30] top-level scope at REPL[22]:1

Do you have any idea what's going wrong?

The same error occurs when registering the DataDep and then doing datadep"sci-nuts2".

oxinabox commented 3 years ago

Oh sorry, I thought I replied to this ages ago. Something is going wrong with HTTP.download's ability to detect the filename: as the error message shows, the query string of the signed URL ends up in the local filename. That will have to be solved upstream in https://github.com/JuliaWeb/HTTP.jl/issues/696
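To illustrate the failure mode: in the error above, the destination filename contains everything from ?X-Amz-Algorithm=... onward, which is what makes it too long for the filesystem. Below is a minimal sketch of deriving a filename from the URL path alone. It assumes nothing about how HTTP.jl or DataDeps.jl do this internally; the helper filename_from_url and the example.com URL are hypothetical and exist only for illustration.

```julia
# Hypothetical helper (not part of DataDeps.jl or HTTP.jl):
# derive a local filename from a URL using only its path component.
function filename_from_url(url::AbstractString)
    # Drop the fragment (#...) first, then the query string (?...).
    without_fragment = first(split(url, '#'; limit=2))
    path = first(split(without_fragment, '?'; limit=2))
    # The filename is the last segment of the remaining path.
    return String(last(split(path, '/')))
end

# Made-up signed URL in the same style as the one in the error message.
signed = "https://example.com/bucket/gadm1_nuts2_gadm1_nuts2_aug2020.tsv" *
         "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=180"

println(filename_from_url(signed))
```

Keeping the query string out of the filename also keeps short-lived signature parameters out of the local path, so repeated downloads would land in the same file.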

Luckily, this is one of the cases where the simpler way of determining the filename works:

julia> DataDeps.fetch_base(url, pwd())
"/Users/oxinabox/temp/11/gadm1_nuts2_gadm1_nuts2_aug2020.tsv"

This means we have a workaround: set the fetch_method.

sci_dep = DataDep(
    "sci-nuts2",
    "",
    "https://data.humdata.org/dataset/e9988552-74e4-4ff4-943f-c782ac8bca87/resource/7570bcc3-a208-49c4-8821-17f8df93c0e2/download/gadm1_nuts2_gadm1_nuts2_aug2020.tsv";
    fetch_method=DataDeps.fetch_base,  # workaround https://github.com/JuliaWeb/HTTP.jl/issues/696
)

oxinabox commented 3 years ago

Closing, as there is no action we can take here; this needs to be solved upstream.

greimel commented 3 years ago

Thanks for investigating!