ropensci / onekp

Access sequences from the 1000 Plant Initiative (1KP)
https://docs.ropensci.org/onekp

Problems downloading sequence files from onekp #3

Closed: tonyaseverson closed this issue 4 years ago

tonyaseverson commented 5 years ago

This is the minimal code to reproduce the issue:

onekp <- retrieve_onekp()
seqs <- filter_by_code(onekp, c('URDJ'))
download_peptides(seqs, 'oneKP/pep')
#> Warning in system(cmd, intern = TRUE): running command '/usr/bin/tar -tf
#> 'oneKP/pep/URDJ.faa.tar.bz2'' had status 1
#> Warning in untar(path, compressed = "bzip2", exdir = dir): '/
#> usr/bin/tar -xf 'oneKP/pep/URDJ.faa.tar.bz2' -C '/var/folders/
#> n9/67cpgppn3n91037xr9f6sfy80000gn/T//RtmprbZgUN/onekp_sequences'' returned
#> error code 1
#>                    6 
#> "oneKP/pep/URDJ.faa"
download_nucleotides(seqs, 'oneKP/nuc')
#> Warning in system(cmd, intern = TRUE): running command '/usr/bin/tar -tf
#> 'oneKP/nuc/URDJ.fna.tar.bz2'' had status 1
#> Warning in untar(path, compressed = "bzip2", exdir = dir): '/
#> usr/bin/tar -xf 'oneKP/nuc/URDJ.fna.tar.bz2' -C '/var/folders/
#> n9/67cpgppn3n91037xr9f6sfy80000gn/T//RtmprbZgUN/onekp_sequences'' returned
#> error code 1
#>                    7 
#> "oneKP/nuc/URDJ.fna"

Created on 2019-08-21 by the reprex package (v0.3.0)

Session info

``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS 10.14.5
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Vancouver
#>  date     2019-08-21
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.5.1)
#>  backports     1.1.4      2019-04-10 [1] CRAN (R 3.5.2)
#>  bit           1.1-14     2018-05-29 [1] CRAN (R 3.5.0)
#>  bit64         0.9-7      2017-05-08 [1] CRAN (R 3.5.0)
#>  blob          1.2.0      2019-07-09 [1] CRAN (R 3.5.2)
#>  callr         3.3.1      2019-07-18 [1] CRAN (R 3.5.2)
#>  cli           1.1.0      2019-03-19 [1] CRAN (R 3.5.2)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)
#>  curl          4.0        2019-07-22 [1] CRAN (R 3.5.2)
#>  DBI           1.0.0      2018-05-02 [1] CRAN (R 3.5.0)
#>  dbplyr        1.4.2      2019-06-17 [1] CRAN (R 3.5.2)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.1.0      2019-07-06 [1] CRAN (R 3.5.2)
#>  digest        0.6.20     2019-07-04 [1] CRAN (R 3.5.2)
#>  dplyr         0.8.3      2019-07-04 [1] CRAN (R 3.5.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.5.2)
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.5.2)
#>  glue          1.3.1      2019-03-12 [1] CRAN (R 3.5.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.5.1)
#>  hoardr        0.5.2      2018-12-02 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)
#>  httr          1.4.1      2019-08-05 [1] CRAN (R 3.5.2)
#>  knitr         1.23       2019-05-18 [1] CRAN (R 3.5.2)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)
#>  onekp       * 0.2.2      2019-08-21 [1] Github (ropensci/onekp@6eace96)
#>  pillar        1.4.2      2019-06-29 [1] CRAN (R 3.5.2)
#>  pkgbuild      1.0.3      2019-03-20 [1] CRAN (R 3.5.1)
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.5.2)
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr         0.3.2      2019-03-15 [1] CRAN (R 3.5.2)
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)
#>  rappdirs      0.3.1      2016-03-28 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.2      2019-07-25 [1] CRAN (R 3.5.2)
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.5.2)
#>  rlang         0.4.0      2019-06-25 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.14       2019-07-12 [1] CRAN (R 3.5.2)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)
#>  RSQLite       2.1.2      2019-07-24 [1] CRAN (R 3.5.2)
#>  rvest         0.3.4      2019-05-15 [1] CRAN (R 3.5.2)
#>  selectr       0.4-1      2018-04-06 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.5.2)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)
#>  taxizedb      0.1.9.9130 2019-08-21 [1] Github (ropensci/taxizedb@8ee0ab9)
#>  testthat      2.2.1      2019-07-25 [1] CRAN (R 3.5.2)
#>  tibble        2.1.3      2019-06-06 [1] CRAN (R 3.5.2)
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.5.2)
#>  vctrs         0.2.0      2019-07-05 [1] CRAN (R 3.5.2)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.8        2019-06-25 [1] CRAN (R 3.5.2)
#>  xml2          1.2.2      2019-08-09 [1] CRAN (R 3.5.2)
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)
#>  zeallot       0.1.0      2018-01-28 [1] CRAN (R 3.5.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
```

tonyaseverson commented 5 years ago
arendsee commented 5 years ago

@tfsevers88 Thanks for the report. The bz2 extension implies that the files are compressed with bzip2, but they are not. This must be a recent change since the code used to work.

I'll write the data maintainer and ask if they can either recompress the files or make the extensions consistent with the compression method.

In the meantime, I'll write a temporary workaround.
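
One quick way to check whether a downloaded archive really is bzip2 data, regardless of its extension, is to look at its first three bytes: bzip2 streams begin with the ASCII signature `BZh`. The following is only a sketch; the `is_bzip2` helper name is made up for illustration and is not part of onekp.

```shell
#!/bin/bash
# Hypothetical helper (not part of onekp): succeeds only if the file
# starts with the bzip2 signature "BZh". A missing file counts as "not bzip2".
is_bzip2() {
    [ "$(head -c 3 "$1" 2>/dev/null)" = "BZh" ]
}

# Example: test one of the archives named in the warnings above
if is_bzip2 "oneKP/pep/URDJ.faa.tar.bz2"; then
    echo "looks like bzip2"
else
    echo "not bzip2 (missing file, HTML page, or a different compression)"
fi
```

If the check fails, inspecting the file with `head` or `file` usually shows what was actually downloaded.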

arendsee commented 5 years ago

Actually, I think they may have just fixed it? The code runs now on my system ...

Delete your cached data (the folder onekp downloaded), run the code again, and tell me how it goes.

tonyaseverson commented 5 years ago

Thanks, but I'm still getting similar errors. This time no directories are created and no files are downloaded.

Here is the reprex:

onekp <- retrieve_onekp()
seqs <- filter_by_code(onekp, c('URDJ','PDIE'))
download_peptides(seqs, 'oneKP/pep')
#> Warning in system(cmd, intern = TRUE): running command '/usr/bin/tar -tf
#> 'oneKP/pep/URDJ.faa.tar.bz2'' had status 1
#> Warning in untar(path, compressed = "bzip2", exdir = dir): '/
#> usr/bin/tar -xf 'oneKP/pep/URDJ.faa.tar.bz2' -C '/var/folders/
#> n9/67cpgppn3n91037xr9f6sfy80000gn/T//RtmpqY7nwy/onekp_sequences'' returned
#> error code 1
#> Warning in system(cmd, intern = TRUE): running command '/usr/bin/tar -tf
#> 'oneKP/pep/PDIE.faa.tar.bz2'' had status 1
#> Warning in untar(path, compressed = "bzip2", exdir = dir): '/
#> usr/bin/tar -xf 'oneKP/pep/PDIE.faa.tar.bz2' -C '/var/folders/
#> n9/67cpgppn3n91037xr9f6sfy80000gn/T//RtmpqY7nwy/onekp_sequences'' returned
#> error code 1
#>                    6                 3930 
#> "oneKP/pep/URDJ.faa" "oneKP/pep/PDIE.faa"
download_nucleotides(seqs, 'oneKP/nuc')
#> Warning in system(cmd, intern = TRUE): running command '/usr/bin/tar -tf
#> 'oneKP/nuc/URDJ.fna.tar.bz2'' had status 1
#> Warning in untar(path, compressed = "bzip2", exdir = dir): '/
#> usr/bin/tar -xf 'oneKP/nuc/URDJ.fna.tar.bz2' -C '/var/folders/
#> n9/67cpgppn3n91037xr9f6sfy80000gn/T//RtmpqY7nwy/onekp_sequences'' returned
#> error code 1
#> Warning in system(cmd, intern = TRUE): running command '/usr/bin/tar -tf
#> 'oneKP/nuc/PDIE.fna.tar.bz2'' had status 1
#> Warning in untar(path, compressed = "bzip2", exdir = dir): '/
#> usr/bin/tar -xf 'oneKP/nuc/PDIE.fna.tar.bz2' -C '/var/folders/
#> n9/67cpgppn3n91037xr9f6sfy80000gn/T//RtmpqY7nwy/onekp_sequences'' returned
#> error code 1
#>                    7                 3931 
#> "oneKP/nuc/URDJ.fna" "oneKP/nuc/PDIE.fna"

Created on 2019-08-22 by the reprex package (v0.3.0)

Session info

``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS 10.14.5
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Vancouver
#>  date     2019-08-22
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.5.1)
#>  backports     1.1.4      2019-04-10 [1] CRAN (R 3.5.2)
#>  bit           1.1-14     2018-05-29 [1] CRAN (R 3.5.0)
#>  bit64         0.9-7      2017-05-08 [1] CRAN (R 3.5.0)
#>  blob          1.2.0      2019-07-09 [1] CRAN (R 3.5.2)
#>  callr         3.3.1      2019-07-18 [1] CRAN (R 3.5.2)
#>  cli           1.1.0      2019-03-19 [1] CRAN (R 3.5.2)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)
#>  curl          4.0        2019-07-22 [1] CRAN (R 3.5.2)
#>  DBI           1.0.0      2018-05-02 [1] CRAN (R 3.5.0)
#>  dbplyr        1.4.2      2019-06-17 [1] CRAN (R 3.5.2)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.1.0      2019-07-06 [1] CRAN (R 3.5.2)
#>  digest        0.6.20     2019-07-04 [1] CRAN (R 3.5.2)
#>  dplyr         0.8.3      2019-07-04 [1] CRAN (R 3.5.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.5.2)
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.5.2)
#>  glue          1.3.1      2019-03-12 [1] CRAN (R 3.5.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.5.1)
#>  hoardr        0.5.2      2018-12-02 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)
#>  httr          1.4.1      2019-08-05 [1] CRAN (R 3.5.2)
#>  knitr         1.23       2019-05-18 [1] CRAN (R 3.5.2)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)
#>  onekp       * 0.2.2      2019-08-21 [1] Github (ropensci/onekp@6eace96)
#>  pillar        1.4.2      2019-06-29 [1] CRAN (R 3.5.2)
#>  pkgbuild      1.0.3      2019-03-20 [1] CRAN (R 3.5.1)
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.5.2)
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr         0.3.2      2019-03-15 [1] CRAN (R 3.5.2)
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)
#>  rappdirs      0.3.1      2016-03-28 [1] CRAN (R 3.5.0)
#>  Rcpp          1.0.2      2019-07-25 [1] CRAN (R 3.5.2)
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.5.2)
#>  rlang         0.4.0      2019-06-25 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.14       2019-07-12 [1] CRAN (R 3.5.2)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)
#>  RSQLite       2.1.2      2019-07-24 [1] CRAN (R 3.5.2)
#>  rvest         0.3.4      2019-05-15 [1] CRAN (R 3.5.2)
#>  selectr       0.4-1      2018-04-06 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.5.2)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)
#>  taxizedb      0.1.9.9130 2019-08-21 [1] Github (ropensci/taxizedb@8ee0ab9)
#>  testthat      2.2.1      2019-07-25 [1] CRAN (R 3.5.2)
#>  tibble        2.1.3      2019-06-06 [1] CRAN (R 3.5.2)
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.5.2)
#>  vctrs         0.2.0      2019-07-05 [1] CRAN (R 3.5.2)
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.8        2019-06-25 [1] CRAN (R 3.5.2)
#>  xml2          1.2.2      2019-08-09 [1] CRAN (R 3.5.2)
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)
#>  zeallot       0.1.0      2018-01-28 [1] CRAN (R 3.5.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
```

tonyaseverson commented 5 years ago

When I try to download the files from the onekp_public_data.html page, I see this:

image

I had coded a workaround, but it only downloaded one file and then seemed to time out on subsequent downloads. Perhaps the technical difficulties Google reports with virus scanning are the root cause?

arendsee commented 5 years ago

You're right about the root problem. When I first wrote onekp, the files were all served through FTP. Then the maintainers moved them to Google Drive.

We are probably going to need some cookies. There is a Stack Overflow question that addresses this issue. Adapting the code from Tanaike:

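#!/bin/bash
# Google Drive file ID and the local name to save it under
fileid="1GrB19Tl87zAbpqh3wgO8NCi9xR371MZq"
filename="data.tar.bz2"

# First request: store the cookies Google sets in the file "cookie"
url1="https://drive.google.com/uc?export=download&id=${fileid}"
echo "$url1"
curl -c cookie -s -L "$url1" > /dev/null

# Second request: reuse the cookies and pass, as the confirm value, the last
# field of the cookie-file line that matches "download"
url2="https://drive.google.com/uc?export=download&confirm=$(awk '/download/ {print $NF}' cookie)&id=${fileid}"
echo "$url2"
curl -Lb cookie "$url2" -o "${filename}"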

This seems to work. We can implement this solution in R using the RCurl library (e.g., see this solution). If you like, you can try to get this working and submit a pull request. Alternatively, I can come back to it sometime next week (I'm booked till Monday, at least).

tonyaseverson commented 5 years ago

@arendsee: I'm a newbie to R and haven't done package development before, so I'm not sure how far I would get, and I'm running out of time to make progress on other things before classes resume. This isn't a blocker for me: I manually downloaded what I needed. I'd be happy to test if that would be of use, though, and will watch this issue.

gkoczyk commented 4 years ago

Another problem in the same vein: it now appears that some identifiers yield 403 Forbidden, as if the Google Drive mapping were off (I checked in CyVerse; both example identifiers are available in the public dataset).

> seqs2 <- filter_by_code(onekp, c('MYMP', 'ZSSR'))
> download_peptides(seqs2, 'pep2')
trying URL 'https://drive.google.com/uc?export=download&id=111S43yNcrFvDwCA9Gr0IZ9dv9FZrMCiS'
downloaded 74 KB

bzip2: (stdin) is not a bzip2 file.
/bin/tar: Child returned status 2
/bin/tar: Error is not recoverable: exiting now
bzip2: (stdin) is not a bzip2 file.
/bin/tar: Child returned status 2
/bin/tar: Error is not recoverable: exiting now
trying URL 'https://drive.google.com/uc?export=download&id=1xFt4gVlvbhSVK7-FLfTkfYckP_Sq57kk'
Content type 'application/x-bzip2' length 46 bytes
==================================================
downloaded 46 bytes

           3264            3300 
"pep2/ZSSR.faa" "pep2/MYMP.faa" 
Warning messages:
1: In system(cmd, intern = TRUE) :
  running command '/bin/tar -tf 'pep2/ZSSR.faa.tar.bz2'' had status 2
2: In untar(path, compressed = "bzip2", exdir = dir) :
  ‘/bin/tar -xf 'pep2/ZSSR.faa.tar.bz2' -C '/tmp/Rtmp7aEZy9/onekp_sequences'’ returned error code 2
arendsee commented 4 years ago

@gkoczyk I'll check up on this. The shell script above seems to still work. I'll see if I can implement the same behavior in R.

arendsee commented 4 years ago

@gkoczyk My last commit should have fixed the problem. If not, you may reopen the issue. My fix may have broken Windows compatibility.

arendsee commented 4 years ago

It turns out all the cookie shenanigans were quite unnecessary; all I needed to add to the curl command was the -L option, which allows redirects to be followed. But now I use a system call to curl, which is probably not portable.
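
For reference, a minimal shell sketch of the simplified download described above: a single curl call with -L so that the redirect from the Google Drive link is followed. This is not the code used inside onekp. The `drive_url` and `fetch_drive_file` names are hypothetical, and the `-f` flag is an addition that was not mentioned in the thread.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not the code used inside onekp.

# Build the direct-download URL for a Google Drive file ID
# (same URL form as in the download logs earlier in this thread).
drive_url() {
    printf 'https://drive.google.com/uc?export=download&id=%s' "$1"
}

# Download one file, following redirects (-L).
# -f (an addition, not from the thread) makes curl exit non-zero on HTTP
# errors such as 403 instead of saving the error body; -sS hides the
# progress meter but still prints error messages.
fetch_drive_file() {
    local fileid="$1" dest="$2"
    curl -fsSL -o "$dest" "$(drive_url "$fileid")"
}

# Usage (file ID taken from the earlier script in this thread):
# fetch_drive_file "1GrB19Tl87zAbpqh3wgO8NCi9xR371MZq" "data.tar.bz2"
```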