smsrts / EricScript

Updated EricScript from https://sites.google.com/site/bioericscript/
GNU General Public License v3.0

Issues with data download #1

Open gurpreet-bioinfo opened 3 years ago

gurpreet-bioinfo commented 3 years ago

Hi @smsrts,

Thanks for the updated ericscript.pl

I ran ericscript.pl --printdb and it's taking forever.

Output:

Selected Ensembl version: 1 
Installed Ensembl version: No database installed 
Available reference IDs:

Then, also tried:

ericscript.pl --downdb --refid homo_sapiens --ensversion 104

Output:

Current Ensembl version: 104 
Installed Ensembl version: No database installed 
Available reference IDs:

Then, without --ensversion:

ericscript.pl --downdb --refid homo_sapiens

Output:

Current Ensembl version: 104 
Installed Ensembl version: No database installed 
Available reference IDs:

[EricScript] Downloading homo_sapiens data. This process may take from few minutes to few hours depending on the selected genome ...
[EricScript] Error: No data available for genome homo_sapiens. Run ericscript.pl --printdb to view the available genomes.
[EricScript] Removing temporary files ...done.

Thanks.

jfass commented 3 years ago

Yah it seems the Ensembl folder structure / names are not what's expected by the download code, for release 84 and 104 (that I've checked)

jfass commented 3 years ago

Though the problem I was seeing was not exactly like what you were seeing, @gurpreet-bioinfo. I saw a failure to download the genome sequence:

Error in download.file(file.path("ftp://ftp.ensembl.org/pub", paste("release-", : cannot open URL 'ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz'

I fixed this error (so far) by editing the download.file command at the end of the lib/R/DownloadDB.R script from:

download.file(file.path("ftp://ftp.ensembl.org/pub", paste("release-", ensversion, sep = ""), "fasta", myrefid, "dna", myrefid.path), destfile = file.path(tmpfolder, "seq.fa.gz"), quiet = T)

to:

tempcmd <- paste( "cd ", tmpfolder, " && curl ftp://ftp.ensembl.org/pub/release-", ensversion, "/fasta/", myrefid, "/dna/", myrefid.path, " > seq.fa.gz", sep="" )
system( tempcmd )

There's got to be a smarter way to do that, but it fixed the issue for me, allowing the main Perl script to keep running ...
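
One possible variation on the workaround above, as a minimal shell sketch rather than a tested patch to EricScript: it builds the same URL from the release, reference ID and file name, then lets curl write the file itself so the exit status can be checked. The function names `ensembl_dna_url` and `fetch_ensembl_dna` are hypothetical; only the URL layout and the `seq.fa.gz` destination come from the commands quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; not part of EricScript.

# Print the URL of the DNA FASTA file.
# $1 = Ensembl release, $2 = reference ID, $3 = FASTA file name
ensembl_dna_url() {
    printf 'ftp://ftp.ensembl.org/pub/release-%s/fasta/%s/dna/%s\n' "$1" "$2" "$3"
}

# Download the FASTA into <folder>/seq.fa.gz, retrying a few times.
# $1..$3 as above, $4 = existing temporary folder
# curl returns a non-zero exit status when the transfer fails.
fetch_ensembl_dna() {
    curl --silent --show-error --retry 3 \
         --output "$4/seq.fa.gz" \
         "$(ensembl_dna_url "$1" "$2" "$3")"
}
```

For example, `fetch_ensembl_dna 104 homo_sapiens Homo_sapiens.GRCh38.dna.toplevel.fa.gz "$tmpfolder"` would correspond to the download that failed above, with `$tmpfolder` standing in for the script's temporary folder.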

Joshuasync commented 1 year ago

> Yah it seems the Ensembl folder structure / names are not what's expected by the download code, for release 84 and 104 (that I've checked)

Hi,

I'm facing a similar issue. I managed to get ericscript.pl --printdb to work, but it lists only a few organisms for me. I checked the ".ftplist1" file and it appears to be incomplete. Could you please let me know where I can get this file, or whether a pre-built reference is available for GRCh37? I checked https://sites.google.com/site/bioericscript/download, but it is no longer available.

ericscript.pl --printdb

Warning message:
In readLines(file.path(ericscriptfolder, "lib", "data", "_resources",  :
  incomplete final line found on '/media/syncNGS/miniconda3/envs/ericscript/share/ericscript-0.5.5-5/lib/data/_resources/.ftplist1'
Current Ensembl version: 110 
Installed Ensembl version: No database installed 
Available reference IDs:
     acanthochromis_polyacanthus 
     accipiter_nisus 
     ailuropoda_melanoleuca 
     amazona_collaria 
     amphilophus_citrinellus 
     amphiprion_ocellaris 
     amphiprion_percula 
     anabas_testudineus 
     anas_platyrhynchos
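
A note on the warning in the output above: R's `readLines` reports "incomplete final line" when a file does not end with a newline character, so the message by itself does not show that `.ftplist1` is truncated. Whether it is related to the short species list is not established in this thread. A small shell sketch for checking and, if wanted, fixing the missing newline follows; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; not part of EricScript.

# Succeed (exit 0) if the last byte of the file is a newline.
ends_with_newline() {
    [ "$(tail -c 1 "$1" | wc -l)" -eq 1 ]
}

# Append a newline only when the file is non-empty and lacks one.
fix_final_newline() {
    if [ -s "$1" ] && ! ends_with_newline "$1"; then
        printf '\n' >> "$1"
    fi
}
```

Running `fix_final_newline` on the `.ftplist1` path shown in the warning should silence the message without changing the listed entries.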