ropensci / biomartr

Genomic Data Retrieval with R
https://docs.ropensci.org/biomartr

download of genomes crashing at same step #6

Closed ARamesh123 closed 1 year ago

ARamesh123 commented 7 years ago

Hi,

I am trying to download all bacterial proteomes from NCBI and the download process crashes at the exact same step with the following error:

Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/075/GCA_000174075.1_ASM17407v1/GCA_000174075.1_ASM17407v1_protein.faa.gz' currently available? Execution halted

This is my R script

#!/usr/bin/env Rscript

library(biomartr)
meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome")

This exact same problem also occurs when I download viruses. Would any of you have an idea what's going on? There is definitely no internet connection problem.

-A

HajkD commented 7 years ago

Hi @ARamesh123

Thank you so much for making me aware of this problem.

I will try to figure out what the problem is and how to fix it.

Could it be due to the number of queries sent to NCBI, which then blocks the IP address under its query policy? I tried to avoid this by adding a Sys.sleep(0.33) after each genome is retrieved, since the NCBI policy is to perform no more than 3 queries per second.
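The throttling idea mentioned above can be sketched as follows. This is a minimal illustration and not biomartr's internal code: `throttled_lapply` and its `fetch` argument are hypothetical names, and the pause is derived from the limit of three queries per second.

```r
# Hypothetical helper: call `fetch` on each item, pausing between calls so
# that no more than `max_per_second` calls are started per second.
throttled_lapply <- function(items, fetch, max_per_second = 3) {
  pause <- 1 / max_per_second
  results <- vector("list", length(items))
  for (i in seq_along(items)) {
    results[[i]] <- fetch(items[[i]])
    # No need to wait after the last item
    if (i < length(items)) Sys.sleep(pause)
  }
  results
}

# Usage with a stand-in for a real download function
ids <- c("GCA_000174075.1", "GCA_000008865.1")
res <- throttled_lapply(ids, function(id) paste0("fetched ", id))
```

A fixed pause like this only limits the request rate on the client side; it cannot rule out other server-side limits.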

But it is more likely to be a bug since it seems to always error at the same step.

Anyway, I will try to figure this out.

Again, many thanks for your help! I highly appreciate it :)

Best wishes, Hajk

HajkD commented 7 years ago

Just a short update:

I could reproduce the error. I also get

Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/075/GCA_000174075.1_ASM17407v1/GCA_000174075.1_ASM17407v1_protein.faa.gz' currently available?

It didn't happen a few weeks ago when I tested.

OK, it must be something that changed in the assembly_summary files from NCBI that I don't catch.

I will work on it and come back to you.

Many thanks!

Hajk

ARamesh123 commented 7 years ago

Thank you for responding!!! I'll keep my eye out for this.

HajkD commented 7 years ago

Hi @ARamesh123,

I managed to find the problem and fixed it: see NEWS.

You can now download the developer version of biomartr to retrieve all bacterial proteomes from Genbank.

The problem for viruses etc is also fixed.

In detail, the problem was that when you go to "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/075/GCA_000174075.1_ASM17407v1/" you will find that for the species Buchnera aphidicola str. LSR1 Acyrthosiphon pisum (which caused the error) no proteome file is available, although genome, gff, and other files are. When I wrote the biomartr retrieval functions I assumed that NCBI stores all file types (genomes, proteomes, CDS, gff, etc.) in their curated databases such as RefSeq and GenBank whenever a genome assembly is available, because CDS, proteome, and gff files can all be generated from the genome assembly using annotation pipelines. In reality, however, the genome assemblies of some species have not been screened for CDS and translated to proteins. As a result, biomartr constructed a proteome file path following the NCBI naming convention even though no proteome was stored for that particular species.

I now implemented the helper function exists.ftp.file() to check if the actual file is stored on NCBI and added this existence check in all get*() functions (see e.g. here).
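The idea behind such an existence check can be sketched without any network access. The sketch below is not the actual exists.ftp.file() implementation (its signature is not shown in this thread): `file_in_listing` is a hypothetical helper that assumes the FTP directory listing has already been fetched as a character vector of file names, and the listing itself is made up for illustration.

```r
# Hypothetical helper: TRUE if the file named at the end of `file_url`
# appears in `listing`, a character vector of file names in that directory.
file_in_listing <- function(file_url, listing) {
  basename(file_url) %in% listing
}

# Illustrative listing for an assembly directory without a proteome file
listing <- c(
  "GCA_000174075.1_ASM17407v1_genomic.fna.gz",
  "GCA_000174075.1_ASM17407v1_genomic.gff.gz",
  "md5checksums.txt"
)
base_url <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/075/GCA_000174075.1_ASM17407v1/"
proteome_url <- paste0(base_url, "GCA_000174075.1_ASM17407v1_protein.faa.gz")
genome_url <- paste0(base_url, "GCA_000174075.1_ASM17407v1_genomic.fna.gz")

has_proteome <- file_in_listing(proteome_url, listing)  # not listed: skip download
has_genome <- file_in_listing(genome_url, listing)      # listed: safe to download
```

Checking before downloading lets a bulk retrieval skip an organism with a clear message instead of halting on a missing file.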

I am very sorry if this bug caused some problems, but I hope that now that it's fixed, all retrieval functions can unfold their full potential.

Again, many thanks for making me aware of this problem. This way I can improve biomartr so that more and more users can benefit from its functionality.

I would be very happy if you could let me know if things work fine for you now?

Best wishes, Hajk

ARamesh123 commented 7 years ago

Thank you for the fix! It works really well now. I've downloaded the sequences of interest.

HajkD commented 7 years ago

I am very happy that it works now :)

I just submitted the new version of biomartr to CRAN, so that the bug fix and new functionality will be available from CRAN too.

Best wishes, Hajk

dychangfeng commented 7 years ago

Hi Hajk,

First I want to thank you a lot for developing this package.

I am trying to use biomartr to download a whole group of bacteria. However, I always run into similar problems here.

Here is the function I used:

meta.retrieval(group = "Actinobacteria", kingdom = 'bacteria', db = 'genbank', type = 'genome', path = '~/bacteria/')

And I always get this error:

Error in value[[3L]](cond) : Something went wrong with the connection to: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/129/665/GCA_900129665.1_IMG-taxon_2617270824_annotated_assembly/

Do you have any clue what went wrong here? I downloaded the developer version of biomartr today.

Thank you!

Yun

dychangfeng commented 7 years ago

I tried another group, and it went wrong again. Here is the message:

Starting meta retrieval of all genome files within kingdom 'bacteria' and subgroup 'Thermodesulfobacteria'.
Starting retrieval of Caldimicrobium thiodismutans ...
Error in value[[3L]](cond) : Something went wrong with the connection to: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/548/275/GCF_001548275.1_ASM154827v1/
Calls: meta.retrieval ... tryCatch -> tryCatchList -> tryCatchOne ->

dychangfeng commented 7 years ago

This is another trial:

Starting meta retrieval of all genome files within kingdom 'bacteria' and subgroup 'Chlamydiae'.
Generating folder ~/ncbi_genomes/Chlamydiae/ ...
Starting retrieval of Candidatus Protochlamydia amoebophila ...
Checking md5 hash of file: ~/ncbi_genomes/Chlamydiae//Candidatus_Protochlamydia_amoebophila_md5checksums.txt ...
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/565/GCF_000011565.1_ASM1156v1/GCF_000011565.1_ASM1156v1_genomic.fna.gz' currently available?

HajkD commented 7 years ago

Hi @dychangfeng

Thank you so much for making me aware of this issue.

Since the file 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/565/GCF_000011565.1_ASM1156v1/GCF_000011565.1_ASM1156v1_genomic.fna.gz' can be downloaded manually, there must be a bug somewhere.

It's the first thing I will take care of tomorrow morning.

I will keep you posted.

Many thanks and best wishes, Hajk

HajkD commented 7 years ago

Hi @dychangfeng,

I am currently running the function meta.retrieval( group= "Actinobacteria", kingdom = 'bacteria', db = 'genbank', type = 'genome') and it is downloading all files smoothly.

Could you please try to rerun the command now?

Since I also had the same issue as you described last night, I assume that the NCBI FTP server was down again. Unfortunately, this happens quite frequently. There are several reasons for their downtime, e.g. maintenance, overload of queries, etc. So sometimes it helps to simply rerun the command the next day. Unfortunately, there is not much I can do about server downtime.

On my side, I already have on my TODO list the implementation of a better error message infrastructure for when these kinds of downtime issues occur. Unfortunately, publish-or-perish pressure has kept me quite busy lately, so I don't have much time for actual tool development. But I will do my best to keep improving biomartr, and I am happy about any help such as yours in making me aware of these issues :)

Since I am downloading all Actinobacteria right now, please drop me a personal email and I will be happy to send you the genomes I retrieved with biomartr.

Please keep me posted.

Many thanks! Hajk

HajkD commented 7 years ago

P.S.: Sometimes the download process stops because the NCBI servers seem to block too many queries in a row. In that case, you can simply rerun the same command meta.retrieval( group= "Actinobacteria", kingdom = 'bacteria', db = 'genbank', type = 'genome'): files that have already been downloaded will be recognised and skipped, and the download will resume where it left off.
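Rerunning after a dropped connection can also be automated. The following is a rough sketch, not part of biomartr: `retry` is a hypothetical wrapper, and the `flaky` function only stands in for a real download call.

```r
# Hypothetical wrapper: call `f` until it succeeds or `max_tries` is reached,
# waiting `wait_seconds` between attempts.
retry <- function(f, max_tries = 5, wait_seconds = 60) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed.")
}

# Stand-in for a download that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("connection dropped")
  "done"
}
out <- retry(flaky, max_tries = 5, wait_seconds = 0)
```

In practice `f` could be something like `function() meta.retrieval(group = "Actinobacteria", kingdom = "bacteria", db = "genbank", type = "genome")`, relying on the skipping of already downloaded files described above so that each attempt resumes rather than restarts.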

I hope this helps :)

dychangfeng commented 7 years ago

Thank you, Hajk-Georg!

I tried again this morning. I think you are right. The downloading process will stop at different files. And it is still not finished.

Could you send me the file you have? If possible?

Do you have ideas how to download different families of bacteria genomes?

Best,

Yun Ding

University of Utah

U.S.A.



HajkD commented 7 years ago

Hi @dychangfeng,

Sorry for my late reply.

I was trying to find out how best to send you the data. Could you please send me your email address in a private email (and not through the GitHub system)? I will then send you a download link via pCloud Transfer.

Unfortunately, the NCBI or Ensembl databases don't provide the family information for bacteria. So I cannot parse it from the information they provide. The only thing that could be implemented in biomartr is a sub filtering where I would have to retrieve bacteria family member names from a different database. Which database do you usually use in the bacteria community?

Many thanks and best wishes, Hajk

dychangfeng commented 7 years ago

Hi Hajk,

I understand every one of us is busy these days. Thank you for your reply!

I found a way to separate the different groups and subgroups of bacteria using https://www.ncbi.nlm.nih.gov/genome/browse/. Once I choose bacteria as the kingdom, I can download all the genome information together with groups and subgroups. Then I can use R or Python to merge the different tables.

Thank you for all your help!

Best,

Yun


ghost commented 7 years ago

I am having the following error when I am trying to download human proteome.

Starting retrieval of Homo sapiens ...
|===================================================================================| 100% 27 MB
----------> No reference genome or representative genome was found for 'Homo sapiens'. Thus, download for this species has been omitted.

This is the script:

file_path <- getProteome(db = "refseq",
                         organism = "Homo sapiens",
                         path = file.path("_ncbi_downloads", "proteomes"))

In need of help. Thanks

HajkD commented 7 years ago

Hi @kalmeshv

Thank you so much for making me aware of this issue.

I now tracked it down to a wrong filtering in all get*() functions.

So far, I implemented:

FoundOrganism <-
                dplyr::filter(
                    AssemblyFilesAllKingdoms,
                    stringr::str_detect(organism_name, organism),
                    ((refseq_category == "representative genome") ||
                         (refseq_category == "reference genome")
                    ),
                    (version_status == "latest")
                )

The issue was that I used the wrong OR operator, ||, when filtering. It seems that Homo sapiens now has two entries in the assembly_summary.txt file that I parse from NCBI RefSeq, and the || condition, which is a scalar operator and is not evaluated row by row, did not select the correct Homo sapiens entry.

I have now fixed this issue in all get*() functions, which correctly filter using the element-wise (vectorized) OR operator | instead of the scalar operator ||.
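The difference between the two operators can be seen on a small example. This sketch uses base R subsetting instead of dplyr::filter() to stay self-contained, and the table is a made-up stand-in for the parsed assembly summary rather than real NCBI data.

```r
# Toy stand-in for rows parsed from an assembly summary (values are made up)
assemblies <- data.frame(
  organism_name   = c("Homo sapiens", "Homo sapiens", "Mus musculus"),
  refseq_category = c("na", "reference genome", "reference genome"),
  version_status  = c("latest", "latest", "latest"),
  stringsAsFactors = FALSE
)

# `|` compares element by element and returns one TRUE/FALSE per row,
# which is what row filtering needs.
keep <- grepl("Homo sapiens", assemblies$organism_name, fixed = TRUE) &
  (assemblies$refseq_category == "representative genome" |
     assemblies$refseq_category == "reference genome") &
  assemblies$version_status == "latest"

found <- assemblies[keep, ]

# `||` is meant for single TRUE/FALSE values (e.g. inside `if`). Older R
# versions silently used only the first element of each side, and
# R >= 4.3.0 raises an error when either side is longer than one.
```

With two Homo sapiens rows, only the second one passes the element-wise filter, because the decision is made separately for each row.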

When you download the developer version of biomartr with

source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")

you should now be able to download the human proteome by running:

biomartr::getProteome(organism = "Homo sapiens")

Please let me know if it now works for you and I will immediately submit the new version of biomartr to CRAN.

Kind regards, Hajk

HajkD commented 7 years ago

Hi Hajk,

I am having trouble downloading the entire bacteria kingdom proteome from both genbank and refseq. I am running the developer version of biomartr on Windows and keep getting the error

Error in value[[3L]](cond) : Something went wrong with the connection to: NA/

when I get to Campylobacter fetus subsp. fetus.

I'm not sure if it's a bug or if it is just the NCBI servers. I was wondering if there's a way to resume the script from where it last left off to move past the error.

Edit: I have continued to run the meta.retrieval proteome download and it does stop at different points, which I believe is due to the NCBI server disconnecting, but the farthest I have gotten is this point here.

Starting retrieval of Escherichia coli O157:H7 str. Sakai ...
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/'
                                cannot be reached. Are you connected to the
                                internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/865/GCA_000008865.1_ASM886v1/GCA_000008865.1_ASM886v1_protein.faa.gz' currently available?

I have run this for a week now and when running for the longest time it only gets here and disconnects. Hopefully this helps you diagnose my earlier issue.

Thanks Again, Jacob MacWilliams

HajkD commented 7 years ago

Hi Jacob,

Thank you so much for running some trials. Your results actually help me a lot with troubleshooting.

I think that there is a maximum number of download queries per IP address enforced on the NCBI server side. I will try to find a way to work around this constraint.

Kind regards, Hajk

aappaagh commented 6 years ago

Hi Hajk,

First off, thanks for the useful library. Unfortunately, some of the issues in this thread seem to remain unresolved.

First bug:

The command meta.retrieval(db = "refseq", type = "proteome", kingdom = "protozoa") always stumbles on the same organism with the error:

Starting proteome retrieval of 'Entamoeba histolytica HM-1:IMSS' from refseq ...

Proteome download is completed!
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/208/925/GCF_000208925.1_JCVI_ESG2_1.0/GCF_000208925.1_JCVI_ESG2_1.0protein.faa.gz' currently available?
In addition: Warning message:
It seems like there are some files in download folder that are neither pre-downloaded species files nor doc or md5checksum files.

The file is there and I can download it manually. So, it must be something else stopping the download.

Second bug:

When downloading bacterial proteomes with the command meta.retrieval(db = "refseq", type = "proteome", kingdom = "bacteria") it stumbles on:

Starting proteome retrieval of 'Campylobacter fetus subsp. fetus' from refseq ...

|====================================================================================================| 100% 35 MB
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'NA/NAprotein.faa.gz' currently available?
In addition: Warning message:
It seems like there are some files in download folder that are neither pre-downloaded species files nor doc or md5checksum files.

This is clearly a parsing bug.

Can you please check them out? Thanks, Alexey

aappaagh commented 6 years ago

One more bug to report. The command meta.retrieval(db = "refseq", type = "proteome", kingdom = "viral")

essentially fails to retrieve the majority of viral proteomes. Here is the typical response:

Starting proteome retrieval of 'Aeromonas virus phiO18P' from refseq ...

----------> No reference proteome or representative proteome was found for 'Aeromonas virus phiO18P'. Thus, download for this organism has been omitted. Have you tried to specify getProteome(db = 'refseq', organism = 'Aeromonas virus phiO18P' , reference = FALSE) ? Alternatively, you can retrieve proteomes using the NCBI accession ID or NCBI Taxonomy ID. See '?'is.genome.available' for examples.

I randomly reviewed a few viruses and they all have a _protein.faa.gz file in the latest assembly. Something must be wrong with the parsing of the assembly_summary files...

HajkD commented 6 years ago

Hi @aappaagh

Thank you for making me aware of some new issues.

Regarding issue number 1:

It seems that a new organism was uploaded to NCBI Protozoa which doesn't follow the scientific naming convention. In the scientific naming nomenclature special characters are not defined: see Entamoeba histolytica HM-1:IMSS. The special characters - and : mess up the parsing. I will have a look at this and try to fix this.

Regarding issue number 2:

Have you tried re-running the command? Does it stop at the same stage? I will have a look at this as well.

Regarding issue number 3:

Have you tried specifying the argument reference = FALSE ?

meta.retrieval(db = "refseq", type = "proteome", kingdom = "viral", reference = FALSE)

Many thanks, Hajk

aappaagh commented 6 years ago

Hi Hajk, Thanks for the quick response. Per your responses:

  1. Yes, it looks like you need to add parsing rules for special characters...
  2. Yes, many times. The script stops exactly on the same organisms, after all attempts.
  3. That seems to have helped. Thanks!

Alexey

nash-claire commented 5 years ago

Hi Hajk,

I noticed this thread has gone quiet, but I seem to be replicating the problem of connecting to FTP to download Homo sapiens genome files. The commands I've tried are as follows.

With ensembl...

getGenome(db = "ensembl", organism = "Homo sapiens", path = file.path("pathtofile","hsGenome"))

Error was as follows.......

Starting genome retrieval of 'Homo sapiens' from ensembl ... Error in curl::curl_fetch_memory(url) : Timeout was reached: FTP response timeout

Trying with ncbi.......

getGenome(db = "genbank", organism = "Homo sapiens", path = file.path("pathtofile","hsGenome"))

Error.....

trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/assembly_summary.txt'
Content type 'unknown' length 128795 bytes (125 KB)

Completed!
Now continue with species download ...
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.27_GRCh38.p12/GCA_000001405.27_GRCh38.p12_genomic.fna.gz' currently available?

I'm not trying to download fancy species genomes or anything here so not sure what's wrong.

Just to point out too, after reading this thread I removed biomartr and then reinstalled the development version as described above, but it didn't help. I've also tried running these 2 commands multiple times, so I'm not sure it's an issue with the servers on their end. Any thoughts?

HajkD commented 5 years ago

Hi Claire,

I am sorry to hear that you are having troubles downloading genomes.

I ran your command and it worked perfectly fine. Could you please give me some specifications about the system you use to run biomartr (MacOS, Linux, Windows - on a server or PC) so that I can help you troubleshoot?

Usually, this happens when a firewall blocks internet access. It seems that curl itself is unable to establish a stable connection to the NCBI server.

Many thanks and best wishes, Hajk

nash-claire commented 5 years ago

Hi Hajk,

Thank you so much for getting back and sorry for my late reply. Christmas has been busy!

So I'm running biomartr on Ubuntu Linux 18.04 (Bionic Beaver). I have R version 3.4.4 and I'm using RStudio version 1.1.463 for Linux x86_64.

Do you need any more specific details or command printouts?

Thanks again!


-- Kind Regards,

Claire Nash, PhD

Research Scientist


HajkD commented 5 years ago

Hi Claire,

perfect. Many thanks.

Have you also ruled out the possibility that your server has a firewall configuration that doesn't allow wget or curl access to NCBI?

E.g. does the following shell command work for you?

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.fna.gz

Have you tried running the biomartr command from a different network (e.g. from home on your home laptop)?

This way I will be able to assess what might be the problem.

Many thanks and best wishes, Hajk

nash-claire commented 5 years ago

Hi Hajk,

I appreciate you getting back to me. Just to confirm, I've been running all of this code to date on my laptop at home (I don't have access to a server or anything like that).

The wget command you suggested works fine for me. I also tried this from the terminal

curl -o ~/assembly_summary.txt "ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/assembly_summary.txt"

This also worked and downloaded the summary text file no problem. I'm still having the same issue with the ensembl ftp too.

A friend who works with linux and curl for a living suggested I disable firewalls using

sudo ufw disable

which I did and it didn't help. I've also disabled apparmor to no avail. He suggested I might have an internet connection stability problem as it looks as though the point of failure is part way through rather than a complete failure. For this reason I'm going to try again using a different internet connection/service tomorrow and see if it helps.

I'd love to hear if you have any other ideas though!

nash-claire commented 5 years ago

Hi Hajk (and for anyone else reading),

After all the trouble, I determined that I had an internet connection stability problem. When I used a wired Ethernet connection to the router, the function worked fine.

I'm sorry to have caused the trouble. I'm still a newbie with all this!

But hey, I hope it maybe helps someone else avoid the same trivial problem some day!!!

Thanks for the help!

HajkD commented 5 years ago

Hi Claire,

I am very happy that the issue is resolved now and no worries at all :)

Many thanks for taking the time to write in such great detail for the benefit of future readers.

Kind wishes, Hajk

eggrandio commented 5 years ago

Hi HajkD,

Thanks for developing this package, it is really useful!

I am finding a problem downloading Arabidopsis thaliana CDS from GenBank

test = getCDS(db = "genbank", organism = "3702", reference = F)

Starting CDS retrieval of '3702' from genbank ...

CDS download is completed!
Warning: 31 parsing failures.
row col  expected    actual file
  1  -- 3 columns 2 columns '_ncbi_downloads/CDS/3702_md5checksums.txt'
  2  -- 3 columns 2 columns '_ncbi_downloads/CDS/3702_md5checksums.txt'
  3  -- 3 columns 2 columns '_ncbi_downloads/CDS/3702_md5checksums.txt'
  4  -- 3 columns 2 columns '_ncbi_downloads/CDS/3702_md5checksums.txt'
  5  -- 3 columns 2 columns '_ncbi_downloads/CDS/3702_md5checksums.txt'
... ... ......... ......... ...........................................
See problems(...) for more details.

Checking md5 hash of file: _ncbi_downloads/CDS/3702_md5checksums.txt ...
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_cds_from_genomic.fna.gz' currently available?

I am able to download the ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_cds_from_genomic.fna.gz file manually, so there must be something going wrong elsewhere.
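For context, the md5 step reported in the log compares the hash of a downloaded file with the value listed in a checksum file. A minimal sketch of that kind of comparison, using base R's tools::md5sum(), is shown below. The helper `md5_matches` and the two-column table layout (hash, file name) are assumptions for illustration, not biomartr's code.

```r
# Hypothetical helper: compare a file's md5 hash with the expected hash
# listed for it in a two-column checksum table (md5, file).
md5_matches <- function(path, checksums) {
  expected <- checksums$md5[checksums$file == basename(path)]
  length(expected) == 1 && unname(tools::md5sum(path)) == expected
}

# Self-contained demonstration with a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines("ACGT", tmp)
checksums <- data.frame(
  md5  = unname(tools::md5sum(tmp)),
  file = basename(tmp),
  stringsAsFactors = FALSE
)
ok <- md5_matches(tmp, checksums)

# Changing the file content should make the check fail
writeLines("TTTT", tmp)
changed_ok <- md5_matches(tmp, checksums)
```

A mismatch usually points to an incomplete or corrupted download, which is a different failure from the server being unreachable.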

Thanks !

HajkD commented 5 years ago

Hi @eggrandio

Many thanks for contacting me and I am glad to hear that you find biomartr useful for your research.

I just ran the command and it worked perfectly fine:

getCDS(db = "genbank", organism = "3702", reference = F)
Starting CDS retrieval of '3702' from genbank ...

It seems that this is the first time you run this command for genbank.
Thus, 'assembly_summary.txt' files for all kingdoms will be retrieved from genbank. 
Don't worry this has to be done only once if you don't restart your R session.

trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt'
Content type 'unknown' length 1077794 bytes (1.0 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt'
Content type 'unknown' length 66082664 bytes (63.0 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt'
Content type 'unknown' length 1339745 bytes (1.3 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/invertebrate/assembly_summary.txt'
Content type 'unknown' length 306065 bytes (298 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/assembly_summary.txt'
Content type 'unknown' length 269796 bytes (263 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/assembly_summary.txt'
Content type 'unknown' length 213508 bytes (208 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/assembly_summary.txt'
Content type 'unknown' length 188257 bytes (183 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/assembly_summary.txt'
Content type 'unknown' length 151045 bytes (147 KB)
==================================================

Completed!
Now continue with species download ...
CDS download of 3702 is completed!
Checking md5 hash of file: _ncbi_downloads/CDS/3702_md5checksums.txt ...
The md5 hash of file '_ncbi_downloads/CDS/3702_md5checksums.txt' matches!
The genomic CDS of '3702' has been downloaded to '_ncbi_downloads/CDS' and has been named '3702_cds_from_genomic_genbank.fna.gz' .
[1] "_ncbi_downloads/CDS/3702_cds_from_genomic_genbank.fna.gz"

Could you please provide me more information about your system so that I might be able to help troubleshoot?

Also, have you tried using the most recent version of biomartr available here from GitHub?

I hope this helps?

Cheers, Hajk

eggrandio commented 5 years ago

Hi @HajkD,

Thanks for your quick reply! I tried it again and it worked, maybe there was some problem with the NCBI ftp server?

Also I had another question, related to downloading CDS datasets.

For example, I am trying to get CDS from species that do not have an annotated genome in NCBI, and also do not seem to have a CDS dataset so I get this error message:

test = getCDS(db = "genbank", organism = "4100", reference = F)

Starting CDS retrieval of '4100' from genbank ...

Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/723/945/GCA_000723945.1_Ni_ben/GCA_000723945.1_Ni_ben_cds_from_genomic.fna.gz' currently available?

However, when I do a "manual" search through the NCBI, I am able to find mRNAs with annotated CDSs, for example this one:

https://www.ncbi.nlm.nih.gov/nuccore/MH939184.1?report=genbank

Is there any way of retrieving all the sequences annotated as CDS if there is no NCBI file?

Thanks again!

johanneswerner commented 3 years ago

> Unfortunately, the NCBI or Ensembl databases don't provide the family information for bacteria. So I cannot parse it from the information they provide. The only thing that could be implemented in biomartr is a sub filtering where I would have to retrieve bacteria family member names from a different database. Which database do you usually use in the bacteria community?

Is it possible to use the taxid for filtering? With taxonkit, for example, it is possible to create a table of taxids and taxonomic ranks. Such a table could be used to filter on each taxonomic rank of interest (or on all subgroups of a specific taxon).
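The suggestion can be sketched in base R as a join followed by a filter. Everything below is a made-up stand-in: `assemblies` mimics rows of an assembly summary and `lineage` mimics a taxid-to-rank table exported from a taxonomy tool, and none of it reflects an existing biomartr feature.

```r
# Toy stand-ins (all values made up for illustration)
assemblies <- data.frame(
  assembly_accession = c("GCA_A", "GCA_B", "GCA_C"),
  taxid = c(101L, 102L, 103L),
  stringsAsFactors = FALSE
)
lineage <- data.frame(
  taxid  = c(101L, 102L, 103L),
  family = c("Enterobacteriaceae", "Streptomycetaceae", "Enterobacteriaceae"),
  stringsAsFactors = FALSE
)

# Attach the rank columns to each assembly, then filter on the rank of interest
annotated <- merge(assemblies, lineage, by = "taxid")
selected <- annotated[annotated$family == "Enterobacteriaceae", ]
```

The same pattern would extend to any rank column present in the lineage table, such as order or genus.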

HajkD commented 3 years ago

Very cool idea!!! I saw that taxonkit has a Python wrapper, pytaxonkit. Is there something similar available for R? Would it be possible to construct some small examples for me so that I can learn the underlying notation and then embed it into biomartr? Please also feel free to send me a pull request; I am happy to review it and add it to the main branch. Many thanks!

Roleren commented 1 year ago

This issue thread is too generic, and all relevant "bugs" have been fixed.

I suggest all commentators create a new issue if still needed.

This issue can now be closed.

HajkD commented 1 year ago

Agreed! Let's focus on the individual feature requests that may still be open after our new version release.

Many thanks, Hajk