Phage genome download from ncbi

ovpop100 commented 6 years ago

Hey,

I realized that downloading the phage genomes from NCBI (eutils.ncbi.nlm.nih.gov) does not work with the Perl module LWP::Simple. I'm not sure why; is it because of https? I worked around the problem by replacing get($url) with $xml=qx{wget --quiet --output-document=- "$url"};. For now it works, but I'm wondering whether anyone else has the same problem.
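The wget workaround can also be written as a small shell helper that checks the result instead of trusting it. This is only a sketch: fetch_checked is a made-up name, not part of ProphET, and treating an empty body as an error is an assumption about what the caller wants.

```shell
# Sketch: fetch a URL with wget and fail loudly instead of returning nothing.
# fetch_checked is a hypothetical helper name, not part of ProphET.
fetch_checked() {
    url=$1
    # Same wget flags as the workaround: quiet, body written to stdout.
    if ! body=$(wget --quiet --output-document=- "$url"); then
        echo "ERROR: wget failed for $url" >&2
        return 1
    fi
    # Assumption: an empty body is also a failure for this use case.
    if [ -z "$body" ]; then
        echo "ERROR: empty response from $url" >&2
        return 1
    fi
    printf '%s\n' "$body"
}
```

It could be called as xml=$(fetch_checked "$url") || exit 1, so that a failed download stops the script at that point.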

best

Ovidiu

gustavo11 commented 6 years ago

Hi @ovpop100, thanks for detecting this and for providing the solution.

No, no one has reported this problem yet. But it's good to know, so we can implement your recommended fix.

I wonder: is there anything special about the server/machine or configuration where you're running ProphET that might have caused this (firewall, etc.)?

Have you installed the Mozilla::CA module? There were some issues with LWP::Simple when Mozilla::CA was not installed.
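One quick way to answer that from the command line is to ask perl to load each module and look at the exit status. The sketch below assumes perl is on the PATH; has_perl_module is a hypothetical helper name, and LWP::Protocol::https is included because it is the module LWP loads for https URLs.

```shell
# Sketch: report whether the Perl modules relevant to https downloads load.
# has_perl_module is a hypothetical helper name.
has_perl_module() {
    # "perl -M<Module> -e 1" exits non-zero if the module cannot be loaded
    # (or if perl itself is not available).
    perl -M"$1" -e 1 >/dev/null 2>&1
}

for module in LWP::Simple LWP::Protocol::https Mozilla::CA; do
    if has_perl_module "$module"; then
        echo "$module: installed"
    else
        echo "$module: MISSING"
    fi
done
```

Any module reported as MISSING would need to be installed into the same Perl that runs ProphET.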

ovpop100 commented 6 years ago

Hey, no, I don't have Mozilla::CA. I will try it and let you know if it makes a difference. I observed that LWP::Simple does not work with any https URL: I checked it with several web pages and it failed every time. It works fine with http URLs.

cheers

gustavo11 commented 6 years ago

Some additional information...

We maintain a continuous integration app that tests every build of ProphET. That means the latest versions of the CPAN packages are tested for every build of ProphET in a virtual machine. At least in that VM configuration, the combination of the latest LWP::Simple with Mozilla::CA works without any issue. But I'm very curious about the issue that you have reported. Please tell me if you find out more...

microbioticajon commented 5 years ago

Hi All,

I had a similar problem. After a little hacking with LWP::UserAgent and observing the response, it looks as if LWP::Simple does not handle https out of the box. Installing LWP::Protocol::https seemed to resolve the issue for me (it requires the libssl-dev headers to compile).

EDIT: Mozilla::CA covers this dependency, so ignore the above :)

As a side note, LWP::Simple does not capture the response and will not natively raise an exception when the request fails. If you could swap it out for LWP::UserAgent and raise exceptions on failed requests, it would a) make debugging easier and b) prevent the INSTALL.pl script from silently building empty databases.

Hope this helps. J

linsalrob commented 4 years ago

Downloading the databases using the INSTALL.pl script does not work.

The first part of the error output is:

Looking for required programs in the enviroment PATH...
        Found EMBOSS extractseq: /home3/redwards/anaconda3/envs/prophet/bin/extractseq
        Found blastall: /home3/redwards/anaconda3/envs/prophet/bin/blastall
        Found blastall: /home3/redwards/anaconda3/envs/prophet/bin/formatdb
        Found bedtools: /home3/redwards/anaconda3/envs/prophet/bin/bedtools
Saving program paths in ./config.dir/Third_party_programs_paths.log ...
Looking for required Perl libraries...
Downloading GFFLib ...
Cloning into 'UTILS.dir/GFFLib'...
Creating database directory...
Creating database temp directory ...
Downloading Phage sequences ...
Downloading Myoviridae from Genbank (NCBI) ...
Retrieving representatives of virus family Myoviridae, TaxID 10662 ...
Number of records in Genome database: 
Number of genomes under TaxID 10662: 0
Extracting segments of polyproteins and coding sequences...
Error: Failed to open filename '10662.gb'
Error: Unable to read sequence '10662.gb'
Died: extractfeat terminated: Bad value for '-sequence' with -auto defined
Can't open 10662.sense: No such file or directory.
Error: Failed to open filename '10662.gb'
Error: Unable to read sequence '10662.gb'
Died: extractfeat terminated: Bad value for '-sequence' with -auto defined
Can't open 10662.antisense: No such file or directory.
Extracting all other protein coding features ...
Error: Failed to open filename '10662.gb'
Error: Unable to read sequence '10662.gb'
Died: extractfeat terminated: Bad value for '-sequence' with -auto defined
grep: 10662.all_features: No such file or directory
Retrieving the featured product for each CDS or mat_peptide ...
Translating genes ...
Error: Unable to read sequence '10662.fasta'
Died: transeq terminated: Bad value for '-sequence' with -auto defined
Removing * representing STOP codons ...

------------- EXCEPTION -------------
MSG: Could not read file '10662.prot': No such file or directory
STACK Bio::Root::IO::_initialize_io /home3/redwards/anaconda3/envs/prophet/lib/perl5/site_perl/5.22.0/Bio/Root/IO.pm:270
STACK Bio::SeqIO::_initialize /home3/redwards/anaconda3/envs/prophet/lib/perl5/site_perl/5.22.0/Bio/SeqIO.pm:499
STACK Bio::SeqIO::fasta::_initialize /home3/redwards/anaconda3/envs/prophet/lib/perl5/site_perl/5.22.0/Bio/SeqIO/fasta.pm:87
STACK Bio::SeqIO::new /home3/redwards/anaconda3/envs/prophet/lib/perl5/site_perl/5.22.0/Bio/SeqIO.pm:375
STACK Bio::SeqIO::new /home3/redwards/anaconda3/envs/prophet/lib/perl5/site_perl/5.22.0/Bio/SeqIO.pm:421
STACK toplevel ../../UTILS.dir/fasta2line:18
-------------------------------------

I expect this is because of the https issue, though NCBI moved to https back on Nov 9th, 2016.
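The cascade in the log starts when 10662.gb is never written, and every later step then fails on the missing file. A guard like the one below, run right after the download, would stop at the first problem. It is a sketch only: require_nonempty is a hypothetical helper, and where INSTALL.pl would call such a check is an assumption.

```shell
# Sketch: stop the pipeline if a downloaded file is missing or empty.
# require_nonempty is a hypothetical helper name, not part of ProphET.
require_nonempty() {
    file=$1
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$file" ]; then
        echo "ERROR: '$file' is missing or empty; aborting" >&2
        return 1
    fi
}
```

For the run above, require_nonempty 10662.gb || exit 1 placed before the extractfeat step would have reported the failed download directly.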

jaumlrc / ProphET

Phage genome download from ncbi #30