Open mwylerCH opened 6 months ago
Hi, thanks for reporting. I will look into this issue to see if there's a bug. Did you try the -m parameter to check file integrity? I believe that would solve your issue.
Hi, yes, I did. I forgot to include the command (with v0.6.3):
NAME=bacteria
$HOME/genome_updater.sh \
-A "species:1" -d "genbank" -g "$NAME" -f "protein.faa.gz" -o "$TEMPDIR/AA_${NAME}" -t 20 -m -L curl
I can further confirm the issue with genomic DNA (GCA_037198385.1_ASM3719838v1_genomic.fna.gz). The sequence has a "suppressed" status on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037198385.1/). Command:
$HOME/genome_updater.sh \
-A "species:1" -d "genbank" -g "viral" \
-l "complete genome" -f "genomic.fna.gz" \
-o "$TEMPDIR/virus_RefSeq" -t 50 -m -L curl
File content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Object not found!</title>
<link rev="made" href="mailto:%5bno%20address%20given%5d" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/
body { color: #000000; background-color: #FFFFFF; }
a:link { color: #0000CC; }
p, address {margin-left: 3em;}
span {font-size: smaller;}
/*]]>*/--></style>
</head>
<body>
<h1>Object not found!</h1>
<p>
The requested URL was not found on this server.
If you entered the URL manually please check your
spelling and try again.
</p>
<p>
If you think this is a server error, please contact
the <a href="mailto:%5bno%20address%20given%5d">webmaster</a>.
</p>
<h2>Error 404</h2>
<address>
<a href="/">ftp.ncbi.nlm.nih.gov</a><br />
<span>Apache</span>
</address>
</body>
</html>
Indeed, if the MD5 file is not available, genome_updater keeps the file. For now, you can change the following line to return 1 and genome_updater should skip those files. I will fix this bug in the next release.
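In the meantime, an extra check after the download can catch these files independently of the MD5 step. The sketch below is only an illustration of that idea: `check_gz_file` is a hypothetical helper, not a function from genome_updater, and it assumes the downloaded files are expected to be gzip-compressed (as with `protein.faa.gz` and `genomic.fna.gz`). It returns 1 for anything that is not a readable gzip stream, such as the saved 404 page.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of genome_updater.
# Returns 0 if the file is a readable gzip stream, 1 otherwise.
check_gz_file() {
    local file="$1"
    # gzip -t runs a full decompression test without writing output;
    # it fails on non-gzip input (e.g. an HTML error page) and on truncated files.
    if gzip -t < "$file" 2>/dev/null; then
        return 0
    fi
    echo "Not a valid gzip file: $file" >&2
    return 1
}

# Example call (path is illustrative only):
# check_gz_file "$TEMPDIR/AA_bacteria/some_file_protein.faa.gz" || echo "skip it"
```

Because `gzip -t` reads the whole stream, this also flags partially downloaded files, at the cost of decompressing each file once.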
Dear Dev team, your tool is pretty handy. However, I noticed that it tries to download files that don't exist. In particular, when I try to download amino acid sequences (e.g. GCA_016840515.1_ASM1684051v1_protein.faa.gz), I get a file and no error. When I look into it, though, I find an XML document with the following content:
Of course it would be easy to identify these, but I think it's an issue when I'm filtering genomes with, for example, -A "species:1". Please correct me if I'm missing something. Greetings
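For reference, identifying such files after a run could look like the sketch below. It is a rough illustration, not something genome_updater provides: `list_fake_gz` is a made-up name, and the check simply assumes that a real `.gz` file starts with the gzip magic bytes `1f 8b`, whereas the saved error page starts with `<?xml`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of genome_updater.
# Prints every *.gz file under a directory whose first two bytes are not
# the gzip magic number 0x1f 0x8b (for example a saved 404 page).
list_fake_gz() {
    local dir="$1" file magic
    find "$dir" -type f -name '*.gz' | sort | while IFS= read -r file; do
        # Read the first two bytes as hex and strip all whitespace.
        magic=$(head -c 2 "$file" | od -An -tx1 | tr -d '[:space:]')
        if [ "$magic" != "1f8b" ]; then
            printf '%s\n' "$file"
        fi
    done
}

# Example call (directory is illustrative only):
# list_fake_gz "$TEMPDIR/AA_bacteria"
```

This only inspects the header, so it is fast on large output folders but will not notice a truncated download that still begins with valid gzip bytes.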