pirovc / genome_updater

Bash script to download/update snapshots of files from NCBI genomes repository (refseq/genbank) with track of changes and without redundancy
MIT License

Empty *faa|missing files #95

Open mwylerCH opened 6 months ago

mwylerCH commented 6 months ago

Dear Dev team, your tool is pretty handy. However, I noticed that it tries to download files that don't exist. In particular, when I try to download amino acid sequences (e.g. GCA_016840515.1_ASM1684051v1_protein.faa.gz), I get a file and no error. However, when I look into it, I find an XML document with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Object not found!</title>
<link rev="made" href="mailto:%5bno%20address%20given%5d" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/
    body { color: #000000; background-color: #FFFFFF; }
    a:link { color: #0000CC; }
    p, address {margin-left: 3em;}
    span {font-size: smaller;}
/*]]>*/--></style>
</head>

<body>
<h1>Object not found!</h1>
<p>

    The requested URL was not found on this server.

    If you entered the URL manually please check your
    spelling and try again.

</p>
<p>
If you think this is a server error, please contact
the <a href="mailto:%5bno%20address%20given%5d">webmaster</a>.

</p>

<h2>Error 404</h2>
<address>
  <a href="/">ftp.ncbi.nlm.nih.gov</a><br />
  <span>Apache</span>
</address>
</body>
</html>

Of course it would be easy to identify these, but I think it's an issue when I'm filtering genomes with, for example, -A "species:1". Please correct me if I'm missing something. Greetings
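A minimal sketch of how such files could be identified after a run, assuming the downloads sit under one output directory and that a saved 404 page will fail gzip's integrity test. The function name `find_invalid_gz` is hypothetical and not part of genome_updater.

```shell
# Hypothetical helper, not part of genome_updater.
# Prints every *.gz file under directory $1 that is not a valid gzip archive
# (for example, an HTML "Object not found!" page saved under a .gz name).
find_invalid_gz() {
    find "$1" -type f -name '*.gz' -exec sh -c '
        for f do
            # gzip -t exits non-zero when the file is not valid gzip data
            gzip -t "$f" 2>/dev/null || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

It could be called as `find_invalid_gz "$TEMPDIR/AA_${NAME}"` to list the suspicious files before deciding whether to delete them.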

pirovc commented 6 months ago

Hi, thanks for reporting. I will take a look at this issue to see if there's a bug. Did you try the -m parameter to check file integrity? I believe that would solve your issue.

mwylerCH commented 6 months ago

Hi, yes. I forgot to include the command (run with v0.6.3):

NAME=bacteria
$HOME/genome_updater.sh \
   -A "species:1" -d "genbank" -g "$NAME" -f "protein.faa.gz" -o "$TEMPDIR/AA_${NAME}" -t 20 -m -L curl
mwylerCH commented 6 months ago

I can further confirm the issue with genomic DNA (GCA_037198385.1_ASM3719838v1_genomic.fna.gz). The sequence has a "suppressed" status on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037198385.1/). Command:

$HOME/genome_updater.sh \
   -A "species:1" -d "genbank" -g "viral" \
   -l "complete genome" -f "genomic.fna.gz" \
   -o "$TEMPDIR/virus_RefSeq" -t 50 -m -L curl

File content: the same "Object not found!" (Error 404) XHTML page from ftp.ncbi.nlm.nih.gov as shown in my first comment.

pirovc commented 6 months ago

Indeed, if the MD5 file is not available, genome_updater keeps the downloaded file. For now, you can change the following line to return 1, and genome_updater should then skip those files. I will fix this bug in the next release.

https://github.com/pirovc/genome_updater/blob/78c3fb546cdca726b333900f5319ab03e03681e4/genome_updater.sh#L608
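The suggested behaviour (treat a missing checksum as a failure instead of keeping the file) can be sketched roughly as follows. This is an illustrative stand-in, not the genome_updater source: the function name `verify_md5` and its arguments are hypothetical, and it assumes GNU `md5sum` is available.

```shell
# Hypothetical stand-in for the integrity check, not the genome_updater code.
# $1: path to the downloaded file
# $2: expected MD5 taken from the assembly's checksum file (may be empty)
# Returns 0 only if a checksum is available and matches.
verify_md5() {
    file="$1"
    expected="$2"
    if [ -z "$expected" ]; then
        # No checksum available: reject the file rather than keep it
        return 1
    fi
    # md5sum prints "<hash>  <file>"; keep only the hash
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

With this shape, a caller can delete or skip any file for which `verify_md5` returns non-zero, so a saved 404 page without a matching checksum would not be kept.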

mwylerCH commented 6 months ago

Sorry that I'm only replying now, but I still get the same error, even with:

if [ -z "${ftp_md5}" ]; then
    echolog "${file_name} MD5checksum file not available [${md5checksums_url}] - FILE KEPT" "0"
    return 1
else

This was run with the same command as stated above.