guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License

Broken download when supplementary_files is empty or contains invalid URLs #38

Closed antonkulaga closed 5 years ago

antonkulaga commented 6 years ago

When I download GSM supplementary files with:

gsm = cast(GSM, GEOparse.get_GEO("GSM1944823", destdir="/tmp"))
files = gsm.download_supplementary_files("/tmp", False, "antonkulaga@gmail.com")

I get the following error:

13-Feb-2018 18:02:51 DEBUG utils - Directory /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain already exists. Skipping.
13-Feb-2018 18:02:51 INFO utils - Downloading NONE to /tmp/Supp_GSM1944823_MG_UKJ_30_190214_1HS_brain/NONE
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/GEOTypes.py", line 443, in download_supplementary_files
    utils.download_from_url(metavalue[0], download_path)
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/GEOparse/utils.py", line 114, in download_from_url
    destination_path))
  File "/home/antonkulaga/rna-seq/containers/geoparse/env/lib/python3.6/site-packages/wgetter.py", line 272, in download
    url = opener.open(link)
  File "/usr/lib/python3.6/urllib/request.py", line 511, in open
    req = Request(fullurl, data)
  File "/usr/lib/python3.6/urllib/request.py", line 329, in __init__
    self.full_url = url
  File "/usr/lib/python3.6/urllib/request.py", line 355, in full_url
    self._parse()
  File "/usr/lib/python3.6/urllib/request.py", line 384, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'NONE'

guma44 commented 6 years ago

Hi, the problem here is that the entry does not have supplementary files, i.e. the list of supplementary files is composed of one entry: 'NONE'. I see you do not want to download SRA, so this is not an issue for you.

antonkulaga commented 6 years ago

@guma44 in that case the behavior is not consistent. If a GSM has no supplementary files, it should download nothing and return an empty array of paths. My use case is using GEOparse to download all available files from a GSE for my RNA-Seq pipeline. I do not think that wrapping everything in try-except blocks or adding a lot of if-else checks for supplementary files is efficient. Returning an empty array looks far more logical...
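
The behavior requested here could be sketched as a filter applied before any download is attempted. This is only an illustration: `supplementary_urls` is a hypothetical helper, not part of GEOparse, and it assumes the metadata is a dict mapping keys such as `supplementary_file_1` to lists of strings, with `NONE` as the placeholder seen in the log above.

```python
from typing import Dict, List


def supplementary_urls(metadata: Dict[str, List[str]]) -> List[str]:
    """Return only the supplementary-file entries that look like real URLs."""
    urls = []
    for key, values in metadata.items():
        if "supplementary_file" not in key:
            continue
        for value in values:
            # Skip placeholders such as 'NONE' and anything without a scheme.
            if value.strip().upper() == "NONE" or "://" not in value:
                continue
            urls.append(value)
    return urls


# Hypothetical metadata, shaped like the case in this report.
empty = supplementary_urls({"supplementary_file_1": ["NONE"]})
```

With a filter like this, a sample without usable supplementary files yields an empty list and no download is attempted.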

guma44 commented 6 years ago

And this is exactly what the try-except block does. As can be seen, no one can guarantee that a user will not put some nonsense in the metadata. Thus, the code tries to download it, but if it cannot, it adds nothing to the returned dictionary. The result is an error log message and an empty dictionary. I can actually change Exception to ValueError to be stricter about which error is caught.
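
A minimal sketch of the narrower handling mentioned here, not GEOparse's actual code: `download_all` and its `fetch` callback are hypothetical names. It relies on the fact, visible in the traceback above, that `urllib.request.Request` raises `ValueError` for a string like `NONE`.

```python
import logging
from typing import Callable, Dict, Iterable
from urllib.request import Request

logger = logging.getLogger(__name__)


def download_all(urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Map each usable URL to the path returned by `fetch`; skip invalid ones."""
    downloaded = {}
    for url in urls:
        try:
            Request(url)  # raises ValueError for strings such as 'NONE'
            downloaded[url] = fetch(url)
        except ValueError as err:
            # Only ValueError is caught; other exceptions still propagate.
            logger.error("Skipping %r: %s", url, err)
    return downloaded


# `fetch` is stubbed here so the sketch runs without network access.
result = download_all(["NONE"], fetch=lambda url: "/tmp/" + url.rsplit("/", 1)[-1])
```

Catching only `ValueError` keeps network or filesystem failures visible to the caller instead of hiding them behind a log line.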

antonkulaga commented 6 years ago

It is not clear why it does not download at least the SRA when I tell it to download supplementary files. An SRA does exist for GSM1944808:

import os
import re
from pprint import pprint
from typing import *

import GEOparse
from GEOparse import *
from GEOparse import utils
from functional import *

gsm = cast(GSM, GEOparse.get_GEO("GSM1944808", destdir="/tmp"))

filetype = 'sra'
keep_sra = True
fastq_dump_options = {
    'skip-technical': None,
    'clip': None,
    'split-files': None,
    'readids': None,
    'read-filter': 'pass',
    'dumpbase': None,
    'gzip': None
    }
sra_kwargs = {
    "keep_sra": keep_sra,
    'filetype': filetype,
    "fastq_dump_options": fastq_dump_options
}
directory_path = "~/test"
gsm.download_supplementary_files(directory_path, True, "antonkulaga@gmail.com", sra_kwargs)

I get

Downloading NONE to /pipelines/text/Supp_GSM1944808_MG_UKJ_15_190214_1HS_brain/NONE

while I at least expect the SRA to be downloaded, as I said download_sra=True

guma44 commented 6 years ago

SRA, in this case, is not listed in the supplementary files. It is listed only as a relation, not as a supplementary file. This is a separate issue, and it would be worth checking relations for that, but for now this functionality does not exist. Overall, it does not say why it did not download SRA because, unfortunately, there is no SRA :/. Anyway, I think I know how to solve it.
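
Checking relations, as suggested here, might look like the sketch below. It assumes the sample metadata has a `relation` key whose values look like `SRA: <url>`; `sra_relations` is a hypothetical helper, and the accessions in the example are placeholders.

```python
from typing import Dict, List


def sra_relations(metadata: Dict[str, List[str]]) -> List[str]:
    """Return the targets of all 'SRA: ...' entries under the 'relation' key."""
    targets = []
    for relation in metadata.get("relation", []):
        # Split only at the first colon so the URL itself stays intact.
        kind, _, target = relation.partition(":")
        if kind.strip().upper() == "SRA" and target.strip():
            targets.append(target.strip())
    return targets


# Illustrative metadata only; both accessions are made up.
sample = {
    "relation": [
        "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN00000000",
        "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX0000000",
    ],
    "supplementary_file_1": ["NONE"],
}
links = sra_relations(sample)
```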

antonkulaga commented 6 years ago

I've tried to download SRA with

path = gsm.download_SRA("antonkulaga@gmail.com", "/pipelines/test", **sra_kwargs)

The SRA was downloaded (and saved as Supp_GSM1944808_MG_UKJ_15_190214_1HS_brain), but path is []. I think the function should return the path to the downloaded file(s) instead of an empty array.

guma44 commented 6 years ago

Yes, that is the bug. download_SRA is working but not as expected. I will fix it.
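
While the return value is empty, a caller could recover the paths by listing the download directory afterwards. This is a workaround sketch only: `list_downloaded` is a hypothetical helper, the suffix list is a guess, and it assumes the files land somewhere under the directory passed to `download_SRA`.

```python
from pathlib import Path
from typing import List, Sequence


def list_downloaded(
    directory: str,
    suffixes: Sequence[str] = (".sra", ".fastq", ".fastq.gz"),
) -> List[str]:
    """Return sorted paths of files under `directory` ending in one of `suffixes`."""
    root = Path(directory).expanduser()  # also handles paths such as "~/test"
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.name.endswith(tuple(suffixes))
    )
```

For the call above, this would be used as `paths = list_downloaded("/pipelines/test")` once `download_SRA` has finished.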