guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License

fastq-dump parameters are not optimal #26

Closed antonkulaga closed 7 years ago

antonkulaga commented 7 years ago

I run fastq-dump with the following parameters:

 /opt/sratoolkit/fastq-dump --skip-technical --gzip --readids --read-filter pass --dumpbase --split-files --clip ${file}

(https://edwards.sdsu.edu/research/fastq-dump/ gives good explanations of why some of them are needed.) By default, however, GEOparse uses

cmd = "fastq-dump --split-files --gzip %s --outdir %s %s"

That creates some problems. For instance, if I do not pass --readids and use a paired SRA, I get two files whose read IDs are identical, which causes problems for downstream analysis. If I do not pass --skip-technical, I get some technical Illumina reads that have nothing to do with biology (like Application Read Forward -> Technical Read Forward <- Application Read Reverse - Technical Read Reverse). --read-filter pass gets rid of reads with multiple Ns.
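For reference, the command above can be written as an annotated Python argument list. This is only a sketch: the name `fastq_dump_args` and the default binary name are assumptions made for illustration, not part of GEOparse, and the one-line notes summarize each flag as described in this comment and in the fastq-dump help text.

```python
import shlex

# Flags taken from the command quoted above.
FASTQ_DUMP_FLAGS = [
    "--skip-technical",       # dump only biological reads
    "--gzip",                 # gzip-compress the output
    "--readids",              # append the read number to the spot ID
    "--read-filter", "pass",  # keep only reads whose filter value is "pass"
    "--dumpbase",             # format sequences in base space
    "--split-files",          # write each read to a separate file
    "--clip",                 # apply left and right clips
]


def fastq_dump_args(sra_file, binary="fastq-dump"):
    """Hypothetical helper: return the full argument list for one SRA file."""
    return [binary] + FASTQ_DUMP_FLAGS + [sra_file]


if __name__ == "__main__":
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(fastq_dump_args("example.sra")))
```

An argument list like this can be handed to `subprocess.run` directly, which avoids shell quoting issues with unusual file names.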

guma44 commented 7 years ago

Hey, thanks for these comments. It was working for me as it was, but I think your proposal is a valid improvement. I will implement it.

guma44 commented 7 years ago

So my proposal is to add a "fastq_dump_options" argument to the download_SRA function. It would be a dictionary in which a user could pass the long forms of the arguments with their corresponding values as key: value pairs, e.g.

fastq_dump_options = {
    'split-files': None,
    'readids': None,
    'read-filter': 'pass',
    'dumpbase': None
}

If the value is not truthy, the option would be passed without any value. Anybody would then be able to override these values.
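A minimal sketch of how such a dictionary could be turned into a command line follows. The helper name `build_fastq_dump_cmd`, its signature, and the choice of defaults (taken from the current `--split-files --gzip` command) are assumptions for illustration, not the actual GEOparse implementation.

```python
def build_fastq_dump_cmd(sra_path, outdir, options=None, binary="fastq-dump"):
    """Hypothetical helper: build a fastq-dump argument list from a dict.

    Keys are long option names without the leading dashes. A non-truthy
    value (for example None) means the option is passed as a bare flag.
    """
    # Assumed defaults, mirroring the existing "--split-files --gzip" command.
    merged = {"split-files": None, "gzip": None}
    # User-supplied options override or extend the defaults.
    merged.update(options or {})

    cmd = [binary]
    for name, value in merged.items():
        cmd.append("--" + name)
        if value:
            cmd.append(str(value))
    cmd += ["--outdir", outdir, sra_path]
    return cmd


if __name__ == "__main__":
    user_options = {"readids": None, "read-filter": "pass", "dumpbase": None}
    print(" ".join(build_fastq_dump_cmd("example.sra", "out", user_options)))
```

In this sketch a user can add options or change their values, but cannot drop a default flag; whether that should be possible is a design choice the proposal leaves open.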

What do you think?

guma44 commented 7 years ago

Hey, as there was no comment on this, I implemented it the way I described above.