ncbi / sra-tools

SRA Tools

no clip option for fasterq-dump #952

Open paulzierep opened 1 month ago

paulzierep commented 1 month ago

We discovered that FASTQ files downloaded from NCBI SRA via fasterq-dump differ from the FASTQ files stored at ENA. After some digging, this is probably due to the --clip option.

Example downloaded from: https://www.ebi.ac.uk/ena/browser/view/DRR010705

@DRR010705.1 HUMWT9A01AC2YA/4
ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATCTTGCGCTCCTTGGTATTCCTTGGAGCATGCCTGTTTGAGTATCATGAGCAAATCTCAAAGTCAATTCCTTAATTGGTTTTGCTTTGGACTTGGAGGTCTTGCAGATTTCACAGTCTGCTCCTCTTAAATGCATTAGCTGGATCTCAGTAATTATGCTTGGTTCCACTCGGCGTGATAAGTATCACTCGCTGAGGACACTGTTAAAAAGGTGGCCAGGAAATTACTGATTGAACCGCTTCTAACGGTCTATTAAGTTGGACAATTGACCCCTTAAGTTTGATCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAAGAAACCAACAGGGATTGCCTTAGTAACGGCGGGTGAAGCGGCAACAGCTCAAATTTGAAATCTGGCTCTTTCAGGGTCCGAGTTGTAATTTGTAGAAGT
+
EIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHFIIIIIIIIIIIIIIHBBBHDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHCCEECCBBIIIIIIIIADDIIICCEIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDCHIIIIIIIIIIIIIIIIIIIIIIHDAADDIIAAA;;AAIAADDACIIAAAA@IIICCAECCICAAACAAAAIBBBBA>>>??@????AA899999;@;;????A?87777<A=:666=<<<;444;<AB996=;AA<<99999<?==;;;8331021..,,,..0..,,,//000.,,,,//1////1186/...1353;8<:7733357:8:777555544841111233310011464333331101440,,,,,.-444221

Default download via fasterq-dump (same read, but the sequence and quality strings are two characters longer at the 3' end):

@HUMWT9A01AC2YA/4
ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATCTTGCGCTCCTTGGTATTCCTTGGAGCATGCCTGTTTGAGTATCATGAGCAAATCTCAAAGTCAATTCCTTAATTGGTTTTGCTTTGGACTTGGAGGTCTTGCAGATTTCACAGTCTGCTCCTCTTAAATGCATTAGCTGGATCTCAGTAATTATGCTTGGTTCCACTCGGCGTGATAAGTATCACTCGCTGAGGACACTGTTAAAAAGGTGGCCAGGAAATTACTGATTGAACCGCTTCTAACGGTCTATTAAGTTGGACAATTGACCCCTTAAGTTTGATCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAAGAAACCAACAGGGATTGCCTTAGTAACGGCGGGTGAAGCGGCAACAGCTCAAATTTGAAATCTGGCTCTTTCAGGGTCCGAGTTGTAATTTGTAGAAGTAG
+
EIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHFIIIIIIIIIIIIIIHBBBHDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHCCEECCBBIIIIIIIIADDIIICCEIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDCHIIIIIIIIIIIIIIIIIIIIIIHDAADDIIAAA;;AAIAADDACIIAAAA@IIICCAECCICAAACAAAAIBBBBA>>>??@????AA899999;@;;????A?87777<A=:666=<<<;444;<AB996=;AA<<99999<?==;;;8331021..,,,..0..,,,//000.,,,,//1////1186/...1353;8<:7733357:8:777555544841111233310011464333331101440,,,,,.-44422100

Unfortunately, there seems to be no clip parameter for fasterq-dump. Any idea how to generate reads identical to the ones stored in ENA?
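
One way to check whether two copies of a read differ only by bases trimmed from the 3' end is sketched below. This is a minimal illustration, not part of sra-tools: the helper names `parse_fastq_record` and `trailing_clip` are hypothetical, and the two records are short toy data rather than the DRR010705 reads above.

```python
def parse_fastq_record(text):
    """Parse one four-line FASTQ record into (name, sequence, quality)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if len(lines) != 4 or not lines[0].startswith("@") or not lines[2].startswith("+"):
        raise ValueError("expected a four-line FASTQ record")
    name, seq, _, qual = lines
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return name[1:], seq, qual


def trailing_clip(full, clipped):
    """Return how many 3' bases `clipped` lacks relative to `full`.

    Returns None when `clipped` is not a prefix of `full`, i.e. when the
    two records differ by something other than trailing bases.
    """
    _, full_seq, full_qual = full
    _, clip_seq, clip_qual = clipped
    if full_seq.startswith(clip_seq) and full_qual.startswith(clip_qual):
        return len(full_seq) - len(clip_seq)
    return None


# Toy records (hypothetical data): the second one lacks the last two bases.
unclipped = parse_fastq_record("@read1\nACGTACGTAG\n+\nIIIIIIII00\n")
clipped = parse_fastq_record("@RUN.1 read1\nACGTACGT\n+\nIIIIIIII\n")
print(trailing_clip(unclipped, clipped))  # 2
```

The comparison deliberately ignores the read name, since the deflines in the two files above are formatted differently.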

See also: https://github.com/galaxyproject/tools-iuc/issues/6171 as we're trying to use that for Galaxy.

wraetz commented 1 month ago

Unfortunately, fasterq-dump does not support a clip option, but the older fastq-dump does.
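
As a rough sketch, the fastq-dump invocation with clipping enabled could be assembled like this. The helper `fastq_dump_clip_command` is hypothetical and only builds the command line; it does not run it. Whether `--clip` alone reproduces the ENA file is an assumption that this thread does not confirm, since other options may also differ.

```python
import shlex


def fastq_dump_clip_command(accession):
    """Build (but do not run) a fastq-dump command with --clip enabled."""
    # --clip is an option of the older fastq-dump tool, not of fasterq-dump.
    return ["fastq-dump", "--clip", accession]


# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(fastq_dump_clip_command("DRR010705")))
```

Passing the list to `subprocess.run` would execute it, provided fastq-dump is installed and on the PATH.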

durbrow commented 1 month ago

In general, without reproducing the options used, you can't compare the result of two runs of fastq-dump or fasterq-dump. Does EBI document what options they used when generating the fastq file you downloaded from them?

Is there a reason you need clipping in fasterq-dump besides reproducing the file from EBI?