open2c / distiller-nf

A modular Hi-C mapping pipeline
MIT License

sra download seems to be broken now - some of the ftp links lead nowhere #145

Open sergpolly opened 5 years ago

sergpolly commented 5 years ago

@Marlies1993 was running distiller with some SRA runs as input, and the pipeline kept crashing at the SRA download step. On closer inspection, it appears that some of the links of this form are broken: https://github.com/mirnylab/distiller-nf/blob/01f6f7bbc4b1edfc3634c131f709b08a40164c74/distiller.nf#L176

For example, take SRR027959 from the 2009 Hi-C paper:

venevs@vangogh ➜  ~ wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR027/SRR027959/SRR027959.sra                 
--2019-11-13 18:12:32--  ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR027/SRR027959/SRR027959.sra
           => ‘SRR027959.sra.1’
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.10, 2607:f220:41e:250::11
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR027/SRR027959 ... 
No such directory ‘sra/sra-instant/reads/ByRun/sra/SRR/SRR027/SRR027959’.
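
For reference, the URL in the wget call above appears to follow a fixed layout: the first three characters of the accession, then the first six, then the full accession. Below is a minimal Python sketch that rebuilds such a URL. The helper name `sra_ftp_url` is hypothetical, and the layout is inferred only from the single SRR027959 example in this thread, so it may not hold for every accession.

```python
# Hypothetical helper: rebuilds the FTP path layout seen in the wget call above.
# The layout is inferred from one example (SRR027959) and is an assumption.
FTP_ROOT = "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra"


def sra_ftp_url(accession: str) -> str:
    """Return the sra-instant style URL for a run accession such as 'SRR027959'."""
    acc = accession.strip().upper()
    # Expect three letters followed by digits, e.g. SRR027959.
    if len(acc) < 7 or not acc[:3].isalpha() or not acc[3:].isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Layout: <root>/<first 3 chars>/<first 6 chars>/<accession>/<accession>.sra
    return f"{FTP_ROOT}/{acc[:3]}/{acc[:6]}/{acc}/{acc}.sra"


if __name__ == "__main__":
    print(sra_ftp_url("SRR027959"))
```

A URL built this way can still point at a directory that no longer exists on the server, which is exactly the failure shown in the wget log.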

I don't know enough about SRA files or why we download them with wget. Can anyone explain?

@Marlies1993 can comment and provide other examples if needed

Phlya commented 5 years ago

Had the same problem. Here is what SRA says about it: [screenshot: Screenshot_20191113-234245]

meoomen commented 5 years ago

Thanks @Phlya! When I checked the SRA links one by one, I also found that some of them were missing from ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR. After I removed the missing SRA runs from my project.yml, my distiller project ran fine, but of course this doesn't solve the problem...

Phlya commented 5 years ago

I just downloaded those missing ones manually... Wget download is much faster than the regular sra tools, but maybe in case of this problem distiller should fall back to fastq-dump?

meoomen commented 5 years ago

Yes, I figured I will have to do manual download for now as well. Thanks!

golobor commented 5 years ago

https://github.com/mirnylab/distiller-nf/commit/2f259f58a69b3063d453a86d4bd5552c4d2c8d4c

This update will make distiller use fastq-dump if wget fails. Does it look good? Are there any other fixes we could implement in the downloading process (e.g. trying multiple URLs) while we're at it?
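
To illustrate the idea (try wget first, possibly over several URLs, then fall back to fastq-dump), here is a rough Python sketch of the control flow. It is not the code from the commit above: `run_ok` and `fetch_run` are hypothetical names, the candidate URL list is supplied by the caller, and the fastq-dump options shown (`--split-files`, `--gzip`) are just one plausible choice.

```python
import subprocess
from typing import Callable, Sequence


def run_ok(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    try:
        return subprocess.run(cmd, check=False).returncode == 0
    except FileNotFoundError:
        # The tool is not installed; treat this as a failed attempt.
        return False


def fetch_run(
    accession: str,
    urls: Sequence[str],
    run: Callable[[Sequence[str]], bool] = run_ok,
) -> str:
    """Try each candidate URL with wget, then fall back to fastq-dump.

    Returns the name of the tool that succeeded. This only sketches the
    fallback order; converting a downloaded .sra file to FASTQ is omitted.
    """
    for url in urls:
        # -q: quiet, -O: write to the given output file
        if run(["wget", "-q", "-O", f"{accession}.sra", url]):
            return "wget"
    # Fallback: let fastq-dump fetch the run by accession.
    if run(["fastq-dump", "--split-files", "--gzip", accession]):
        return "fastq-dump"
    raise RuntimeError(f"could not download {accession} with wget or fastq-dump")
```

Passing the command runner as a parameter keeps the fallback logic testable without network access; in real use the default `run_ok` would invoke the actual tools.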


golobor commented 5 years ago

Btw, here is another trick to force using fastq-dump, in project.yml specify input as: library1: lane1:

sergpolly commented 5 years ago

Thank you @golobor and @Phlya ! that was uber quick!

> Does it look good?

we are about to try it - we'll let you know here

> Any other fixes we could implement to the downloading process (e.g. try multiple URLs), while we're at it?

Hmm, I do this so rarely that I don't really know what to say. Maybe @Phlya has suggestions? MirnyLab people? If anything, I would like a reminder of why we aren't doing it the Nextflow way: https://www.nextflow.io/docs/edge/channel.html#fromsra https://www.nextflow.io/blog/2019/release-19.03.0-edge.html. Is it because we didn't have time to do it, or because there is something wrong with it?

golobor commented 5 years ago

Maybe it's worth switching. Is anyone interested in implementing it? :) I think there was a time when it was returning non-gzipped files, but there has been much improvement lately, so we should probably consider switching.


sergpolly commented 5 years ago

Worked for me on the original 2009 Hi-C data. I guess @Marlies1993 will report here once she tries it as well. [screenshot: Screenshot from 2019-11-14 13-07-01]

Thank you, again!

> maybe it's worth switching, anyone is interested in implementing? :)

sounds fun to me - should simplify some of the distiller.nf code - not sure about timeline requirements though ...

meoomen commented 5 years ago

Thanks for fixing this so quickly! All my sra's are downloading and mapping as well.

golobor commented 4 years ago

a more reliable fix: https://github.com/mirnylab/distiller-nf/commit/55b5e6e1235d200cb132be1c198ffc493f988909