seqcode / pegr

Platform for Eukaryotic Genome Regulation
MIT License

Running analysis pipeline without access to full sequencer output? #227

Open shaunmahony opened 3 years ago

shaunmahony commented 3 years ago

Does PEGR support cases where the data has been generated on a third-party sequencer (e.g., at a core facility) and the PEGR users/admins don't have access to the complete sequencer output? For example, many sequencing cores just provide the fastq files. Can the PEGR analysis pipeline be configured to run on a directory containing fastq files where the RunCompletionStatus.xml file never appears? I don't think this use case is described on the wiki pages.

grettadarmstrong commented 3 years ago

Thanks for asking and pointing out missing elements in the wiki pages so that we can address it! I hope you don't mind, I'll answer the questions with what I know and set the label for 'documentation' so that we can add it to the wiki pages.

Two questions. 1st part: Yes, and I should explain further. We have that situation here now: we use a core sequencing center on campus that provides access to the fastq file results either via a login to their file systems (we have a service account for PEGR that is able to log in and pull files) or via a Globus access point. Globus is how we actually pull the data from their services over to ACI-ICDS, into our folders, so that the Galaxy server on ROAR can process the workflow pipelines.

2nd part: Yes. In this case any workflows can simply be kicked off manually. And if you are asking about this, then I am guessing that the instructions for doing that are also missing from the wiki? Having said that, I believe PEGR could be configured to recognize a new subfolder in the designated folder instead of looking for a "RunCompletionStatus.xml" file. That is something I would want @dshao to comment on. I believe PEGR now tracks the last folder that it scanned and used to download files from the sequencer, so I think there could be functionality to support a slightly different way to automate intake and workflows. I'm not sure if that benefits you at all: if you are manually moving data into a subfolder, then clicking a button to start up workflows doesn't seem like an issue.
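For illustration only, the "recognize a new subfolder" idea could look something like the Python sketch below. This is not PEGR code and does not reflect how PEGR scans folders; the function name `find_new_run_folders`, the `already_seen` set, and the list of fastq suffixes are all hypothetical choices made for the example.

```python
from pathlib import Path

# Assumed fastq naming conventions; adjust to whatever the core facility delivers.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def find_new_run_folders(intake_dir, already_seen):
    """Return names of subfolders of intake_dir that hold fastq files
    and are not listed in already_seen (hypothetical helper, not PEGR's API)."""
    new_folders = []
    for sub in sorted(Path(intake_dir).iterdir()):
        # Skip loose files and folders that were handled on an earlier scan.
        if not sub.is_dir() or sub.name in already_seen:
            continue
        # A folder counts as a run only if it directly contains a fastq file.
        has_fastq = any(
            f.is_file() and f.name.endswith(FASTQ_SUFFIXES) for f in sub.iterdir()
        )
        if has_fastq:
            new_folders.append(sub.name)
    return new_folders
```

A scheduled job could call this with the set of folder names it has already processed and start workflows for whatever comes back. Note that, unlike waiting for a completion file, this check cannot tell whether a copy into the subfolder is still in progress.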

PLEASE let me know if I didn't address any or all of your questions. I'm happy to clarify further if needed. And thanks for pointing out the holes in the current instructions.

Note: we have yet to avail ourselves of the API functionality that Globus supports for automating this process. That could be another way to support checking a known folder path and, if a new subfolder exists, moving over the sequencing data. That is our plan for this Fall. This is only a benefit if the core sequencing service has a Globus endpoint, but those are easy to set up, and PSU also has full Globus licensing, so it is an option that I would highly recommend. Moving data files for dozens of human samples can still incur data loss due to transfer issues over the network, even on campus. Having automated checksums and verification provided through Globus, versus just scp or curl, is a nice benefit and an easy sanity check.
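For transfers done with plain scp or curl, a manual equivalent of that sanity check might look like the Python sketch below. It assumes the sequencing core ships a checksum manifest in the common `md5sum` format (one `<hash>  <filename>` pair per line) alongside the fastq files; whether a given core provides one is an assumption, and `verify_manifest` is a hypothetical helper, not part of PEGR or Globus.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return the file names whose checksum does not match the manifest,
    or that are missing. Paths are resolved relative to the manifest's folder."""
    manifest_path = Path(manifest_path)
    failures = []
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary-mode entries with '*'
        target = manifest_path.parent / name
        if not target.is_file() or file_md5(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list means every file named in the manifest arrived intact; anything else is a list of files to transfer again before starting workflows.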

shaunmahony commented 3 years ago

Thanks, Gretta! I'll hopefully get to test this mode out myself when I have an instance up and running. So I'll assign this issue to myself and add to the documentation when I get more insight. Leaving open in the meantime.