hackseq / hackseq_2017

Public files and discussions about hackseq 2017

Compute resources #11

Closed by sjackman 7 years ago

sjackman commented 7 years ago

Lauren wrote…

Hi Shaun!

I'm starting to look into computational requirements for this year's HackSeq and Jasleen told me you took the lead on this last year.

I was hoping to get any feedback you have in terms of what you had set up last year (I understand you used a GSC-based cluster), what the requirements were like, what was good/bad about it, and if you have any other advice. Jasleen also told me you were able to get some credits from Amazon web services, can you give me a bit more info about that as well? Was that used for computational processing as well?

If you want to meet in person for coffee this week and chat instead of talking by email, just let me know.

Thanks for your help! Lauren

sjackman commented 7 years ago

Hi, Lauren. This'll be a quick response. I'll write you a more detailed response tomorrow. Coffee or lunch would be great. Read or skim this GitHub issue: https://github.com/hackseq/October_2016/issues/6

sjackman commented 7 years ago

Hi, Lauren. We used Amazon AWS EC2 and the BCGSC Docker service called ORCA. ORCA had the lowest barrier to entry, just ssh in, and it's ready to go with a bunch of preinstalled bioinformatics software. The system got overloaded last year due to improper load balancing. Hopefully that's better this year, but it's still going to be a shared system, so there's potential for one group to take the machine down and affect other groups. Last year ORCA didn't offer X11 forwarding (ssh -X) for things like IGV. It would be best to get that fixed for this year.

AWS is more robust in that manner, since each team has their own instance, and they're not sharing resources. With AWS we didn't offer any training services, so it worked well for the team leaders that already had AWS experience, but not so much otherwise. Before the event, each team leader should start up their AWS instance, install software and download data. They can then save that disk and suspend the instance (so it's not costing too much) until the actual event.

My team mostly used their own laptops if they had Macs, and used ORCA if they had Windows laptops. Using your own laptops works great so long as the data is small, and they have a way of sharing between themselves (GitHub, Dropbox, AirDrop, etc.). If the data is large, it can quickly saturate the WiFi network. Ensure that the participants have downloaded the necessary software and data to their laptops before they arrive at the event.
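To get a rough sense of how quickly a shared network saturates, here is a back-of-envelope sketch. The participant count, dataset size and link speed are hypothetical numbers, not measurements from the event, and the model assumes everyone shares one link's bandwidth evenly with no protocol overhead.

```python
def download_minutes(data_gb, participants, shared_mbit_per_s):
    """Minutes for every participant to fetch data_gb over one shared link.

    Assumes the link's bandwidth is the only bottleneck and is fully used.
    """
    total_megabits = data_gb * 8 * 1000 * participants  # 1 GB = 8000 megabits
    seconds = total_megabits / shared_mbit_per_s
    return seconds / 60


# Hypothetical scenario: 50 participants each pulling a 10 GB dataset
# over a venue link that delivers 100 Mbit/s in total.
minutes = download_minutes(data_gb=10, participants=50, shared_mbit_per_s=100)
print(f"{minutes / 60:.1f} hours")
```

Under those assumed numbers the downloads would take about 11 hours in total, which is why fetching the data before the event matters.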

Cheers, Shaun

sjackman commented 7 years ago

Apply for AWS credits through https://aws.amazon.com/research-credits/

Our application from last year can be found at https://github.com/hackseq/October_2016/issues/6#issuecomment-229832732

http://calculator.s3.amazonaws.com/index.html#r=PDX&s=EC2&key=calc-98B78ADC-E35B-45D2-B120-B953469DF96C

lchong commented 7 years ago

Hey Shaun, thanks for all the help so far.

I forgot to ask you, who would be the primary contact at GSC to discuss ORCA usage?

sjackman commented 7 years ago

No worries at all. The web page for ORCA is http://www.bcgsc.ca/services/orca. I've sent you more contact info by e-mail.

lchong commented 7 years ago

Of the 10 applications, only 3 specified HPC requirements.

Machine requests: 6 m4.large (2 CPU/8 GB RAM), 2 m4.large, none

Hard drive sizes per machine (GB): 50, 5, 10

Data sizes (GB): 10, 5, 1

Since the m4.large is the smallest M4 instance and one team requested 6, I'm going to request larger machines so that each team can share a single instance. Six m4.large instances add up to 12 CPUs and 48 GB RAM, and the smallest machine that meets that is the m4.4xlarge (16 CPU/64 GB RAM). I'll request 14 of these (in case each team ends up needing one, plus a few extra). I'm also going to request 4 of the m4.10xlarge (40 CPU/160 GB RAM) in case anyone has underestimated.

For hard drives, I'm requesting 14 × 1 TB drives and 4 × 2 TB drives. Again, this is almost certainly overkill based on the applications, but I'd rather be over-prepared.
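The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the M4 sizes are hard-coded from AWS's published instance specifications (worth re-checking against current documentation), and the helper name `smallest_instance` is made up for this example.

```python
# (name, vCPUs, RAM in GiB) for the M4 family, ordered smallest to largest.
M4_FAMILY = [
    ("m4.large", 2, 8),
    ("m4.xlarge", 4, 16),
    ("m4.2xlarge", 8, 32),
    ("m4.4xlarge", 16, 64),
    ("m4.10xlarge", 40, 160),
    ("m4.16xlarge", 64, 256),
]


def smallest_instance(cpus_needed, ram_needed_gb):
    """Return the first (smallest) M4 type that meets both requirements."""
    for name, cpus, ram_gb in M4_FAMILY:
        if cpus >= cpus_needed and ram_gb >= ram_needed_gb:
            return name
    raise ValueError("no single M4 instance is large enough")


# Largest request among the applications: 6 x m4.large.
cpus_needed = 6 * 2
ram_needed_gb = 6 * 8
choice = smallest_instance(cpus_needed, ram_needed_gb)

# Requested fleet: instance type -> (count, disk size in TB per instance).
fleet = {"m4.4xlarge": (14, 1), "m4.10xlarge": (4, 2)}
total_disk_tb = sum(count * disk_tb for count, disk_tb in fleet.values())

print(choice, cpus_needed, ram_needed_gb, total_disk_tb)
```

The m4.2xlarge (8 CPU/32 GB) falls short of 12 CPUs and 48 GB, so the search lands on the m4.4xlarge, and the requested drives total 22 TB.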

Last year Shaun also specified data transfer of up to 5 TB in and out, so I'll keep this the same in case some teams end up wanting to transfer large datasets.

Final cost calculator estimate: http://calculator.s3.amazonaws.com/index.html#key=calc-99694B7A-C600-4949-B142-A47C10B099D6&r=PDX&s=EC2

lchong commented 7 years ago

Submitted the application on June 29:

We are organizing the second annual Hackseq Genomics Hackathon (http://www.hackseq.com), which will take place from Oct 20-22 in Vancouver, Canada. This year we will have 10 teams of approximately 5-10 participants, for an estimated total of 50-100 participants.

The project leaders have estimated the compute requirements for each of their projects. The total estimated requirements for all projects is 14 m4.4xlarge and 4 m4.10xlarge instances, with 14 1Tb hard drives and 4 2Tb hard drives. This will provide an instance for each team that is powerful enough for them to share, plus a few additional machines in case any teams have underestimated their requirements. The AWS resources will mainly be required for the 72-hour hackathon event, but will also require a bit of setup (software installation and data transfer) prior to the event.

Teams will collaborate through GitHub, and all projects and results will be publicly accessible. We will share updates and results through social media (primarily Twitter) and the hackseq website, and we plan to submit a report to F1000 summarizing the event and the projects, as was done last year (https://f1000research.com/articles/6-197/v2).

Many participants will not have used AWS before, and this workshop will introduce those participants to AWS. The participants may choose to continue to use AWS after the workshop has completed.

I have provided my personal AWS account number, but could also set up a new AWS account for hackseq going forward.

Thank you in advance for your consideration, Lauren Chong - on behalf of the hackseq steering committee

Keywords: bioinformatics, hackathon, open science, collaboration

lchong commented 7 years ago

Some notes about meeting with the GSC ORCA team (Aug 25, 2017):

sjackman commented 7 years ago

@tmozgach Can you please point Lauren (@lchong) to the list of software installed on ORCA?

tmozgach commented 7 years ago

@lchong here is the current list of software: https://github.com/bcgsc/orca/blob/master/versions.tsv