Quality control on unsplitted pVCF

dianacornejo commented 1 year ago

Hello @eugenegardner, I'm trying to do some QC on the pVCF exome data and I'm looking at the approach you took. Do you have an idea of how much it would cost to analyze the pVCF chunks as they are provided by the UKB-RAP? Also, how do you calculate the instance resource allocation? I'm very new to analyzing data on the RAP and I'm still unsure how to do this. Thanks in advance!

ejgardner-insmed commented 1 year ago

Hello,

Depending on whether you can get low- or high-priority instances (see https://dnanexus.gitbook.io/uk-biobank-rap/working-on-the-research-analysis-platform/managing-job-priority), the cost can range from ~£700 to ~£3000.
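
As a rough illustration of how an estimate like this can be put together, the sketch below multiplies the number of files by the instance-hours per file and an hourly rate. It is only a back-of-the-envelope model, and every number in it is a made-up placeholder (not a real RAP price or a measured runtime); substitute the rates for your chosen instance type and priority, and runtimes from a small test run.

```python
def estimate_cost(n_files, hours_per_file, rate_per_hour):
    """Total cost = number of files x instance-hours per file x hourly rate."""
    return n_files * hours_per_file * rate_per_hour


# Placeholder inputs only: NOT real RAP prices or measured runtimes.
N_FILES = 1000             # hypothetical number of pVCF chunks to process
HOURS_PER_FILE = 2.0       # hypothetical instance-hours needed per chunk
LOW_PRIORITY_RATE = 0.25   # hypothetical GBP per instance-hour
HIGH_PRIORITY_RATE = 1.00  # hypothetical GBP per instance-hour

low = estimate_cost(N_FILES, HOURS_PER_FILE, LOW_PRIORITY_RATE)
high = estimate_cost(N_FILES, HOURS_PER_FILE, HIGH_PRIORITY_RATE)
print(f"low priority:  ~£{low:,.0f}")
print(f"high priority: ~£{high:,.0f}")
```

Timing one or two chunks on a candidate instance type first gives a realistic value for the hours-per-file input before committing to the full run.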

I would recommend taking a look at my other repository to see how the full workflow proceeds, and supplement this information with information from the individual READMEs as you go along.

To be clear, you will need some computational expertise to be able to do this, but I've tried to simplify the process as much as possible. I see your other comment in #3, and I would advise that you brush up on basic UNIX/BASH/Terminal before working on this project.

dianacornejo commented 1 year ago

@ejgardner-adrestia thanks for your reply. I have some computational experience; I'm just not used to cloud computing. All of our previous pipelines were developed in SoS and tested and run on a cluster, so translating everything to WDL seems like it will take a while to learn, and allocating the correct computing resources also seems like a hard job (I've been experimenting with small working examples and I got them to run on the RAP, so I guess that's good news).

I have another question. Would you recommend against running the pVCFs as UKB provided them? I understand that's a 2.5x increase in sample size, but how much more resources does it take to run? I'm not sure whether you started your analysis by processing these big chunks; if so, could you share a little of your experience? I would appreciate it a lot.

Thank you

ejgardner-insmed commented 1 year ago

Hello @dianacornejo,

Apologies for the incorrect assumption; it was just a bit difficult to tell from the other issue you posted.

I wouldn't consider the RAP 'cloud computing' in the traditional AWS sense. I think of it as abstracted cloud computing: DNAnexus allows you to point and click your way to some analyses, or to bypass this and write your own applets (as I have done), which requires a fairly deep knowledge of how to interface with the dxpy API and app toolkit. My hope is that the QC workflow I mention above can provide a reasonable interface to replicate what I have done.

I did not use the pVCFs as UKB provided them and instead wrote this 'bcfsplitter' applet, because the original files:

- were too big to debug when I was doing initial development.
- would take a very long time to run, requiring a lengthy time on a single cloud instance, and thus cost more money.
- could potentially crash VEP when loading in >5k variants, since VEP is written in Perl and generally has bad memory management for files this large.
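
To make the idea of splitting concrete, here is a minimal sketch of breaking a VCF into chunks that each hold at most a fixed number of variant records, with the header repeated in every chunk. This is not the applet's actual code: the function name `split_vcf_lines` and its `max_variants` parameter are hypothetical, and it operates on plain-text VCF lines held in memory rather than on the compressed files stored on the RAP.

```python
from typing import Iterable, Iterator, List


def split_vcf_lines(lines: Iterable[str], max_variants: int) -> Iterator[List[str]]:
    """Yield chunks of VCF lines: the full header plus at most max_variants records."""
    if max_variants < 1:
        raise ValueError("max_variants must be >= 1")
    header: List[str] = []
    records: List[str] = []
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines are copied into every chunk.
            header.append(line)
            continue
        records.append(line)
        if len(records) == max_variants:
            yield header + records
            records = []
    if records:
        # Emit the final, possibly smaller, chunk.
        yield header + records


# Tiny made-up example: two header lines and three variant records.
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\t.\t.\n",
    "1\t200\t.\tC\tT\t.\t.\t.\n",
    "1\t300\t.\tG\tA\t.\t.\t.\n",
]
chunks = list(split_vcf_lines(demo, max_variants=2))
print([len(chunk) - 2 for chunk in chunks])  # records per chunk: prints [2, 1]
```

Smaller chunks like these are easier to debug and keep each downstream job short, at the price of managing more files.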

You may be able to, in theory, skip the splitbcf step and run the pVCF directly through, but your mileage may vary. Happy to answer any more questions you may have.

mrcepid-rap / mrcepid-bcfsplitter
