pjgreer / ukb-rap-tools

Scripts and workflows for use analyzing UK Biobank data from the DNANexus Research Analysis Platform

Too big memory? #15

Closed TrumanZYX closed 9 months ago

TrumanZYX commented 11 months ago

Dear, I use the WES scripts,

pjgreer commented 11 months ago

Q1:

How many is 30W? The rvtests QC script is not very CPU intensive, but it is demanding of both RAM and disk space, and mostly disk space. For any scripts that are failing, I would increase the disk size of the VM and try again. The QC script has a "default" section and a "large chromosome" section for the larger chromosomes. The default size uses a 600GB drive, while the larger size uses 1.2TB. You may find that more of the default chromosomes need to be bumped up to the large size.

An earlier question asked about outputting a compressed VCF. The issue there is that the "chr" prefix needs to be removed from every line of the VCF file. There may be a way to do this straight from plink, BUT THIS IS COMPLETELY UNTESTED. plink will output chromosome names as numeric only with the "--output-chr 26" flag, thereby skipping the sed command, and you can change "--recode vcf-iid" to "--export vcf bgz", which SHOULD export a bgzip-compressed VCF file with chromosomes as just a number. You would then need to index the file with tabix. BUT I MUST POINT OUT THAT THIS IS COMPLETELY UNTESTED AND MAY PRODUCE OTHER ISSUES LATER IN THE ANALYSIS, IF IT WORKS AT ALL.

I have chosen the set of commands in this script because I know that they worked for me.
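
To make the "chr" prefix removal concrete, here is a minimal Python sketch of the idea. It is not the sed command from the script; the function name `strip_chr_prefix` and the decision to also rewrite `##contig` header lines are assumptions made for illustration only.

```python
import re

# Matches the start of a contig declaration whose ID begins with "chr".
CONTIG_RE = re.compile(r"^(##contig=<ID=)chr")


def strip_chr_prefix(line: str) -> str:
    """Return a VCF line with the leading 'chr' removed from the chromosome name."""
    if line.startswith("##"):
        # Meta lines: only contig declarations carry a chromosome name.
        return CONTIG_RE.sub(r"\1", line)
    if line.startswith("#"):
        # The #CHROM column header line holds no chromosome value.
        return line
    # Data lines: CHROM is the first tab-separated column.
    return line[3:] if line.startswith("chr") else line


if __name__ == "__main__":
    example = [
        "##contig=<ID=chr22,length=50818468>",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr22\t16050075\t.\tA\tG",
    ]
    for row in example:
        print(strip_chr_prefix(row))
```

Applied line by line to an uncompressed VCF, this leaves every other column untouched, which is the same effect the sed step is there to achieve.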

Q2:

The problem with the bgen RSID field in both TOPMED and GEL is that many entries do not have an ID and are set to a period "." for missing. If you attempt to use the rsID from the bgen file, you end up with >10,000 variants with the same name, namely ".".
In the 11a script under the TOPMED-plink workflow, all of the names are set to chr:pos:ref:alt, just as Allysa Clay-Gilmour described in the last post on the above thread.

In a way, that is better than using the RSID, because it explicitly names multiallelic SNPs in a more correct manner, and it is the naming convention that many large bioinformatics databases are moving towards (see gnomAD).
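
A small Python sketch of why the chr:pos:ref:alt name avoids the collision. The records and the helper name `make_variant_id` are made up for illustration, and whether the chromosome field carries a "chr" prefix is an assumption that depends on your files.

```python
from collections import Counter


def make_variant_id(chrom, pos, ref, alt):
    """Build a chr:pos:ref:alt style name from the variant's own fields."""
    return f"{chrom}:{pos}:{ref}:{alt}"


# Hypothetical records: (chrom, pos, ref, alt, rsid). Two alleles at the
# same position both have a missing rsID, recorded as ".".
records = [
    ("1", 1000, "A", "G", "."),
    ("1", 1000, "A", "T", "."),
    ("2", 5000, "C", "T", "rs_example"),
]

# Using the rsID field directly gives duplicate names for missing entries.
rsid_counts = Counter(rec[4] for rec in records)

# Using chr:pos:ref:alt gives each allele its own name.
new_ids = [make_variant_id(*rec[:4]) for rec in records]

if __name__ == "__main__":
    print(rsid_counts)
    print(new_ids)
```

The two alleles at position 1000 share the name "." under the rsID scheme but get distinct names once the alleles are part of the ID.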

-Phil

pjgreer commented 10 months ago

The only way to decrease the RAM requirement is to use fewer samples. That could be by using a subset of the data, or by running the conversion in 10 batches of 30K subjects. Either way, you need to use a smaller number of subjects at a time.
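
A minimal Python sketch of the batching idea, assuming you already have the full list of sample IDs. The IDs below are generated placeholders and `split_into_batches` is a hypothetical helper, not part of the scripts in this repository.

```python
def split_into_batches(sample_ids, batch_size):
    """Split a list of sample IDs into consecutive batches of at most batch_size."""
    return [
        sample_ids[start:start + batch_size]
        for start in range(0, len(sample_ids), batch_size)
    ]


# Placeholder IDs standing in for a 300K-sample cohort.
sample_ids = [f"S{i:06d}" for i in range(300_000)]

# 10 batches of 30K subjects each.
batches = split_into_batches(sample_ids, 30_000)

if __name__ == "__main__":
    for number, batch in enumerate(batches, start=1):
        print(f"batch {number}: {len(batch)} samples, first ID {batch[0]}")
```

Each batch could then be written to its own file and used as a sample filter (for example with plink's --keep) so that the conversion runs once per batch.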

Also, running rvtests on 300K samples will take forever.

You seldom need more than 50-60K subjects to adequately power an analysis. For the most part, you should try to keep the number of