pjgreer / ukb-rap-tools

Scripts and workflows for use analyzing UK Biobank data from the DNANexus Research Analysis Platform

what is the difference between these files #12

Closed TrumanZYX closed 11 months ago

TrumanZYX commented 11 months ago

Dear, do you know if it is possible

pjgreer commented 11 months ago

Based on the info here: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=170

ukb23157 is in pVCF format (.vcf.gz, .vcf.gz.tbi)

ukb23158 is in plink format (.bed, .bim, .fam)

ukb23159 is in BGEN format (.bgen, .bgi, .sample)

Both the bgen and the plink format files were created directly from the pVCF file, so they should not materially differ from each other. The one exception is that the bgen and pVCF files retain data on read depth and QC that the plink files discard.

Plink and REGENIE both prefer data files in plink format (either the older bed/bim/fam or the newer pgen/pvar/psam format). Both programs can read the .bgen format, but it is slower because they have to convert it internally to plink format.
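If the same bgen file will be read many times, converting it once up front may save time. The following is a minimal sketch that only builds and prints a plink2 command line; it does not run anything. The file names are hypothetical, and the `ref-first` allele mode is an assumption that should be checked against how the bgen file was actually written.

```python
import shlex


def bgen_to_pgen_cmd(bgen, sample, out_prefix, ref_mode="ref-first"):
    """Build a plink2 command that converts a BGEN file to pgen/pvar/psam.

    ref_mode is passed as the REF/ALT modifier of --bgen
    ('ref-first', 'ref-last', or 'ref-unknown'); 'ref-first' is an assumption.
    """
    return [
        "plink2",
        "--bgen", bgen, ref_mode,
        "--sample", sample,
        "--make-pgen",
        "--out", out_prefix,
    ]


# Hypothetical file names, for illustration only.
cmd = bgen_to_pgen_cmd("chr21.bgen", "chr21.sample", "chr21_converted")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell or handed to a job runner. Swapping `--make-pgen` for `--make-bed` would write the older bed/bim/fam format instead.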

If you are doing this to filter the data on QC or read depth, then you should make an intermediate script where you filter the bgen or pVCF file and save it out to a plink file for use in the actual analysis.
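One possible shape for such an intermediate script is sketched below. It only assembles the two command lines (a bcftools filter step, then a plink2 conversion) and prints them. The file names, the depth threshold of 10, and the assumption that the pVCF carries a per-sample `DP` field are all illustrative, not taken from the UK Biobank documentation.

```python
import shlex


def filter_then_convert_cmds(pvcf, out_prefix, min_depth=10):
    """Build commands to mask low-depth genotypes, then convert to plink format.

    min_depth is a hypothetical threshold; FMT/DP is assumed to exist in the pVCF.
    """
    filtered = f"{out_prefix}.filtered.vcf.gz"
    # Genotypes matching the exclude expression are set to missing (-S .).
    filter_cmd = [
        "bcftools", "filter",
        "-e", f"FMT/DP<{min_depth}",
        "-S", ".",
        "-Oz", "-o", filtered,
        pvcf,
    ]
    # Convert the filtered VCF to pgen/pvar/psam for the actual analysis.
    convert_cmd = ["plink2", "--vcf", filtered, "--make-pgen", "--out", out_prefix]
    return [filter_cmd, convert_cmd]


# Hypothetical file names, for illustration only.
cmds = filter_then_convert_cmds("chr21.vcf.gz", "chr21_qc")
for c in cmds:
    print(shlex.join(c))
```

Using `--make-bed` in place of `--make-pgen` would produce bed/bim/fam files, though multiallelic sites in the pVCF may need to be split or removed first for that format.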

pjgreer commented 11 months ago

I haven't fully explored the bgen format or most of the tools that use it. You will need to look into the following programs to see if they will do any of the things you wish to achieve. If they do not, you may have to go back to the pVCF files and filter them using vcftools, samtools, or bcftools. You can then create your new plink files from there.

QCtool https://www.well.ox.ac.uk/~gav/qctool_v2/

or BGENIX https://enkre.net/cgi-bin/code/bgen/doc/trunk/doc/wiki/bgenix.md

All of these tools are already included in Swiss-Army-Knife