projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Plink demo #458

Closed by dberma15 5 months ago

dberma15 commented 2 years ago

Hi,

Is there a plink demo? I'm looking at the sample notebook provided on the documentation page, but I'm not seeing anything for loading in and displaying a plink file.

Thanks.

williambrandler commented 2 years ago

hey @dberma15 plink binary ped files can be read with Glow. What is your use case for plink files and is the data in any other format (vcf / bgen)?

Cheers

dberma15 commented 2 years ago

Hi @williambrandler, Right now though, I'm just trying to get a demo working. I've found four .bed files on Databricks:

/databricks-datasets/genomics/grch37/snpEff/examples/intervals.bed
/databricks-datasets/genomics/grch37/snpEff/examples/my_annotations.bed
/databricks-datasets/genomics/grch38/snpEff/examples/intervals.bed
/databricks-datasets/genomics/grch38/snpEff/examples/my_annotations.bed

Each time I try to run the following code:

df = spark.read.format("plink").load("{prefix}.bed".format(prefix=path))
display(df.limit(10))

I get the following error:

FileReadException: Error while reading file dbfs:/databricks-datasets/genomics/grch37/snpEff/examples/intervals.bed. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
Caused by: FileNotFoundException: No such file or directory: s3a://databricks-datasets-oregon/genomics/grch37/snpEff/examples/intervals.fam

Meanwhile, I do not get an error if I run:

vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
df = spark.read.format("vcf").load(vcf_path)
display(df.limit(10))

williambrandler commented 2 years ago

ah, I believe those SnpEff input files are in Browser Extensible Data (BED) format, not plink binary PED (BED) format, which awkwardly has the same suffix.
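Since the suffix alone can't tell the two apart, one way to check is to look at the first bytes of the file. This is a rough sketch in plain Python, not part of Glow: `is_plink_bed` is a hypothetical helper name, and it assumes the file is reachable through the local filesystem (on Databricks that would typically mean a `/dbfs/...` path). It relies on the two magic bytes that the PLINK file format documentation specifies for binary .bed files.

```python
# First two bytes of a PLINK 1 binary .bed file (0x6c, 0x1b),
# as described in the PLINK file format documentation.
PLINK_BED_MAGIC = bytes([0x6C, 0x1B])


def is_plink_bed(path):
    """Return True if the file starts with the PLINK binary .bed magic bytes.

    Hypothetical helper for illustration; a False result only means
    "not PLINK binary", it does not prove the file is a valid
    Browser Extensible Data file.
    """
    with open(path, "rb") as handle:
        header = handle.read(2)
    return header == PLINK_BED_MAGIC
```

A text-based Browser Extensible Data file normally starts with a chromosome name or a header line, so it should come back as `False` here.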

You can read Browser Extensible Data (BED) format as a tab-delimited CSV file. plink binary PED (BED) format expects an associated .fam (and .bim) file; to learn more, see the plink docs.
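To make both points concrete, here is a minimal sketch using only the Python standard library. The function names `read_browser_bed` and `missing_plink_companions` are made up for illustration, and the code assumes local file paths (for example under `/dbfs/` on Databricks). The first function treats a Browser Extensible Data file as tab-delimited text and keeps the three required columns. The second lists which .bim/.fam files are absent next to a plink .bed file, which is the situation behind the FileNotFoundException above.

```python
import csv
from pathlib import Path


def read_browser_bed(path):
    """Parse a Browser Extensible Data file as tab-delimited text."""
    rows = []
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            # Skip blank or short lines and optional header/comment lines.
            if len(fields) < 3 or fields[0].startswith(("#", "track", "browser")):
                continue
            rows.append({
                "chrom": fields[0],
                "chromStart": int(fields[1]),
                "chromEnd": int(fields[2]),
                "extra": fields[3:],  # optional columns, left unparsed
            })
    return rows


def missing_plink_companions(bed_path):
    """Return the .bim/.fam paths that do not exist next to a plink .bed file."""
    bed = Path(bed_path)
    candidates = [bed.with_suffix(ext) for ext in (".bim", ".fam")]
    return [str(p) for p in candidates if not p.exists()]
```

In Spark itself, the tab-delimited read would more likely go through the built-in CSV reader with a tab separator, e.g. `spark.read.csv(path, sep="\t")`. If `missing_plink_companions` returns a non-empty list, the plink reader has nothing to pair the .bed file with.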

dberma15 commented 2 years ago

@williambrandler that would explain it. I'll try to find some plink files to try this out with and let you know how it goes.