projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Running the demo notebook locally #304

Open olszewskip opened 3 years ago

olszewskip commented 3 years ago

Hi! I'd like to learn Glow, but I'm having trouble running the demo notebook (https://glow.readthedocs.io/en/latest/getting-started.html#demo-notebook) using Databricks Community Edition. Is there perhaps a way for me to obtain the data used in this notebook? I believe this includes:

vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
phenotype_path = "/databricks-datasets/genomics/1000G/phenotypes.normalized"
sample_info_path = "/databricks-datasets/genomics/1000G/samples/populations_1000_genomes_samples.csv"

I'm guessing these are public datasets, but being new to both Databricks and Glow, I don't know how to download them. Sorry if this is not the right place to ask for something like this. Many thanks for any suggestions!

karenfeng commented 3 years ago

Hi @olszewskip! Thanks for reaching out.

These are public datasets that should be accessible from Community Edition. Could you provide more detail on how you're trying to access them? You may need to prefix the paths with dbfs: or /dbfs; see the Databricks DBFS documentation for more detail. Our Community Edition documentation may also be a useful resource.
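For illustration, here is a minimal sketch of the two prefix styles, using the phenotype path from the original post (the dbfs: scheme is for Spark APIs; the /dbfs mount is for local file APIs such as pandas):

import pandas as pd

# Spark APIs address DBFS through the dbfs: scheme
pheno_spark = spark.read.parquet("dbfs:/databricks-datasets/genomics/1000G/phenotypes.normalized")

# Local file APIs like pandas go through the /dbfs FUSE mount instead
# (this assumes the dataset is readable as parquet, per the files found later in this thread)
pheno_pd = pd.read_parquet("/dbfs/databricks-datasets/genomics/1000G/phenotypes.normalized")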

Let me know if you have any further questions!

olszewskip commented 3 years ago

Thanks so much @karenfeng!

That is very helpful: before reading your message, I was just typing or copy-pasting Python code into a blank notebook attached to a Genomics Runtime cluster, and I didn't know about things like the refGenomeId environment variable. I didn't get far down that route because the notebook hung on the first call to vcf_view.[....].save(delta_path), with no error message and zero Spark jobs completed. I'm not sure what went wrong there. So I went ahead and installed Glow locally, hence the question about downloading the data to my local machine.

Following the documentation you reference, everything seems to work fine with the glow-demo notebooks. Still, being able to download the data that those notebooks use seems like a useful skill. How would I go about downloading, say, the phenotypes? I've successfully tried calling the following in a Python cell in a Databricks notebook:

dbutils.fs.cp("/databricks-datasets/genomics/1000G/phenotypes.normalized", "/FileStore/phenotypes.normalized", recurse=True)

and then

dbutils.fs.ls("/FileStore/phenotypes.normalized")

gives me a list of files, including two .parquet files.

I can then point my web browser to https://community.cloud.databricks.com/files/phenotypes.normalized/part-00000-tid-8050314092330887345-07adfaa4-b2de-4f64-9af7-d84ddcee2a4d-136386-c000.snappy.parquet?o=<...> (where <...> is 16 digits that I've copied from the web address of the notebook), which opens a download pop-up for that particular file.

So I guess my question is: is there perhaps a more convenient way to download the sample data than exploring the directory structure through dbutils calls in a notebook and going through the files one by one like that? The best I've come up with so far is scripting the enumeration, as sketched below.
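Here is that sketch (list_files is a hypothetical helper of mine, not part of dbutils, and the /FileStore destination is just the one used above):

# Copy the dataset under /FileStore, then walk it with dbutils.fs.ls;
# every file printed should be downloadable at
# https://community.cloud.databricks.com/files/<path-under-FileStore>?o=<workspace-id>
def list_files(path):
    for info in dbutils.fs.ls(path):
        if info.isDir():
            list_files(info.path)  # recurse into subdirectories
        else:
            print(info.path)

dbutils.fs.cp("/databricks-datasets/genomics/1000G/phenotypes.normalized",
              "/FileStore/phenotypes.normalized", recurse=True)
list_files("/FileStore/phenotypes.normalized")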

[Edit] Actually, if I may diverge from the topic slightly: I'm having a problem with paths in the GloWGR notebook:

pd.read_csv("/dbfs/databricks-datasets/genomics/gwas/Ysim_test_simulation.cs")

errors with a FileNotFoundError; the same happens with

pd.read_csv("/databricks-datasets/genomics/gwas/Ysim_test_simulation.cs")


karenfeng commented 3 years ago

Hi @olszewskip, for your original question: how do you want to explore the original data, and is there a particular reason why you'd want to download it rather than viewing it within the public datasets?

Your second question is a bit easier: I think you just forgot the trailing v in the path. The following works for me:

pd.read_csv("/dbfs/databricks-datasets/genomics/gwas/Ysim_test_simulation.csv")