iterative / dvc

đŸĻ‰ ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

DBFS (Databricks) Support #4336

Open Guichaguri opened 4 years ago

Guichaguri commented 4 years ago

Databricks has its own file system (DBFS).

I've read a bit of the DVC code and I think I can work on a PR adding support for it by using its REST API. Is there any reason not to work on it?

efiop commented 4 years ago

Hi @Guichaguri !

I'm not familiar with DBFS, but I would double-check whether it supports one of the protocols we already handle (e.g. S3, SSH, HDFS, etc.). From a quick look at the docs, it seems to be backed by S3 internally, but I can't see any common API we could reuse from the list we already support, so it looks like a new remote type would indeed be necessary.

Do you use databricks yourself, or do you just want to contribute the support for it?

majidaldo commented 4 years ago

Well they have their dbutils.fs in Python. I think that would be a bit more 'native'. I'm using databricks.
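
For illustration, dbutils.fs calls from a notebook look roughly like this (the mount path is made up; dbutils is only available as a builtin inside Databricks notebooks):

# `dbutils` is a builtin object in Databricks notebooks, no import needed
dbutils.fs.ls("dbfs:/mnt/raw-data/")  # list a (hypothetical) mounted location
dbutils.fs.cp("dbfs:/mnt/raw-data/data.csv", "file:/tmp/data.csv")  # copy DBFS -> local driver disk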

omrihar commented 3 years ago

Hello, is there any update regarding this issue? I'm also using databricks and would like to try using dvc together with it.

Thanks, Omri

skshetry commented 3 years ago

@omrihar, unfortunately, no one has worked on it yet. We are more than happy to accept the contribution for the feature.

majidaldo commented 3 years ago

Besides the mapping, I don't know how this would work when using databricks notebooks; there is no git integration.

omrihar commented 3 years ago

Unfortunately I'm not sure I am able to add the databricks support - I'm not well versed enough in databricks and dvc for that.

@majidaldo I was thinking of using databricks DBFS for data storage (as S3 is used). You are correct that the current git integration is very lacking... However it still may be possible to use the dvc API to get the files from a databricks notebook with data lying somewhere in dbfs, maybe?

I'm quite new to databricks and am not yet 100% sure what's the best strategy to deal with the various aspects of solid, full-blown ML workflow on databricks.

majidaldo commented 3 years ago

> Unfortunately I'm not sure I am able to add the databricks support - I'm not well versed enough in databricks and dvc for that.
>
> @majidaldo I was thinking of using databricks DBFS for data storage (as S3 is used). You are correct that the current git integration is very lacking... However it still may be possible to use the dvc API to get the files from a databricks notebook with data lying somewhere in dbfs, maybe?
>
> I'm quite new to databricks and am not yet 100% sure what's the best strategy to deal with the various aspects of solid, full-blown ML workflow on databricks.

If you're just getting and uploading files from dbfs, dvc isn't adding value here. The dbfs api and dbfs mount is pretty straightforward to use. There is the ugly hack of copying your git repo to somewhere that databricks can see.

Databricks wants you to use MLFlow. After using Databricks for a while, I almost despise any platform that throws me into notebook-land.

isidentical commented 3 years ago

We are rapidly moving towards fsspec, which comes bundled with a Databricks implementation. So as long as nothing about DBFS contradicts DVC's filesystem-related assumptions, DVC might start supporting it after the full adoption (1-2 months): https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/dbfs.py
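
For reference, a rough sketch of how that fsspec implementation can be used on its own (the workspace host and token below are placeholders):

import fsspec

# the "dbfs" protocol maps to fsspec.implementations.dbfs.DatabricksFileSystem,
# which talks to the DBFS REST API
fs = fsspec.filesystem(
    "dbfs",
    instance="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
    token="<personal-access-token>",                        # placeholder PAT
)
print(fs.ls("/"))  # list the DBFS root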

smartkiwi commented 2 years ago

I'd like to resurrect this issue. Can we clarify what kind of DBFS support for DVC we are talking about - starting with requirements and use cases.

I can think of several.

I'm considering how to start using Databricks for ML project workflows, so the second requirement is the important one for me. I care much less about where DVC stores the blobs, as long as they are accessible both from within the Databricks environment and from outside, e.g. a developer's laptop or a container run by a CI pipeline. Such storage can be S3.

For now I'm thinking about using an EC2 instance with the S3 bucket behind DBFS mounted, and checking out multiple versions of code and datasets into separate folders. So for Databricks tools the version information will be lost - they would just see multiple folders with files.

I'm also going to try using the Databricks Repos feature with a git repo that has DVC set up in it - and run DVC commands after git checkout and before git commit.

edmondop commented 1 year ago

Was this resurrected?

dberenbaum commented 1 year ago

@edmondop How are you trying to use DVC with DBFS?

edmondop commented 1 year ago

@dberenbaum the idea is to put some versioned Excel files in DVC (the input data source really is Excel) and have a job that clones the code and ingests it as a Delta table on Databricks. DBFS is not the concern here; the concern is how to version the input of the job.

dberenbaum commented 1 year ago

Are you hitting an issue trying to do that, or are you looking for general advice on what the workflow should be?

edmondop commented 1 year ago

Should I install DVC in a notebook and perform a dvc pull from the notebook? That would probably work

dberenbaum commented 1 year ago

Sorry, I don't have handy access to Databricks to test, but you could try to set up your git repo on there and try it.

You might also want to configure your cache directory to be on DBFS or some mounted storage since the default may be limited and unreliable.

tibor-mach commented 1 year ago

@dberenbaum This is actually very interesting for me as well, since we work with Databricks quite a lot (though not exclusively) and an increasing number of our clients are interested in data traceability and versioning.

Specifically, I am looking for a way to work inside a Databricks notebook. It could be interactively, as a data scientist might, or programmatically in case those notebooks are run in a pipeline. By the way, Databricks notebooks are really just .py files with some sugar - they are not the same as Jupyter notebooks - so they are a bit more production- and versioning-friendly, and using them in production is a real possibility.

Either way, that is probably of little import, as the biggest issue seems to be the way Databricks interacts with the remote git versioning. You can set up what is called "Databricks repos" which basically allows Databricks to clone a repository from GitHub/GitLab/whatever to your Databricks environment and provides a simplified UI which can be used to version changes users make in their local copy (local = in Databricks environment in this case).

Basically, it could be seen as a VM which serves as your "local" development environment where you cloned the repo...except that it can't be, because that is not how the repos are handled by Databricks. When you run Databricks notebooks (or .py scripts, it doesn't matter) in the Databricks environment, you run them on a Databricks cluster, or rather its driver node.

Databricks exposes repo files programmatically from inside the notebooks, so you are able to use Python to read, write and modify the files which are notionally in the repo. If you use the GUI, you can then commit and push the changes. Note that you can only commit AND push at the same time - there are no local commits, because all the actual code versioning with git is hidden from the users for some reason. If you try to use the git Python API to get info about the repository from inside Databricks like this

import git
# search upward from the working directory for a .git directory
repo = git.Repo(search_parent_directories=True)

you get the InvalidGitRepositoryError.

Running a git shell command (git status for instance) from inside the notebook (there is a %sh magic which allows you to do that) has the same effect, returning

fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

To be honest, I am not quite sure how code versioning is handled by Databricks under the hood. It restricts you to using their GUI, which limits what you can do, and to access info about the repository from inside the notebooks you have to resort to workarounds. You need to use their dbutils library, which is really just an API for a toolset written in Java, and there is basically no public documentation for any of the more advanced operations.

I am not sure why Databricks decided to implement versioning inside their environment like this, but as it stands I think it is in fact quite difficult to combine Databricks with DVC in a meaningful way. A shame, really, since if nothing else Databricks is a great platform for processing big data (it is, among other things, a managed Spark environment), and these big data pipelines can often benefit from data versioning (especially in large corporate environments where traceability and certification are important). I'd love to use the DVCFileSystem and the new DVC cloud versioning feature for exactly these types of use cases, but unfortunately that clashes with the way Databricks handles code versioning.

I guess I might actually ask them (Databricks devs) about this as well.

cc @edmondop - this might be interesting

and I guess also @shcheklein - you were asking me a few months ago about any feature requests I might have for Iterative; this would be a big one. But I am not sure if Iterative can actually help with this - the way I understand it, the issue seems to be on the Databricks side. At best a feature request from you might help, or perhaps they could help you find a workaround for the integration...that is, of course, if you find working on something like this worthwhile.

tibor-mach commented 1 year ago

Come to think of it, I guess there is a workaround that doesn't do everything, but can at least make it partially work together:

  1. You set up a DVC data registry and you version any raw data that you want to use in your Databricks workflows. It is probably a good idea to use DVC Cloud versioning for this as well, since it will allow you to see the data in a human-readable way on the storage.
  2. As a part of your pipeline on Databricks (or in a config yaml, or anything really) you specify the dataset from the data registry and the version of it that you want to use in the pipeline. Then you just use DVCFileSystem from the DVC Python API to access the data from the registry (see the sketch after this list).
  3. You will probably also need to set up credentials to access the storage where the datasets are actually stored, but Databricks has built-in secrets management you can take advantage of for that.
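
A rough sketch of step 2, assuming a hypothetical registry repo URL, version tag and dataset path (DVCFileSystem is part of dvc.api):

from dvc.api import DVCFileSystem

# hypothetical data registry repo and dataset version tag
fs = DVCFileSystem("https://github.com/example-org/data-registry", rev="sales-v1.0")

# stream a single tracked file...
with fs.open("datasets/sales/raw.csv") as f:
    header = f.readline()

# ...or download a whole tracked directory to local/driver disk
fs.get("datasets/sales", "/tmp/sales", recursive=True)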

Unfortunately, this won't allow you to store references to the data directly with the code for your ML training/big data pipelines/whatever, but it will at least allow you to be sure the pipelines always run using a clearly specified and versioned dataset as an input.

cc @edmondop

By the way, let me know @dberenbaum if you find this interesting, I probably want to try this approach out anyway and I might as well write a short blog post about it while I'm at it...

dberenbaum commented 1 year ago

Thanks as always @tibor-mach!

I have been a Databricks user before (including Databricks repos) but am out of practice, so I appreciate the refresher. Your idea to use DVC purely as a data registry for consumption through the DVCFileSystem makes sense as a starting point.

> Specifically, I am looking for a way to work inside a Databricks notebook. It could be interactively, as a data scientist might, or programmatically in case those notebooks are run in a pipeline. By the way, Databricks notebooks are really just .py files with some sugar - they are not the same as Jupyter notebooks - so they are a bit more production- and versioning-friendly, and using them in production is a real possibility.

I think running Databricks notebooks as jobs is another potential integration point. It should be possible to run them using the jobs CLI in a DVC pipeline.

Interactively running DVC pipelines or experiments from within a notebook might be possible, but how useful would you find it? As you mention, Databricks is especially useful for large Spark data processing jobs, and I think DVC is often (but not always) more useful downstream from these jobs. Caching and versioning all of the data for a large Spark job may not be realistic or desirable, but it may be useful to have a "no-cache" DVC pipeline that can provide traceability about changes to the dependencies and outputs in the pipeline, even if it can't recover past datasets.

edmondop commented 1 year ago

@tibor-mach thanks for following up! @dberenbaum the idea is that you have reference tables that you want to store under version control and version them with your code. These are typically small tables, so you don't need Spark, but they are not small enough to be stored in Git.

dberenbaum commented 1 year ago

Great, thanks @edmondop!

@tibor-mach Do you plan to give this a shot with the data registry + dvcfs idea you proposed?

tibor-mach commented 1 year ago

@dberenbaum Sure, I intend to test it either tomorrow or next week, depending on how much time I have.

Triggering Databricks jobs as a part of a dvc pipeline is actually a very interesting idea as well for some use cases. I suppose that with a combination of that and the external data feature of DVC, you should actually be able to run arbitrary pipelines in Databricks while taking full advantage of DVC and even dvc pipelines (which might be very relevant, because the ability to skip some pipeline steps when nothing has changed since the previous run can save a lot of costs in scenarios where you actually need this big data wrangling as part of your ML training pipeline).

So my plan is basically to try the first approach first and then also experiment with jobs + external data management. I'll see how that goes. Since this would address a few things related to our current projects at my job, I think I should be able to find some time for both.

dberenbaum commented 1 year ago

Sounds great @tibor-mach!

> Triggering Databricks jobs as a part of a dvc pipeline is actually a very interesting idea as well for some use cases. I suppose that with a combination of that and the external data feature of DVC, you should actually be able to run arbitrary pipelines in Databricks while taking full advantage of DVC and even dvc pipelines (which might be very relevant, because the ability to skip some pipeline steps when nothing has changed since the previous run can save a lot of costs in scenarios where you actually need this big data wrangling as part of your ML training pipeline).

👍 The external data part is where it gets a bit awkward IMO because it is both clunky to set up and doesn't fit well into the typical DVC workflow. Without each team member having their own copy of the dataset, lots of problems can arise, like accidentally overwriting each other's work. However, I think for the use case of simply skipping steps in case nothing has changed (which I agree is probably the most useful part of this scenario), it may make sense to set cache: false for the external data sources, which simplifies the workflow and avoids other problems like storing backups of every copy of potentially massive datasets.

tibor-mach commented 1 year ago

@dberenbaum It seems that in the newest version of Databricks, Databricks Repos actually work like git repositories. However, I would still need to pass credentials (say, the storage account name and key on Azure) to DVCFileSystem (or to dvc.api.open(), which I guess uses DVCFileSystem in the background)...is there a way to do that explicitly?

I assume that normally this information is looked up in .dvc/config.local. But since the "local" here is on Databricks, it is probably not a good idea to explicitly add such secrets there. I would prefer to use a Databricks secret scope to store the credentials and then explicitly pass them to dvc.api when needed.

dberenbaum commented 1 year ago

@tibor-mach Sadly we don't have a way to do that yet (see https://github.com/iterative/dvc/issues/9154) ☚ī¸ . If you want to chime in there, it would help to have all the use cases for it in one place for prioritization.

tibor-mach commented 1 year ago

@dberenbaum I see. I guess it will be difficult to go forward with the way I wanted to test the integration with Databricks (though I think I can still try out the option with jobs you suggested).

I will link this to #9154 , thanks.

shcheklein commented 1 year ago

Good discussion. @tibor-mach thanks for the input. I've installed the environment on Azure to play with it a bit.

> especially in large corporate environments where traceability and certification are important

Could you clarify this? I have some idea, but it would be great to hear your thoughts.

> You set up a DVC data registry and you version any raw data

Do we pull it then before running anything? Does it mean that it's just some data in files that you control, not, let's say, DeltaLake, etc.?

tibor-mach commented 1 year ago

@shcheklein

In a lot of companies, particularly in engineering, ML models are used as one of the tools in the development of new machines. When you are talking about things like trains or airplanes (even cars, but there it is slightly different), there is a certification process they have to go through, and as a part of that they need all data used during development to be clearly traceable and reproducible. This is also why tools like DVC are interesting for these companies. But at the same time...

DeltaLake does versioning, but in a way that does not lead to clear reproducibility in the context of ML. To me, DeltaLake table versioning is great for instance when you accidentally overwrite a critical table that is computed in a large batch job that runs once a month...then you can simply revert to the previous version. But it is not quite as handy as a tool for ensuring reproducibility of ML models. DVC is great in this regard in that it really provides a GitOps approach, so in the end git is the only source of truth and (at least as long as you also use containers for training) you can talk about full reproducibility. That is an ironclad argument for any certification process which requires you to trace back and be able to reproduce all the outcome data (some of which is generated with ML) that you use as the basis of a certification of, say, a new train motor.

> do we pull it then before running anything?

I guess so. It is actually one of the things I would like to figure out how to do better with DVC - not having to actually pull anything while still versioning the data. If we are talking about working with either big or somehow restricted data (often the case with things like airplane turbines and the like!), then pulling it locally might not be an option.

> does it mean that it's just some data in files that you control, not, let's say, DeltaLake, etc.?

I guess the best-case scenario would be to be able to use DVC for versioning of DeltaLake (or even simple parquet) files stored in a data lake without the need to pull them for processing. So for instance, if you run any actual ML on Databricks you could still use a GitOps approach for the entire data processing and ML (this to me is one of the biggest strengths of DVC and one of the biggest weaknesses of Databricks).

If you just run your ML training in a container on an EC2 instance (for instance), then this is not that big an issue, I guess, since you can always use CML for that and keep all data secure and out of reach of any user's local machine (it is still a problem with data which is not allowed in the cloud, but then you can simply use DVC with self-hosted on-prem runners).

But Databricks is also quite handy as a tool for data exploration and prototyping (basically using it like you would use jupyter lab with the added benefit of having access to managed spark). There, what I would love to see is the ability to have data versioned via git so that if I want to share a project with a colleague, I just tell them a name of a specific branch/tag/commit in the project repo and rest assured that they will be looking at the same data I do. You can do that with DVC already, but not in the context of the Databricks environment and so I am also trying to figure out how to do that best.

dberenbaum commented 1 year ago

I was able to do at least rudimentary work in a DVC repo using Databricks repos by configuring a few settings at the top of my notebook:

# Configure DVC to not use Git in Databricks
%sh dvc config --local core.no_scm true
# Set the cache dir to a DBFS so data isn't stored on the local node
%sh dvc cache dir --local /dbfs/dvc/cache

@tibor-mach Does it seem useful to you? Do you want to try working with those settings?

Edit: dropped this part as I realized the linking wasn't working on dbfs:

# Set the cache to link instead of copy
%sh dvc config --local cache.type hardlink,symlink

tibor-mach commented 1 year ago

@dberenbaum Thanks for the update. I will have a look at it, but now I will be away (and without a computer) until the end of next week. I'll let you know afterwards.

efiop commented 1 year ago

Thanks everyone for a great and detailed discussion, it was a pleasure going through it.

After reading all the points made here, discussing this with @dberenbaum and @shcheklein , and playing a little bit with databricks myself, it seems like we have a few points here:

Then we have different levels of functionality that one could want:

So far we've been mostly worried here about tracking large data in clouds/dbfs, but not caching smaller stuff locally. The latter should work as is already and we might only want to recommend creating a local remote or local cache (as Dave suggested above) on dbfs to be able to easily share the cache.

So in terms of action points, I think I'll try to get the dvcfs/get creds problems sorted first, so that one could at least use dvc more comfortably within Databricks etc. Let me know what you think.

tibor-mach commented 1 year ago

@dberenbaum I finally tried it out. Your workaround seems to solve one issue, but there is still the problem with authentication. If I want to use dvc.api.open to access data, I need the credentials for Azure/AWS/GCP/whatever.

I could of course simply copy them to the config.local file but the databricks repos are not truly local in the same sense my own personal computer is and so this is far from ideal (I bet no security department would allow this).

One way to solve this is to enable setting credentials (#9154 ) via the Python API. Then the credentials can be stored in a secret vault (be it the one from Databricks or one native to the specific cloud) and retrieved from there each time they are needed.

tibor-mach commented 1 year ago

Btw @efiop this is somewhat relevant. You can at least get info about the repo metadata this way and find out the URL of the original repository, the ...but it is on very shaky ground, as this is not documented anywhere and subject to change by Databricks at any time (although I guess it should hopefully at least stay fixed for a given Databricks runtime).

You get this as the content of repo_data:

{'id': <I dunno what the id stands for, actually...>,
 'path': '/Repos/<username>/<Databricks_Repos_folder>',
 'url': <URL of the actual git repository, e.g. https://github.com/tibor-mach/myrepo>,
 'provider': <name of the provider, e.g. GitHub>,
 'branch': <current branch you are on in the Databricks Repos>,
 'head_commit_id': <what it says...>}

Originally this also gave you the state of the working tree (if it is dirty), but then they removed that from the output json for some reason. Maybe there is a way to access it somehow, but given the lack of documentation you'd need to guess and maybe just do a grid search of all the methods of the various dbutils objects. And of course, this would be extremely shaky. I guess maybe contacting Databricks directly and asking them for an "official" way to do this might be the best approach.

efiop commented 1 year ago

For the record: https://github.com/iterative/dvc/issues/9154 (dvcfs support for creds passing through config) works now. Feel free to give it a try. I'll add a corresponding config option to our dvc.api methods (https://github.com/iterative/dvc/issues/9610), but dvcfs can be used in the meantime and is much more powerful as an added bonus 😉

EDIT: api read/open now also support config. Will add some good examples in https://github.com/iterative/dvc.org/issues/4627
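
A minimal sketch of what this could look like from a Databricks notebook, pulling the storage credential from a Databricks secret scope. The repo URL, remote settings, secret scope/key and paths below are all hypothetical, the exact config option names may differ from this sketch (see the linked issues), and dbutils is only available inside Databricks notebooks:

from dvc.api import DVCFileSystem

# fetch the storage credential from a Databricks secret scope (hypothetical names)
account_key = dbutils.secrets.get(scope="ml-secrets", key="azure-storage-key")

fs = DVCFileSystem(
    "https://github.com/example-org/data-registry",  # hypothetical DVC repo
    rev="main",
    # override remote settings in memory instead of writing them to .dvc/config.local
    remote_config={"account_name": "examplestorage", "account_key": account_key},
)
with fs.open("datasets/sales/raw.csv") as f:
    header = f.readline()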

tibor-mach commented 1 year ago

@efiop Cool, I will give it a try.

Also, Databricks is developing a Python SDK that might also be worth checking out and maybe a good tool for integrating with Databricks. But the development has only recently started (which might be a good opportunity to get in touch with the devs, as things are likely going to be less defined on their part at this point and they might be more receptive to feedback).
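
For what it's worth, a sketch of how a DVC pipeline stage (or any script) could trigger a Databricks job through that SDK and wait for it - the job id is a placeholder and the SDK API may still change:

from databricks.sdk import WorkspaceClient

# picks up auth from the environment (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN)
w = WorkspaceClient()

# trigger a pre-defined Databricks job and block until the run finishes;
# a dvc.yaml stage could wrap a script like this as its cmd
run = w.jobs.run_now(job_id=123456789).result()
print(run.state.result_state)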

dberenbaum commented 1 year ago
  • same as ^ but in pipelines, so that you could maybe use dvc to launch jobs and stuff and make it remember versions of the dependencies and outputs that it got, so you have a record of this codified in your git repo. This we don't support yet even for cloud versioning, but implementation is kinda straightforward and we can do that quickly.

Related discussion in #8411

efiop commented 1 year ago

For the record: if anyone is running into a "Function not implemented" error when trying to run CLI dvc in Databricks, we've fixed that in https://github.com/iterative/dvc-objects/pull/216 and new installations of dvc should automatically pick up that fix (if not, make sure dvc-objects is >=0.24.0).

efiop commented 1 year ago

For the record: we now have some basic info on databricks + dvc in our docs https://dvc.org/doc/user-guide/integrations/databricks