iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

api: high level dvc.api import/export meets data catalog #2719

Closed · Suor closed this issue 4 years ago

Suor commented 4 years ago

- what do you do for a living?
- oh, different things, import, export, family business
(any Hollywood smuggler and now a dvc dev)

This feature was first proposed by @dmpetrov and has been discussed several times here and there. Recent conversations/debates with @dmpetrov and @shcheklein resulted in a more or less defined approach, which I will lay out here.

The need: we want things to be more easily reused across repos, so we want some higher-level API than just get()/read()/open() - presumably some call that will return a ready-to-use dataframe, an instantiated model, etc.

The design

So here goes the design so far. The exporting side will need to list the available things and write some instantiating glue code. Like:

# dvc-artifacts.yml
artifacts:
    - name: face_rec_model
      description: Tensorflow face recognition model 
      call: export.make_face_rec
      deps: ["data/face_rec_model.dat", "data/extra.dat"]
# export.py
import tflearn

def make_face_rec(face_rec_model_data, extra_data):
    # ... construct a model
    return model

Then the importing side might simply do:

import dvc.api

model = dvc.api.summon("face_rec_model", "https://path/to/repo", rev=...)
model.predict(...)

Summon is a good (to my liking) and short name that refers to summoning a genie or a tesla :)

This will provide a flexible base to export/summon anything. Some questions arise about requirements, repetitive glue code, Python/R interoperability, and easing things for a publisher. I will go through these below.

Requirements

Instantiating a model or a dataframe will require some libraries to be installed, like pandas or PyTorch, possibly even particular versions of them. In the basic scenario we might ignore that - in most cases a user will know what this is about and will have them installed already. The thing is, we can't provide it seamlessly, since installing Python libs inside a dvc.api.summon() call would be surprising and might break things. So deps management will be separate anyway. A possible way:

artifacts:
    - name: some_dataframe
      ...
      reqs: 
        - pandas>=1.2.3,<2
        - otherlib>=1.2

$ dvc summon-reqs <repo-url> <name>
pandas>=1.2.3,<2
otherlib>=1.2
$ dvc summon-reqs <repo-url> <name> | xargs pip install

Glue code

There are many common scenarios, like exporting a CSV file as a dataframe, which will produce repetitive glue code. This could be handled by providing common export functions in our dvcx (for "dvc export") library:

artifacts:
    - name: some_dataframe
      call: dvcx.csv_to_df
      deps: ["path/to/some.csv"] 
      reqs: 
        - dvcx
        - pandas

In this particular scenario one might even use pandas.read_csv directly. There should be a way to parametrize those calls:

artifacts:
    - name: some_dataframe
      call: pandas.read_csv
      args:
        sep: ;
        true_values: ["yes", "oui", "ja"]
      deps: ["path/to/some.csv"] 
      reqs: 
        - dvcx
        - pandas

Note that deps are passed as positional arguments here.
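
For illustration, here is a rough sketch (not an agreed implementation) of how the summoning side could dispatch "call", assuming deps are already resolved to local paths and the workspace dir is on sys.path for repo-local modules:

# Hypothetical sketch: deps become positional args, "args" become kwargs.
import importlib
import os

def invoke_artifact(spec, workspace):
    module_name, _, func_name = spec["call"].rpartition(".")
    func = getattr(importlib.import_module(module_name), func_name)
    deps = [os.path.join(workspace, d) for d in spec.get("deps", [])]
    return func(*deps, **spec.get("args", {}))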

Python/R interoperability

This is a big concern for @dmpetrov. It will mostly work for simpler cases like dataframes; there is probably no practical way to make PyTorch models usable in R, though there might be for simpler models like linear regression. Anyway, to solve those I suggest providing a dvcx library for R, sharing function and param names, so that the same yaml file will work for R too. To my mind this is a 2nd or even 3rd layer on top of the basic summon functionality, so it might be added later.

Making it easier for a publisher

Raised by @shcheklein. All these additional yaml/python files might stop someone from publishing things. I don't see this as an issue since:

So the "plain text alternative" - one might just write somewhere in a readme - I have this and this things and provide snippets like:

import dvc.api
import pandas

with dvc.api.open("cities.csv", repo="http://path/to/repo") as f:
    cities = pandas.read_csv(f)

This is easy to use, does not require any new concepts and any additional effort from our side.

What's next?

In my opinion, the basic summon functionality could be implemented in the near future, as it has a high value/effort ratio and a more or less clear design.

Still all of this was only discussed in my 1-1s with @dmpetrov and @shcheklein, so some input from the rest of the team would be highly desirable. So @efiop @mroutis @pared @jorgeorpinel please join.

shcheklein commented 4 years ago

Good one, @Suor . Thanks!

Just to start the discussion, the first question I would clarify is why DVC-files are not enough and we need to specify (effectively) types and a list of exports in a new file.

And a related one: if the DVC-file is enough, can we make the regular existing dvc.api aware of types, and not introduce new APIs?

Suor commented 4 years ago

@shcheklein in the general case an exported object might depend on several outs from different .dvc files. Even if we limit ourselves to a single .dvc file, I don't like complicating stage files. We also start something representing a catalog with that yaml file.

Existing api calls return strings or file handles; if they suddenly started to return something else, that would be crazy.

shcheklein commented 4 years ago

(DISCLAIMER: all the stuff below is not an attempt to say that @Suor's approach is not right and that we should go with DVC-files alone; please read this as an attempt to challenge some assumptions, get a bit more examples, and understand better the decisions we make)

in a general case

Would love to have real-life examples before we make this decision. I would not overgeneralize until it's necessary. @dmpetrov you should have some from the model zoo experiment, and maybe from some other experiments? Would be great to see more use cases here, more examples of what we are trying to solve, etc.

We also start something representing catalog with that yaml file.

Not sure I get that point? Do you mean something that improves discoverability? Gives way to a higher-level dvc list interface or something? Again, not sure why a DVC-file is not enough here (granted, with some additional changes).

Existing api calls return strings or file handles if they suddenly start to return something else that would be crazy.

That's true. I didn't mean reusing the same API calls literally. Some modification (or maybe even another call) might be required. Probably this is secondary to the first question - what is the best way to export artifacts.

Suor commented 4 years ago

I will let someone else do the input )

ghost commented 4 years ago

Looks like a good API design) I still haven't bought into the idea that this is needed - might be a lack of understanding of the problem on my side. However, I'm down for trying new features.

Some questions:

Cool stuff, @Suor 👍

Suor commented 4 years ago

With the current design, is it possible to retrieve a specific version of such an artifact?

The rev arg is still there to refer to any git revision, same as in the other api calls.

When summoning something, what would DVC do? (i.e. What would be the implementation for the current design?)

With the example yaml file:

# dvc-artifacts.yml
artifacts:
    - name: face_rec_model
      description: Tensorflow face recognition model 
      call: export.make_face_rec
      deps: ["data/face_rec_model.dat", "data/extra.dat"]

dvc.api.summon("face_rec_model", repo="<repo-url>", rev="<rev>") will:

  1. Clone the repo-url into a tmp dir, check out rev.
  2. Read dvc-artifacts.yml and find the face_rec_model entry there.
  3. Download and checkout all the deps ("data/face_rec_model.dat" and "data/extra.dat") into that tmp dir.
  4. Run something like:

    sys.path.insert(0, tmp_dir)

    # the module name and the arguments below are extracted from "call" and "deps"
    import export
    artifact = export.make_face_rec("data/face_rec_model.dat", "data/extra.dat")

    sys.path.pop(0)

    return artifact

ghost commented 4 years ago

  1. Clone the repo-url into a tmp dir, check out rev.
  2. Read dvc-artifacts.yml and find the face_rec_model entry there.

Wouldn't it be "read dvc-artifacts.yaml" and then check out the revision? Checking out before reading dvc-artifacts.yaml might give us a working directory where dvc-artifacts.yaml doesn't exist.

echo "v1" > foo
dvc add foo
git add foo.dvc
git commit -m "add foo v1"

echo "v2" > foo
dvc add foo
git add foo.dvc
git commit -m "add foo v2"

touch dvc-artifacts.yaml
git add dvc-artifacts.yaml
git commit -m "add artifacts"

If you want to access foo v1, checking out before reading dvc-artifacts.yaml will not work.


The overlap between args and deps is a little bit confusing

Suor commented 4 years ago

Wouldn't it be "read dvc-artifacts.yaml" and then check out the revision?

The dvc-artifacts.yml resides in the source repo, so you need to check that out first. The rev is specified in the api call, not in dvc-artifacts.yml, so a single name there points to all available git revisions, in a way.

Checking out before reading dvc-artifacts.yaml might give us a working directory where dvc-artifacts.yaml doesn't exist.

Yes, this is the issue. But if you don't have dvc-artifacts.yml, or don't have the specific artifact referenced there at a specific revision, then the instantiating code (export.py) won't be there either, and you won't know how to instantiate the thing.

It is possible to separate dvc-artifacts.yml from the dvc data repo somehow:

# dvc-artifacts.yml in https://github.com/company/dvc-catalog
artifacts:
    - name: face_rec_model
      description: Tensorflow face recognition model 
      repo: https://github.com/company/face-rec
      rev: 1.1 # Or should we allow specifying this in a .summon() call
      call: export.make_face_rec  # refers to export.py in `dvc-catalog` repo
      deps: "data/face_rec_model.dat" # refers to a file managed by `face-rec` repo

    - name: cats_dogs_model
      description: Tensorflow cats vs dogs classification model
      repo: https://github.com/company/cats-dogs  # other repo
      rev: v2
      call: ...
      deps: ...

This way you can reference artifacts from external sources and wrap them into a summonable form. This could be a future extension. It does, however, imply at least two source repos (catalog and data) and might be overkill for many use cases.

Lots of questions arise here too, like distinguishing source and catalog revs, code refs in call, and overall usability. So let's postpone this.

This extension possibility is, however, another argument for separating artifact declaration files from stage files.

Suor commented 4 years ago

The overlap between args and deps is a little bit confusing

It is. I haven't come up with anything better for now.

dmpetrov commented 4 years ago

@Suor thank you for initiating the discussion! This is a really great direction for DVC.

In this issue, only the data reuse/getting part is discussed, which is the most important part from the implementation point of view. I think it makes sense to discuss all the data catalog scenarios here to make better requirements for this feature.

But first, what makes a set of files a data catalog:

  1. Description: just text
  2. Attributes: data summary, metrics, other attributes (license, author, input description, etc)
  3. Data types

Data catalog scenarios

Example

A special use case: model zoo.

Ideally, we should be able to create a repository with many models like https://github.com/cadene/pretrained-models.pytorch with all the information available through DVC API.

So, instead of reading a text README, we can use a programmatic API:

Analogs

https://intake.readthedocs.io/en/latest/

Interoperability and supported types

@Suor proposes implementing everything through Python, which limits interoperability.

I'd suggest creating a more universal approach with different types/plugins: there are dozens of data frame types - each can be used from any language. See https://intake.readthedocs.io/en/latest/plugin-directory.html.

Python can be one of the supported types/plugins and all the ML models (like PyTorch) will be used through this type. So, we won't block future interoperability for common types.

Initially, only two types are needed: CSV and Python (which covers the rest of the types initially).

Open questions

  1. Should we jump to the implementation of the high-level API/Summon without data catalog features? (It looks like they are related)
    • Can we show enough value without catalog features?
  2. What is the minimum set of features we need to show the value?

dmpetrov commented 4 years ago

One of the technical challenges that I see today is importing/getting/summoning multi-file resources. For example, to import a PyTorch model you need to get:

  1. a weights file
  2. a model definition file
  3. the model file might have some dependencies on other files (which are not even specified)

@Suor What would be the best way to overcome this issue? Do we need a specific API call? Should we clone the repo and keep it until the importing and instantiation happen? You should have more context on this. Could you please specify this and even open an issue?

Suor commented 4 years ago

@dmpetrov depending on several files is demonstrated in the very first example; they need to be specified, though. This is the only way for dvc to know what it should fetch and check out. I don't know any reliable way to do that implicitly, though there are a few ways to do it unreliably, e.g. monkeypatching open() or inspecting code :).
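
(For concreteness, here is a toy illustration of that "unreliable" implicit-deps idea - not something to ship - which traces the files glue code opens by temporarily monkeypatching builtins.open; it misses mmap, C extensions, subprocesses, etc.)

# Toy sketch only: record files touched by a call via a monkeypatched open().
import builtins

def trace_opened_files(func, *args, **kwargs):
    real_open = builtins.open
    opened = []
    def tracing_open(file, *a, **kw):
        opened.append(file)
        return real_open(file, *a, **kw)
    builtins.open = tracing_open
    try:
        result = func(*args, **kwargs)
    finally:
        builtins.open = real_open
    return result, opened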

Depending on any files managed by git is not an issue with the current implementation - we do a clone anyway. I guess we will need to stick to that, because people will tend to forget some utils.py they import or config.yaml they read along the way, and since those exporting stuff are data scientists rather than programmers, we should design the system to be forgiving, DWIM style.
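
To make the multi-file case concrete, a hypothetical export.py for a PyTorch model could look like the sketch below: the weights file is the explicitly declared DVC dep, while the model definition (and any utils it imports) is just git-tracked code that comes along with the clone. Module and function names here are purely illustrative.

# export.py (hypothetical glue for a multi-file PyTorch artifact)
import torch
from nets.resnet import ResNet50  # git-tracked model definition, illustrative

def load_resnet50(weights_path):
    model = ResNet50()
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()
    return model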

Suor commented 4 years ago

List all the models (50+ models)

This would look cool in a Jupyter notebook.

Show metrics available

I guess we can implement this on top of dvc pipelines somehow. We can also show/visualize the whole pipeline: let people see the data and the code used, and all the data transformations.

This looks like an advanced thing to me.

Upload an updated version of the model back to the repo (advanced case)

I don't think this is possible without breaking a run or an import stage which has that model as an out. It is only possible for add stages, which means the model's pipeline is already broken and we don't know how it was built, etc.

@Suor proposes implementing everything through Python, which limits interoperability.

We discussed this - implementing it whichever way limits either interoperability or extensibility. I see how a more declarative way would facilitate eyeballing, and maybe some alternative implementations for a very limited number of data types like csv.

Still, I'm not sure this is a good idea to have from the start, e.g. we would end up with 2 ways of doing things instead of one.

dmpetrov commented 4 years ago

Different scenarios of data catalog usage. We can think about these scenarios as requirements for this issue.

Models zoo

DVC-repo with pretrained PyTorch models (resnet18, resnet50, resnet101): https://github.com/dmpetrov/dvc-pretrained-resnet

Usage:

model = dvc.api.summon('https://github.com/dmpetrov/dvc-pretrained-resnet',
    artifact='resnet50')  # since there are many artifacts/models

Note, type: python specifies the instantiation code. The code might be in the imported repo, in the current repo, or in a common library (like pd.read_csv - see @Suor's examples). So, it looks like this method can be a part of the DVC codebase, since no pytorch-specific stuff is required.

Example of using model zoos: https://github.com/dmpetrov/dvc_for_resnet-50

CSV file

Repo: https://github.com/iterative/df_sea_ice

Usage:

df =  dvc_cat.summon('https://github.com/iterative/df_sea_ice')

Note, the instantiation logic should be implemented as a library (dvc_cat?). The same way we can provide a library for R/Julia/Spark.

Complicated CSV file

Repo: https://github.com/iterative/df_sea_ice_no_header

Usage:

df =  dvc_cat.summon('https://github.com/iterative/df_sea_ice_no_header')

Compound CSV file

To bring only a slice of data. A repo example is not ready.

Usage:

df = dvc_cat.summon.df_pattern('https://github.com/iterative/df_slicing',
            pattern=dvc_cat.FileNamePattern('user-actions-2019-11-*.csv'))

The pattern can be more complicated: dvc_cat.DataRangePattern("2019-%m-%d", start_date=x, end_date=y). This case is very important for analytical use, but it is not needed in the first implementation.

@iterative/engineering please share your opinions about the scenarios. Is something missing?

Suor commented 4 years ago

We are on the same page on the big-picture part. Some thoughts, mostly small or even cosmetic, go below.

We need to answer a couple of questions as I see it:

  1. Do we want users to always refer to some .yml, or will we have a default one?
  2. If we have a default, say dvcartifacts.yml or dvcexport.yml, then may we postpone support for multiple yamls in a repo? We might never actually need it.

My answers: No and postpone.

I would also like to keep .summon() API close to the other calls, so:

model = dvc.api.summon(
    'resnet50', # artifact name
    repo='https://github.com/dmpetrov/dvc-pretrained-resnet', # pass by name
    rev='v1.2'  # optional
) 

CSV files

We still need to pass artifact name:

# I prefer `dvcx` for dvc export/extension, also short 
df =  dvcx.summon('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')

This, however, presupposes that we have some default way to summon csv files. We might want to avoid this and make it:

# Read csv into dataframe
df =  dvcx.to_df('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')

# Read csv into Numpy array
arr =  dvcx.to_ndarray('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')

Note that I didn't call it .csv_to_df(): a summoning user shouldn't need to worry about the source format as long as it is supported - say csv, hdf, json or whatever. He or she cares about the result format, i.e. a dataframe or a numpy array or whatever it might be on a particular platform.
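
Under that assumption, a minimal to_df() could stay a thin wrapper over the basic machinery. A sketch only: _load_artifact_spec() is a hypothetical helper that fetches the artifact entry from the repo's yaml, and a real version would dispatch on the declared source format instead of assuming csv.

# Hypothetical sketch of dvcx.to_df for an artifact with a single tabular dep.
import pandas as pd
import dvc.api

def to_df(name, repo=None, rev=None):
    spec = _load_artifact_spec(name, repo=repo, rev=rev)  # hypothetical helper
    path = spec["deps"][0]
    with dvc.api.open(path, repo=repo, rev=rev) as f:
        return pd.read_csv(f, **spec.get("args", {}))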

Model Zoo vs Model Source

These are two scenarios which slightly contradict each other. In the "model source" scenario we have a model along with the data and code used to build it and its metrics. All these things are stored in git, and we may potentially show them to the user, which is cool.

However, if we dvc add a model file then we don't have any history or metrics, which we probably want to provide in an export .yml file. We would need some coordinated changes to keep those relevant, i.e. update a model, update metrics, maybe some other metadata (dates, authors, refs to data, method description). All of this duplicates data available naturally in a model source repo.

Then again, we may turn a model zoo into a model source by importing models from source repos, if those models are built within dvc repos, of course.

On patterns: let's postpone that discussion, I see the value.

jorgeorpinel commented 4 years ago

I prefer dvcx for dvc export/extension

Quick question: why isn't it just dvc.api.summon, dvc.api.summon( ... ).to_df(), etc.?

jorgeorpinel commented 4 years ago

Model Zoo vs Model Source

I also think DVC shines best in the model source scenario, but building model registries/catalogs without code or data sounds like a useful thing to do and it can be achieved with dvc import instead of dvc add, so the connection to the source projects is kept. 🙂

UPDATE: Oh, I see this is already mentioned:

Then again we may turn model zoo into a model source by importing models from source repos...

So it seems we agree the Model Zoo concept is most appropriate, as it can indirectly include model sources with DVC.

dmpetrov commented 4 years ago

@Suor thank you for the great feedback and the details that you have noticed! I agree with most of the items you brought up.

It looks like the major unanswered question is model source vs. model zoo. Metrics files and the number of artifact files are related to this, I believe.

First, it looks like the scenario with a model source where the model was generated from data and code, with a corresponding metrics file, is a bit restrictive. In reality, metrics files can be created separately (custom, or taken from previous runs) and DVC should not have control over this. From this point of view, the difference between a metrics file (csv/json) and the artifact file that we are inventing now is not as dramatic as it might look.

Second, in an ideal scenario, an artifact yaml file should be generated from a dvc pipeline the same way as a metrics file. We will probably need a library to create metrics files and the artifact yaml. Even without artifacts it would be nice to have.

So, why don't we even merge these two concepts? We can say that any output yml file which is marked as a metric file is a dvc-artifact.

  1. We don’t need to create an additional abstraction in DVC
  2. There are no questions about how many artifact files you can create - any number; read them all

Disadvantage: you need to have a corresponding DVC-file even if you don't use data files at all. Create that file with dvc run -M artifact.yml -O sea_ice.csv
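
(Circling back to the "generate the artifact yaml from the pipeline" idea above, the library in question could start as small as a helper like this - purely illustrative, write_artifact() and its fields are not an agreed API:)

# Hypothetical helper a training script could call; the script itself would
# be executed via something like: dvc run -M artifact.yml ...
import yaml

def write_artifact(path, name, call, deps, description=None, reqs=None):
    spec = {"artifacts": [{
        "name": name,
        "description": description,
        "call": call,
        "deps": deps,
        "reqs": reqs or [],
    }]}
    with open(path, "w") as f:
        yaml.safe_dump(spec, f, sort_keys=False)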

What do you folks think about treating yaml files the same way as metric ones?

PS: Also, we discussed how to get an artifact, but not how artifacts can be imported. We might need a new command like dvc import-artifact resnet50 http://myrepoaddress.

dmpetrov commented 4 years ago

I prefer dvcx for dvc export/extension

Quick question, why isn't it just dvc.api.summon, dvc.api.summon( ... ).to_df(), etc.

@jorgeorpinel dvc.api.summon.* or dvcx.* - it is a really good question. Which method to use depends on where the deserialization code lives.

For CSV we need code that can deserialize it. That code will depend on libraries like pandas that we don't want to include as a dependency of DVC. So, it is better to extract it into a separate package, dvcx.

In the resnet case, the python code is a part of the imported repo - type: python. The user just calls the code for deserialization. In this case, it is okay to leave all the dependency/library management on the user's shoulders. So, we can implement the basic stuff inside DVC without introducing new dependencies.

Suor commented 4 years ago

Second, in an ideal scenario, an artifact yaml file should be generated from a dvc pipeline

The artifact file is completely human-written; it has many things like a name, a description and a ref to human-written code. None of this can be generated. A metric, in contrast, is an output, which will be rewritten by a command.

So I don't see how these may be combined.

Also, we discussed how to get artifact but had not discussed how artifacts can be imported.

You first need to say what you mean by importing artifacts. For me, artifacts are strictly in-memory objects. If you mean importing all their deps plus the instantiating code, then why do we need this?

Suor commented 4 years ago

So for model zoo I see two options:

These are not mutually exclusive, but I would prefer starting with one.

Suor commented 4 years ago

A high-level consideration: do not mix human written and machine generated content in a single place. This inhibits its human writability and readability.

P.S. I know we already do that in stage files, but it's hard to do anything about that now, and we have had issues with it.

dmpetrov commented 4 years ago

@Suor your points about dvc add, single yml files, missing metrics files and human readability are good, but they can be overcome or are not so crucial.

However, the point about outputs (generating the artifact file) is great - it would create duplication (retrain a model if the model description changes, or modify it in two places and commit). It would be better to keep artifact files and metrics files separate. I don't think any change coordination is needed at this point - the user should decide. Also, it looks like the simplest decision from the implementation standpoint.

PS: Let's discuss importing artifacts later. I think it is a useful feature. The only problem: it won't work for python imports, only for "catalog native" types such as csv. This discussion can be postponed till the first implementation.

ghost commented 4 years ago

should we add dvc.api to analytics? does it make sense?

efiop commented 4 years ago

@mroutis Good question. It does make sense, but we need to take special care with that one. Consider creating a separate ticket.

ghost commented 4 years ago

what about using dvc-formulas.yml or dvc-conjures.yaml instead of artifacts?

The caster (magician) uses formulas to summon agents.

My favorite would be grimoire.yaml (but quite obscure)


reading that sentence with the following soundtrack https://www.youtube.com/watch?v=Bz6Kc4rcfOk

ghost commented 4 years ago

reference: https://github.com/huggingface/transformers

ghost commented 4 years ago

I think a nice addition would be to pass parameters through summon:

dvc.api.summon("say_hello", params={"name": "world"}, repo="https://github.com/blahblah")

This brings more flexibility:

dmpetrov commented 4 years ago

@mroutis yes, parameter overwriting should be supported in the function. Sorry, it was not specified explicitly.
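
(A sketch of how such overriding could work, assuming user-supplied params are simply merged over the args declared in the yaml before the call target is invoked - names are illustrative:)

# Hypothetical: summon-time params override the yaml-declared args.
def resolve_call_kwargs(spec, params=None):
    kwargs = dict(spec.get("args", {}))
    kwargs.update(params or {})
    return kwargs

# e.g. resolve_call_kwargs({"args": {"name": "genie"}}, params={"name": "world"})
# -> {"name": "world"}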

dmpetrov commented 4 years ago

@mroutis formulas.yml (without dvc- prefix) is a good one.

dmpetrov commented 4 years ago

@mroutis for the items inside the yml you can use the term data objects - it is a nice addition to data files and data artifacts.

These items have types. So, object explains the nature of this thingy pretty well.

ghost commented 4 years ago

I like the proposed terminology, @dmpetrov, will make the change :)

Suor commented 4 years ago

So, type: csv summons are implemented in https://github.com/iterative/dvcx; it requires the latest master dvc as of now. Things that are not in yet:

jorgeorpinel commented 4 years ago

docs?

Yes, def. docs! Please see iterative/dvc.org/pull/908 (although I'll probably open a separate issue for a more detailed summon() ref), #3092, and new: iterative/dvcx/issues/1

Suor commented 4 years ago

Docs have a question mark because I see keeping everything in the README as an advantage for smaller projects, both for users and devs. Creating docs makes things slow, which is not good for a new, evolving project. It takes weeks to merge something to docs now.

shcheklein commented 4 years ago

@Suor README + good docstrings + examples should be enough, I agree. We'll move the stuff that needs to be on dvc.org (for SEO, for awareness of summon's existence, etc.) after that.

It takes weeks to merge something to docs now.

yep, we can try to address that together at the next retro.

jorgeorpinel commented 4 years ago

Yeah "docs" doesn't have to mean the dvc.org website. 2 of the 3 issues are linked are about docstrings. READMEs are great too (and may need more love, do we ever review them?). But all these kinds of docs have different public and goals probably. Good topic for retro indeed.

weeks to merge something to docs

I checked the last 10 normal PRs in dvc.org. Average merge time is around 24 hours. But yes, it can take much longer sometimes.

efiop commented 4 years ago

@Suor What is the status in this ticket? What are our next steps?

Suor commented 4 years ago

Closing the remaining points in favor of separate issues in dvcx: https://github.com/iterative/dvcx/issues/5 and https://github.com/iterative/dvcx/issues/6.