The MongoDB connector should support the `prisma introspect` command

nikolasburk commented 5 years ago

When using the MongoDB connector with an existing database, I currently need to model my data by hand. It would be great if the `prisma introspect` command could help with this by sampling a number of documents from the collections in my database and suggesting a datamodel based on them.

ejoebstl commented 5 years ago

Proposed design goals

The introspection process ...

should not be interactive
should create a datamodel file that's easy to tweak, e.g. by inserting useful comments
should yield an equivalent datamodel when run against a database created by prisma from an existing datamodel

Proposed tasks

All mentioned Todos only affect the CLI component.

First, some cleanup should be done. This is optional but recommended, to avoid duplication.

[x] Define an abstract internal format for representing database datamodels, as opposed to the current relational-focused model. This format should also support representing error states, e.g. invalid schemas, which can occur with NoSQL databases.
[x] Refactor the SDL inferrer to support the new format and make use of abstraction for different database types.
[x] Refactor (or wrap) the existing postgres connector to support the new datamodel.
[x] Look for duplicate definitions and utility methods throughout the code, especially those related to pluralization and capitalization. Move those to a central utility module.

After this, we can implement the MongoDB connector:

[x] Create a base class for fetching schemas from any NoSQL database. This class should already support interfaces for abstract sampling strategies and intersection strategies, as well as an interface for resolving relations.
[x] Create sampling strategies: First, Random-N¹, All.
[x] Create relation resolver strategy: Index-Lookup²
[x] Implement the actual MongoDB connector

Notes on Sampling:

1: Samples N random documents from each collection and tries to find a useful intersection.

Multiple samples are merged. Fields that are found in all samples are made required, fields that are found in some samples are made optional.
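
The merge rule above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Doc`, `InferredField` and `mergeSamples` are hypothetical names, and only field presence is considered (no types, no nested documents).

```typescript
// Hypothetical types for illustration only; not the prisma-datamodel API.
type Doc = { [field: string]: unknown };

interface InferredField {
  name: string;
  required: boolean;
}

// Count in how many sampled documents each field occurs.
// A field present in every sample becomes required, otherwise optional.
function mergeSamples(samples: Doc[]): InferredField[] {
  // Null-prototype object so keys like "constructor" are counted safely.
  const counts: { [field: string]: number } = Object.create(null);
  for (const doc of samples) {
    for (const key of Object.keys(doc)) {
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return Object.keys(counts).map((name) => ({
    name,
    required: counts[name] === samples.length,
  }));
}
```

With two samples where only one has an `email` field, `_id` would come out as required and `email` as optional.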

Notes on Resolving Relations

2: For fields of type UUID or ObjectID, performs an index lookup on all collections to guess whether a relation exists. Possible alternatives or additions: guessing relations by name, or including other types in the lookup as well.

Relations for embedded documents are recursively resolved.
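
As a rough sketch of the index-lookup idea: the snippet below uses an in-memory map as a stand-in for the per-collection `_id` indexes. A real resolver would query the database instead; `IdIndex` and `guessRelationTarget` are hypothetical names chosen for this example.

```typescript
// Hypothetical in-memory stand-in for the _id index of each collection.
type IdIndex = { [collection: string]: string[] };

// Returns the name of the first collection whose _id index contains the
// given value, or null if no collection contains it (no relation guessed).
function guessRelationTarget(value: string, indexes: IdIndex): string | null {
  for (const collection of Object.keys(indexes)) {
    if (indexes[collection].indexOf(value) !== -1) {
      return collection;
    }
  }
  return null;
}
```

For embedded documents, the same lookup would be applied to each nested field in turn.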

Handled corner cases

When there are documents with the same field, but different primitive types, we simply generate a comment which indicates the conflict.
MongoDB allows '.' and '$' in variable names. We sanitize the type name.
When we find embedded types with equal schemas, we merge them into a single embedded type.
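
Two of these cases could look roughly like this. Both helpers are hypothetical: the replacement character `_` and the exact wording of the conflict comment are assumptions made for the example, not the behavior of the actual implementation.

```typescript
// Replace the characters '.' and '$' so the name is usable as a type name.
// Using '_' as the replacement is an assumption for this sketch.
function sanitizeTypeName(raw: string): string {
  return raw.replace(/[.$]/g, "_");
}

// Render one field line. If the samples disagree on the primitive type,
// emit a comment naming the conflict instead of a field definition.
// Assumes observedTypes contains at least one entry.
function renderFieldLine(name: string, observedTypes: string[]): string {
  // De-duplicate while keeping first-seen order.
  const unique = observedTypes.filter((t, i) => observedTypes.indexOf(t) === i);
  if (unique.length > 1) {
    return "# " + name + ": conflicting types " + unique.join(", ");
  }
  return name + ": " + unique[0];
}
```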

Unhandled corner cases

If we encounter such a case, we abort.

Duplicate field names.
_id fields which are not a primitive type.

Dependencies

Existing databases might have embedded types without any _id field. Related to #3575.

ejoebstl commented 5 years ago

This PR implements an alpha version of Mongo introspection. I've abstracted the concept of document databases, so it should be super easy (150 LOC) to add other document databases in the future. Schema rendering is now a completely independent module in prisma-datamodel.

Resolving Behavior

The default behavior is: Sample one element from each collection to infer a flat schema, then do a lookup of all fields of 50 randomly selected items to find relations.
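
To make the sampling strategies concrete, here is a minimal in-memory sketch of First, Random-N and All. The names `Sampler`, `sampleFirst`, `sampleAll` and `sampleRandomN` are hypothetical, and operating on an array is a simplification. Against a real MongoDB instance, random sampling would more likely be pushed to the server, for example with the `$sample` aggregation stage.

```typescript
// A sampling strategy picks a subset of documents from a collection.
type Sampler<T> = (items: T[]) => T[];

// First: only the first document.
function sampleFirst<T>(items: T[]): T[] {
  return items.slice(0, 1);
}

// All: every document.
function sampleAll<T>(items: T[]): T[] {
  return items.slice();
}

// Random-N: up to n distinct documents, chosen with a partial
// Fisher-Yates shuffle. The random source is injectable for testing.
function sampleRandomN<T>(n: number, random: () => number = Math.random): Sampler<T> {
  return (items: T[]): T[] => {
    const pool = items.slice();
    const count = Math.min(n, pool.length);
    for (let i = 0; i < count; i++) {
      const j = i + Math.floor(random() * (pool.length - i));
      const tmp = pool[i];
      pool[i] = pool[j];
      pool[j] = tmp;
    }
    return pool.slice(0, count);
  };
}
```

Under these assumptions, the default described above would correspond to `sampleFirst` for the flat schema and `sampleRandomN(50)` for the relation lookup.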

Let's see how well this works with real-world data.

For now, we try to infer relations on all ObjectID and string fields. I'll test that with real-world data.

Open Todos

[ ] Type naming: It might be desirable to singularize type names when inferred from an array type field (Example: embedded type for field orders should be called Order).
[x] Refactor primitive type inference to a separate class or module.
[x] More tests, especially complicated or messy datasets.

ejoebstl commented 5 years ago

There are currently the following open questions for this feature. I suggest we wait for input from users who have tried the new beta release to answer these questions:

[ ] It might make sense to use random sampling for inferring the flat model as well.
[ ] Is it desirable to infer required/not required? E.g. when a field was set on every sampled document, we could mark it as required.
[ ] Bi-directional relations are not considered right now. I'm not sure if it's a good idea to infer that from the data model.
[ ] Type naming, as mentioned above. Do we have any reference implementation for regularizing stuff?
[ ] I am not sure whether all primitive types supported by Mongo/BSON are currently mapped to the best possible Prisma type.

nikolasburk commented 5 years ago

I just tested the introspection with this data that was structured according to this datamodel:

type User @db(name: "users") {
  id: ID! @id
  email: String @unique
  name: String!
  posts: [Post!]! @relation(link: INLINE)
}

type Post @db(name: "posts") {
  id: ID! @id
  wasCreated: DateTime! @createdAt
  wasUpdated: DateTime! @updatedAt
  title: String!
  published: Boolean @default(value: false)
  author: User
  comments: [Comment!]!
}

type Comment @embedded {
  text: String!
  writtenBy: User!
}

This was the output that was generated:

type posts {
  _id: ID! @id
  published: Boolean
  title: String
  wasCreated: postsWasCreated
  wasUpdated: postsWasUpdated
}

# type postsWasCreated @embedded {

# }

# type postsWasUpdated @embedded {

# }

# type User {

# }

type users {
  _id: ID! @id
  email: String
  name: String
  posts: [ID!]!
}

EDIT: Note that the dataset I used was extremely small:

2 documents in users
3 documents in posts
- 2 of the 3 documents had 1 subdocument each in comments

Data: ![image](https://user-images.githubusercontent.com/4058327/49940221-3d099380-fedf-11e8-9639-7c6cb4e8c76a.png) ![image](https://user-images.githubusercontent.com/4058327/49940236-472b9200-fedf-11e8-8886-cd63bbe7cdb4.png)

nikolasburk commented 5 years ago

One general consideration: we could generate model names that follow the Prisma conventions, i.e. start with an uppercase letter and use the singular form, and use the @db directive to map to the underlying collection.

I opened an issue for this: https://github.com/prisma/prisma/issues/3702

ejoebstl commented 5 years ago

Thank you for the input. I will look into the relation issue immediately. Can you PM me the data as JSON or PM me credentials for the database?

Regarding the naming: Great idea! To respect prisma conventions, we should singularize type names. Is there any reference for this in prisma so far? Otherwise I can just do something trivial, like trimming a trailing s.
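
A minimal sketch of the trivial approach: trim one trailing `s`, uppercase the first letter, and map back to the collection with the `@db` directive. `toModelName` and `renderTypeHeader` are hypothetical helper names, and the rule is deliberately naive (it would turn `status` into `Statu` and `categories` into `Categorie`).

```typescript
// Naive singularization: drop a single trailing "s", then capitalize.
function toModelName(collection: string): string {
  const endsWithS =
    collection.length > 1 && collection.charAt(collection.length - 1) === "s";
  const singular = endsWithS ? collection.slice(0, -1) : collection;
  return singular.charAt(0).toUpperCase() + singular.slice(1);
}

// Render the opening of a type definition that maps the model name
// back to the underlying collection name.
function renderTypeHeader(collection: string): string {
  return "type " + toModelName(collection) + ' @db(name: "' + collection + '")';
}
```

For the collections in the example above, this would produce `type User @db(name: "users")` and `type Post @db(name: "posts")`.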

nikolasburk commented 5 years ago

We have some sparse docs for naming conventions (they don't actually mention the uppercasing of models) here. The data was produced using the following three mutations:

Create two new users

mutation {
  user1: createUser(data: {
    email: "alice@prisma.io"
    name: "Alice"
    posts: {
      create: {
        title: "Join us for GraphQL Conf 2019 in Berlin"
        published: true
      }
    }
  }) {
    id
  }

  user2: createUser(data: {
    email: "bob@prisma.io"
    name: "Bob"
    posts: {
      create: [{
        title: "Subscribe to GraphQL Weekly for community news"
        published: true
      } {
        title: "Follow Prisma on Twitter"
      }]
    }
  }) {
    id
  }
}

Add comments to two posts from `Bob` (send twice)

mutation {
  updatePost(
    where: {
      id: "__ID_FROM_BOBS_POST__"
    }
    data: {
      comments: {
        create: [{
          text: "Love it 👏"
          writtenBy: {
            connect: {
              email: "alice@prisma.io"
            }
          }
        }]
      }
    }
  ) {
    id
  }
}

ejoebstl commented 5 years ago

The problem can be split into the following issues:

[x] Only the first item in a collection is sampled for model resolution by default, therefore the comments embedded type is missing completely. I will change that default to random sampling.
[x] DateTime scalar type is not handled correctly. I will re-work the scalar handling.

pantharshit00 commented 5 years ago

I tried the Mongo introspection with a customer. The database has many nested embedded fields. The introspection result has many errors (see especially TenantStoreDataMappingMappedColumns, which had some interesting results): https://pastebin.com/E3UPd8J6

Here is a sample document from the database: https://pastebin.com/Nxgp3YEq. The DB had around 748 records and introspection took a while, so he initially reported this because he thought introspection was not working. I would therefore also suggest adding a progress bar in the future.

Even after I corrected the datamodel manually and deployed it, I got an error saying Prisma can't handle ObjectId('....').

I used prisma version 1.24-beta.

ejoebstl commented 5 years ago

Thanks for your bug report - it's super helpful to have some real-world examples.

Are there other documents in the database that look different? I don't think the introspection would generate a TenantStoreDataMappingMappedColumns type at all from the given sample.

If so it would be incredibly helpful if you could share those with us.

pantharshit00 commented 5 years ago

Unfortunately I had pretty limited access, so I was only able to extract the info above. The tenant document is the only example he gave me. I will try to get more info from him, though.

pantharshit00 commented 5 years ago

I think we support this now. Any other introspection related bug reports should be separate now.

Closing :)
