Suggestion: Add a Database Provider

mboisnard commented 7 months ago

Be able to generate database entries just like the faker js version:

collation, column, engine, type (from definition)
mongodbObjectId (from hexadecimal generated string)

Sources:

serpro69 commented 7 months ago

Hi @mboisnard ,

Thank you for the suggestion. This is something that's been in the back of my mind for a while now :) Along with generating csv, json, ...

If you'd like to work on this, please let me know. Otherwise I'll try to prioritize it myself in the near future so that it's available in the next release.

UPD: Oh wait, now that I've looked at the links you provided, I think I misunderstood what you meant by a "Database Provider" :D I was thinking of an interface that can be used to automatically populate a db table, for example.

As for the Database Provider that you meant, I can see that it could be useful, definitely. But what they have in fakerjs seems quite specific and narrow, which I don't think is good enough for a more "generic use-case". For example, if we take database collation, which db implementation are we talking about? The implementation in fakerjs doesn't seem to take that into account (e.g. https://github.com/faker-js/faker/blob/c1caa900ceb12737a3aa45b7e4dd75797a11a889/src/locales/base/database/collation.ts ). Column data types also vary from one db to another. Postgres doesn't have selectable storage engines the way mysql does, for example. And so on.

If you'd like to provide a "Database Provider" yml file that contains such information for various database implementations - please do so and I'll be happy to include this :)

For example, this is what ChatGPT gave me:

postgresql:
  column:
    - id
    - name
    - email
    - created_at
    - updated_at
  type:
    - INTEGER
    - BIGINT
    - DECIMAL
    - NUMERIC
    - REAL
    - DOUBLE PRECISION
    - SERIAL
    - BIGSERIAL
    - CHAR
    - VARCHAR
    - TEXT
    - DATE
    - TIMESTAMP
    - TIMESTAMP WITH TIME ZONE
    - BOOLEAN
    - JSON
    - JSONB
    - BYTEA
    - ARRAY
    - UUID
    - ENUM
  engine: []
  collation:
    - "en_US.UTF-8"
    - "en_GB.UTF-8"
    - "de_DE.UTF-8"
    - "fr_FR.UTF-8"

mysql:
  column:
    - id
    - name
    - email
    - created_at
    - updated_at
  type:
    - INT
    - BIGINT
    - DECIMAL
    - FLOAT
    - DOUBLE
    - CHAR
    - VARCHAR
    - TEXT
    - DATE
    - DATETIME
    - TIMESTAMP
    - TIME
    - YEAR
    - BOOLEAN
    - JSON
    - BINARY
    - VARBINARY
    - BLOB
    - ENUM
    - SET
  engine:
    - InnoDB
    - MyISAM
    - MEMORY
    - CSV
    - ARCHIVE
    - BLACKHOLE
    - MERGE
    - FEDERATED
  collation:
    - "utf8mb4_general_ci"
    - "utf8mb4_unicode_ci"
    - "latin1_swedish_ci"
    - "latin1_general_ci"

mariadb:
  column:
    - id
    - name
    - email
    - created_at
    - updated_at
  type:
    - INT
    - BIGINT
    - DECIMAL
    - FLOAT
    - DOUBLE
    - CHAR
    - VARCHAR
    - TEXT
    - DATE
    - DATETIME
    - TIMESTAMP
    - TIME
    - YEAR
    - BOOLEAN
    - JSON
    - BINARY
    - VARBINARY
    - BLOB
    - ENUM
    - SET
  engine:
    - InnoDB
    - MyISAM
    - Aria
    - MEMORY
    - CSV
    - ARCHIVE
    - BLACKHOLE
    - MERGE
    - FEDERATED
    - TokuDB
    - Spider
  collation:
    - "utf8mb4_general_ci"
    - "utf8mb4_unicode_ci"
    - "latin1_swedish_ci"
    - "latin1_general_ci"

Is it comprehensive and accurate enough? I'm really not sure :D It's a start, but I don't know if it's good enough, so to speak.

Additionally, just in case you have a very specific use-case, I'd recommend taking a look at the "creating your own data providers" docs. This functionality has been available since version 2.0.0-rc.1 and allows you to extend the faker implementation and create your own data providers ;)

I'll still keep this issue open in case you or anyone else wants to work on this. Seems like a good "first issue" :)

mboisnard commented 7 months ago

Hello @serpro69 , thanks for your answer.

Yes, actually I was talking about the same behavior as faker-js, and I completely agree with you that the faker-js implementation is too generic and can be improved to match the possible data for each database.

I will take a look at your documentation, and try to contribute to the project :)

serpro69 commented 7 months ago

Contributions are always welcome :) Thanks!

I think this https://github.com/serpro69/kotlin-faker/blob/master/CONTRIBUTING.adoc#adding-new-functionality should help with the implementation of this issue. But also feel free to ask if you need any help.

As I mentioned, the bigger part of the task here would be to gather the data itself. After that you should be able to follow the above documentation to add a new data provider implementation; but if something is unclear there - please let me know. I'd like to improve the contributing guidelines also if they're not good enough.

Just a few suggestions also:

- For collation it could be impractical to include all possible values in the .yml file. What we could do instead is take the locale value from the faker's configuration and use it to "construct" possible collation values. E.g. for postgres we'd probably only need to append .UTF-8 to the locale string. For mysql/mariadb some "conversion logic" from locale to collation would probably be needed. For the other db types IDK, we would need to check what the possible values are and how to return them in a nice way.
- For columns I'm not entirely sure what a good "list of common column names" is, or what the use-case even is here. Feel free to submit some proposals from your end :) Also, it doesn't need to be a separate property for each db type, since the values will be the same, I guess.
- For type and engine (where applicable), the values can be added to the .yml directly. I think this would be the easiest approach for these two properties.
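
A minimal Kotlin sketch of the collation idea above, under stated assumptions: the function names and the small language-to-collation table are hypothetical and not part of kotlin-faker, and the locale is assumed to arrive as a BCP 47 language tag such as "en-US". The postgres variant only appends an encoding suffix, while the mysql/mariadb variant needs a lookup with a fallback.

```kotlin
import java.util.Locale

// Hypothetical helper (not a kotlin-faker API): builds a PostgreSQL-style
// collation name such as "en_US.UTF-8" from a language tag such as "en-US".
fun postgresCollation(languageTag: String, encoding: String = "UTF-8"): String {
    val locale = Locale.forLanguageTag(languageTag)
    // A language-only tag ("en") has no country part, so no underscore is added.
    val base = if (locale.country.isEmpty()) locale.language
               else "${locale.language}_${locale.country}"
    return "$base.$encoding"
}

// Hypothetical, deliberately incomplete "conversion logic" for MySQL/MariaDB:
// a few language-specific utf8mb4 collations, keyed by language code.
private val mysqlCollationByLanguage = mapOf(
    "de" to "utf8mb4_german2_ci",
    "sv" to "utf8mb4_swedish_ci",
    "es" to "utf8mb4_spanish_ci",
)

// Falls back to a generic Unicode collation when the language is not in the table.
fun mysqlCollation(languageTag: String): String {
    val language = Locale.forLanguageTag(languageTag).language
    return mysqlCollationByLanguage[language] ?: "utf8mb4_unicode_ci"
}
```

Whether a constructed name actually exists on a given server depends on the database version and the operating system's locales, so a real provider would likely still need a vetted list per database.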

mboisnard commented 7 months ago

Thx for your suggestions, I created a branch to implement the databases behavior and I have several questions for you :)

For the MongoDB provider I would like to create a generateObjectId method based on a random date and inspired by the logic I found in JS here (https://steveridout.com/mongo-object-time/)

- The MongoDB provider is not based on a yaml file, so I would like to extend the AbstractFakeDataProvider class, just like the StringProvider does for example, in the databases gradle module I just created. The AbstractFakeDataProvider class is marked as internal. Is that intentional, or have you just not needed to expose it yet?
- To be able to generate an objectId I would like to add a new method to the RandomService that generates an OffsetDateTime, which could be used by anyone as well as by the MongoDB provider. Can we access the RandomService from a provider? (I just removed the internal modifier from this field in FakerService to make it work on my branch.)
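
A rough Kotlin sketch of the ObjectId idea, not the code from the branch: a MongoDB ObjectId is 12 bytes (24 hex characters) whose first 4 bytes are the creation time in seconds since the Unix epoch. The function names here are hypothetical, kotlin.random.Random stands in for the RandomService, and Instant is used instead of OffsetDateTime to keep the example short. The 8 bytes after the timestamp are simply filled with random data, whereas a real driver uses a 5-byte random value followed by a 3-byte counter.

```kotlin
import java.time.Instant
import kotlin.random.Random

// Hypothetical helper: picks a random instant in [from, until) with second precision.
fun randomInstant(random: Random, from: Instant, until: Instant): Instant =
    Instant.ofEpochSecond(random.nextLong(from.epochSecond, until.epochSecond))

// Hypothetical helper: builds a 24-character hex string shaped like an ObjectId.
fun objectIdFrom(timestamp: Instant, random: Random = Random.Default): String {
    // First 4 bytes: seconds since the Unix epoch, rendered as 8 hex characters.
    val seconds = timestamp.epochSecond and 0xFFFFFFFFL
    val timePart = seconds.toString(16).padStart(8, '0')
    // Remaining 8 bytes: random filler (simplification, see the note above).
    val tail = random.nextBytes(8).joinToString("") { byte ->
        (byte.toInt() and 0xFF).toString(16).padStart(2, '0')
    }
    return timePart + tail
}

// Reads the creation time back out of the first 8 hex characters.
fun timestampOf(objectId: String): Instant =
    Instant.ofEpochSecond(objectId.substring(0, 8).toLong(16))
```

Round-tripping through timestampOf is a quick way to check that the generated value carries the intended date.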

serpro69 commented 7 months ago

Hey @mboisnard ,

Let me give you some existing code examples to make things easier to understand.

- Creating a new data provider that is not yaml-based outside of "core faker" is not supported. I'm not sure it makes much sense to expose those things either. Seems like a very specific use-case.
  - What I could suggest instead is having one DatabaseProvider implementation, which contains both the common functionality and the specifics for the various <DatabaseType>Providers, accessible via an additional property (take a look at https://github.com/serpro69/kotlin-faker/blob/5106afe80cf16d43b0370e5cc3558a91d0850029/faker/edu/src/main/kotlin/io/github/serpro69/kfaker/edu/provider/Educator.kt#L23 for example)
  - This, however, doesn't solve the part that "MongoDbProvider" isn't going to implement YamlFakeDataProvider. If that is intentional, and you only want to have this one function generateObjectId for the mongo-db provider, I can think of two ways:
    - Place the function in the DatabaseProvider instead and name it mongoDbObjectId, for example. This way you will have a DatabaseProvider based on yaml, but you can also have functions inside it that don't use data from yaml files.
    - Kind of hacky, but you can just omit this part in the MongoDbProvider https://github.com/serpro69/kotlin-faker/blob/5106afe80cf16d43b0370e5cc3558a91d0850029/faker/edu/src/main/kotlin/io/github/serpro69/kfaker/edu/provider/Educator.kt#L49-L51 and still inherit from YamlFakeDataProvider.
  - I think the latter would work (haven't tried it myself though), but I'd go with the former as it's cleaner, and it's perfectly fine to have functions that read from yml and functions that don't in the same provider implementation (see e.g. Internet#iPv4Address - https://github.com/serpro69/kotlin-faker/blob/5106afe80cf16d43b0370e5cc3558a91d0850029/core/src/main/kotlin/io/github/serpro69/kfaker/provider/Internet.kt#L48-L49 which is a custom function not based on yml data, but is inside a YamlFakeDataProvider implementation class)
- To get access to RandomService from a data provider implementation, you can use this as an example:
  - First add it as a constructor parameter - https://github.com/serpro69/kotlin-faker/blob/5106afe80cf16d43b0370e5cc3558a91d0850029/faker/books/src/main/kotlin/io/github/serpro69/kfaker/books/provider/Dune.kt#L15-L18
  - Then in the faker, you can use the randomService property that is available from the AbstractFaker - https://github.com/serpro69/kotlin-faker/blob/5106afe80cf16d43b0370e5cc3558a91d0850029/faker/books/src/main/kotlin/io/github/serpro69/kfaker/books/BooksFaker.kt#L39
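
To illustrate the shape of the first option, here is a self-contained Kotlin sketch that deliberately does not use kotlin-faker's real classes. Every name in it is a hypothetical stand-in: the map plays the role of values loaded from a .yml file, and kotlin.random.Random plays the role of a RandomService passed in through the constructor. It only shows that data-backed and computed functions can live side by side in one provider.

```kotlin
import kotlin.random.Random

// Hypothetical stand-in for a yaml-based provider; not a kotlin-faker class.
class SketchDatabaseProvider(
    // Stands in for the dictionary data that would be read from a .yml file.
    private val data: Map<String, List<String>>,
    // Stands in for a random source injected as a constructor parameter.
    private val random: Random,
) {
    // Data-backed function: returns one of the values listed under "engine".
    fun engine(): String = data.getValue("engine").random(random)

    // Computed function: uses no dictionary data, only the injected random
    // source, and returns 12 random bytes as a 24-character hex string.
    fun mongoDbObjectId(): String =
        random.nextBytes(12).joinToString("") { byte ->
            (byte.toInt() and 0xFF).toString(16).padStart(2, '0')
        }
}
```

In the real project the linked Educator, Internet and Dune classes remain the references to follow for the actual base classes and constructor signatures.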

serpro69 commented 7 months ago

Don't know if the above made much sense :grin: Feel free to ask if you want me to clarify something further :)

serpro69 / kotlin-faker

Suggestion: Add a Database Provider #221