serpro69 / kotlin-faker

Port of a popular ruby faker gem written in kotlin. Generate realistically looking fake data such as names, addresses, banking details, and many more, that can be used for testing and data anonymization purposes.
https://serpro69.github.io/kotlin-faker/
MIT License

Suggestion: Add a Database Provider #221

Closed mboisnard closed 6 months ago

mboisnard commented 7 months ago

Be able to generate database entries just like the faker js version:

Sources:

serpro69 commented 7 months ago

Hi @mboisnard ,

Thank you for the suggestion. This is something that's been in the back of my mind for a while now :) Along with generating csv, json, ...

If you'd like to work on this, please let me know. Otherwise I'll try to prioritize this myself in the near future to make it available for the next release.

UPD: Oh wait, now that I looked at the links you've provided, I think I misunderstood what you meant by a "Database Provider" :D I was thinking of an interface that can be used to automatically populate a db table, for example.

As for the Database Provider that you meant: I can see that it could be useful, definitely. But what they have in fakerjs seems quite specific and narrow, which I don't think is good enough for a more "generic use-case". For example, if we take database collation, which db implementation are we talking about? The implementation in fakerjs doesn't seem to take that into account (e.g. https://github.com/faker-js/faker/blob/c1caa900ceb12737a3aa45b7e4dd75797a11a889/src/locales/base/database/collation.ts ). Column data types also vary from one db to another. Postgres doesn't have storage engines the way mysql does, for example. And so on.

If you'd like to provide a "Database Provider" yml file that contains such information for various database implementations - please do so and I'll be happy to include this :)

For example, this is what chatgpt gave me:

postgresql:
  column:
    - id
    - name
    - email
    - created_at
    - updated_at
  type:
    - INTEGER
    - BIGINT
    - DECIMAL
    - NUMERIC
    - REAL
    - DOUBLE PRECISION
    - SERIAL
    - BIGSERIAL
    - CHAR
    - VARCHAR
    - TEXT
    - DATE
    - TIMESTAMP
    - TIMESTAMP WITH TIME ZONE
    - BOOLEAN
    - JSON
    - JSONB
    - BYTEA
    - ARRAY
    - UUID
    - ENUM
  engine: []
  collation:
    - "en_US.UTF-8"
    - "en_GB.UTF-8"
    - "de_DE.UTF-8"
    - "fr_FR.UTF-8"

mysql:
  column:
    - id
    - name
    - email
    - created_at
    - updated_at
  type:
    - INT
    - BIGINT
    - DECIMAL
    - FLOAT
    - DOUBLE
    - CHAR
    - VARCHAR
    - TEXT
    - DATE
    - DATETIME
    - TIMESTAMP
    - TIME
    - YEAR
    - BOOLEAN
    - JSON
    - BINARY
    - VARBINARY
    - BLOB
    - ENUM
    - SET
  engine:
    - InnoDB
    - MyISAM
    - MEMORY
    - CSV
    - ARCHIVE
    - BLACKHOLE
    - MERGE
    - FEDERATED
  collation:
    - "utf8mb4_general_ci"
    - "utf8mb4_unicode_ci"
    - "latin1_swedish_ci"
    - "latin1_general_ci"

mariadb:
  column:
    - id
    - name
    - email
    - created_at
    - updated_at
  type:
    - INT
    - BIGINT
    - DECIMAL
    - FLOAT
    - DOUBLE
    - CHAR
    - VARCHAR
    - TEXT
    - DATE
    - DATETIME
    - TIMESTAMP
    - TIME
    - YEAR
    - BOOLEAN
    - JSON
    - BINARY
    - VARBINARY
    - BLOB
    - ENUM
    - SET
  engine:
    - InnoDB
    - MyISAM
    - Aria
    - MEMORY
    - CSV
    - ARCHIVE
    - BLACKHOLE
    - MERGE
    - FEDERATED
    - TokuDB
    - Spider
  collation:
    - "utf8mb4_general_ci"
    - "utf8mb4_unicode_ci"
    - "latin1_swedish_ci"
    - "latin1_general_ci"

Is it comprehensive and accurate enough? I'm really not sure :D It's a start, but I don't know if it's good enough.
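To make the "which db implementation are we talking about?" point concrete, here is a minimal Kotlin sketch of how per-database data like the above could be modelled and queried. Everything in it is an assumption for illustration: `DatabaseDictionary`, `dictionaries`, and `randomEngine` are hypothetical names, not part of kotlin-faker, and the values are a small hand-copied subset of the lists above.

```kotlin
import kotlin.random.Random

// Hypothetical model of one per-database entry (not a kotlin-faker type).
data class DatabaseDictionary(
    val columns: List<String>,
    val types: List<String>,
    val engines: List<String>,
    val collations: List<String>,
)

// Small illustrative subset of the data above, keyed by database name.
val dictionaries: Map<String, DatabaseDictionary> = mapOf(
    "postgresql" to DatabaseDictionary(
        columns = listOf("id", "name", "email"),
        types = listOf("INTEGER", "TEXT", "JSONB"),
        engines = emptyList(), // Postgres has no pluggable storage engines
        collations = listOf("en_US.UTF-8", "de_DE.UTF-8"),
    ),
    "mysql" to DatabaseDictionary(
        columns = listOf("id", "name", "email"),
        types = listOf("INT", "VARCHAR", "DATETIME"),
        engines = listOf("InnoDB", "MyISAM"),
        collations = listOf("utf8mb4_general_ci", "latin1_swedish_ci"),
    ),
)

// Returns a random storage engine for the given database, or null when the
// database is unknown or has no engines listed.
fun randomEngine(database: String, random: Random = Random.Default): String? =
    dictionaries[database]?.engines?.randomOrNull(random)

// Returns a random column type for the given database, or null if unknown.
fun randomType(database: String, random: Random = Random.Default): String? =
    dictionaries[database]?.types?.randomOrNull(random)
```

Keying the data by database means a caller asking for a Postgres storage engine gets `null` rather than a MySQL-only value such as `InnoDB`.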

Additionally, in case you have a very specific use-case, I'd recommend taking a look at the "creating your own data providers" docs. This functionality has been available since version 2.0.0-rc.1 and allows you to extend the faker implementation and create your own data providers ;)

I'll still keep this issue open in case you or anyone else wants to work on this. Seems like a good "first issue" :)

mboisnard commented 7 months ago

Hello @serpro69 , thanks for your answer.

Yes, actually I was talking about the same behavior as faker-js, and I completely agree with you that its current implementation is generic and can be improved to match the possible data for each database.

I will take a look at your documentation, and try to contribute to the project :)

serpro69 commented 7 months ago

Contributions are always welcome :) Thanks!

I think this https://github.com/serpro69/kotlin-faker/blob/master/CONTRIBUTING.adoc#adding-new-functionality should help with the implementation of this issue. But also feel free to ask if you need any help.

As I mentioned, the bigger part of the task here would be to gather the data itself. After that you should be able to follow the above documentation to add a new data provider implementation, but if something is unclear there, please let me know. I'd also like to improve the contributing guidelines if they're not good enough.

Just a few suggestions also:

mboisnard commented 7 months ago

Thanks for your suggestions. I created a branch to implement the database behavior and I have several questions for you :)

For the MongoDB provider, I would like to create a generateObjectId method based on a random date, inspired by the logic I found in JS here (https://steveridout.com/mongo-object-time/).
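A rough Kotlin sketch of that idea, assuming the usual ObjectId layout of 12 bytes rendered as 24 hex characters, where the first 4 bytes are the creation time in seconds since the Unix epoch. The function names and signatures below are only illustrative, not existing kotlin-faker API, and the remaining 8 bytes (random value and counter in a real ObjectId) are simply randomized, which should be enough for fake data.

```kotlin
import java.time.Instant
import kotlin.random.Random

// Hypothetical helper: builds a 24-character hex string laid out like a
// MongoDB ObjectId for the given instant.
fun generateObjectId(timestamp: Instant, random: Random = Random.Default): String {
    // First 4 bytes: seconds since the Unix epoch, as 8 hex characters.
    val seconds = timestamp.epochSecond and 0xFFFFFFFFL
    val timePart = seconds.toString(16).padStart(8, '0')
    // Remaining 8 bytes: randomized here instead of a real
    // per-process random value plus incrementing counter.
    val tail = random.nextBytes(8).joinToString("") { byte ->
        (byte.toInt() and 0xFF).toString(16).padStart(2, '0')
    }
    return timePart + tail
}

// Reads the creation time back out of the first 8 hex characters.
fun objectIdToInstant(objectId: String): Instant =
    Instant.ofEpochSecond(objectId.substring(0, 8).toLong(16))

// Picks a random instant between two bounds (upper bound exclusive).
fun randomInstant(from: Instant, until: Instant, random: Random = Random.Default): Instant =
    Instant.ofEpochSecond(random.nextLong(from.epochSecond, until.epochSecond))
```

Combining the two, `generateObjectId(randomInstant(from, until))` would give an id whose embedded date falls inside the chosen range, and `objectIdToInstant` recovers that date again.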

serpro69 commented 7 months ago

Hey @mboisnard ,

Let me give you some existing code examples to make things easier to understand.

serpro69 commented 7 months ago

Don't know if the above made much sense :grin: Feel free to ask if you want me to clarify something further :)