Closed SemyonSinchenko closed 3 days ago
Hi @SemyonSinchenko
You raise a good point. I am no lawyer, but I agree with your conclusion that this library would be bound by the same terms as dbldatagen.
Initially I made this for a personal project, but since this library is starting to gain a small number of stars and I would hate for it to be useless to others depending on whether they use Databricks, I'll look at replacing the dbldatagen library with something more permissive. The Faker library uses the MIT licence, so it's probably a better fit.
The API for generating data is likely to change, since the initial one was designed around the API for dbldatagen, but I will do my best to keep the feature set at parity.
Thanks for being thorough in raising these licence concerns.
Hey @mitchelllisle, would you consider removing the data gen capability from this package altogether? IMO there are a few good reasons for this:
SparkModel will be more cohesive with a single responsibility (simply a translator between pydantic models and Spark schemas), and therefore has fewer reasons to change.
Regarding the last point, polyfactory and/or Faker (as you suggest) are most common. Whether or not dbldatagen is replaced by one of these (or something else), the data gen functionality could be split out into another package in the future as another consideration.
I think removing the data gen component will make this package viable for its first major version release 🥳
@mitchstockdale I'm not opposed to this at all. Originally I wanted to bundle both of these features since that's what I was after for my project, but in hindsight I think it makes sense to either make this feature optional via install extras (i.e. pip install sparkdantic[datagen]) or create another repo as you suggested.
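For readers unfamiliar with the install-extras option mentioned above, here is a minimal sketch of how a package commonly guards an optional dependency at call time. It is not sparkdantic's actual code: the helper `require_extra` and the entry point `generate_data` are hypothetical names, and only the extra name `datagen` comes from the comment above.

```python
import importlib.util


def require_extra(module_name: str, extra: str, package: str = "sparkdantic") -> None:
    """Raise a helpful ImportError if an optional dependency is missing.

    Hypothetical helper: not part of sparkdantic's API.
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install {package}[{extra}]"
        )


def generate_data(n_rows: int):
    """Hypothetical entry point for an optional data generation feature."""
    # Check for the optional dependency only when the feature is used,
    # so the core package stays importable without it.
    require_extra("dbldatagen", extra="datagen")
    raise NotImplementedError("illustrative only")
```

With this pattern, users who only need schema translation never import the optional library, and those who call the data generation feature without the extra get an actionable error message.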
Either way, I think v1.0.0 could ship without it to begin with. I'll leave this open for a week or so, and if there are no objections I can release with this removed. It will also resolve @SemyonSinchenko's licence concerns, which I haven't had time to address this year. Removal is much quicker.
Thanks for the suggestion
Tagging some other contributors for any feedback they might have: @dan1elt0m @chidifrank
Hey @mitchelllisle, Thanks for tagging. Frank and I have already had a discussion on this topic, and we believe that the datagen feature doesn't make much sense for our needs. Instead, we recommend using Polyfactory, which offers more flexibility and features. FYI, Sparkdantic and Polyfactory work really well together. Have applied it to many complex models without any problems.
Can confirm, especially when you want to generate data for more complex types like arrays: the Databricks datagen library is not very intuitive.
With polyfactory + sparkdantic
@mitchelllisle
Great, sounds like we're all in agreement. Happy to remove it and do a v1.0.0 release this weekend.
New release has been cut https://github.com/mitchelllisle/sparkdantic/releases/tag/v1.0.0
Thanks for the time and effort you've all put in on this little library 😄
@SemyonSinchenko I'll close this issue off now since I believe the licence is more aligned now. Let me know if you have any other concerns
Thank you again for that nice library! The overall idea looks very nice!
I was checking the code and found that sparkdantic depends on dbldatagen. But that library is distributed under the commercial Databricks License. Do the terms of the dbldatagen license apply to sparkdantic too? In other words, can I use sparkdantic outside of the Databricks Platform?