juji-io / datalevin

A simple, fast and versatile Datalog database
https://github.com/juji-io/datalevin
Eclipse Public License 1.0
1.07k stars, 60 forks

[QUESTION]: separate the client into a small lib? #233

Closed jimpil closed 5 months ago

jimpil commented 6 months ago

Hi there,

First of all many thanks for all the work you've put into this!

Secondly, correct me if I am wrong, but if I want to target datalevin from more than one API, it has to run in network mode - i.e. server VS client(s). But then, those APIs all need to depend on the entire datalevin lib (in order to use the client namespace)?

If this is right, would it not make sense to split out the client code into its own little lib? The dependency list on the project.clj file isn't small, it includes things like timbre, and generally speaking, stuff that is (or should be?) unnecessary to clients.

I'd love to hear your thoughts - it sounds like a no-brainer to me, but I may be missing something... Thanks again :+1:

huahaiy commented 6 months ago

I am open to having an independent client library if someone is willing to take on the work.

jimpil commented 6 months ago

I wouldn't mind taking on the work, but the way I see it, you have already done it. To put it differently, I wasn't referring to a brand new client, but rather pulling out the existing/official client into its own lib. For someone else (other than you) to do that, he/she will need to understand your 'vision' (for lack of a better term). For example, are you willing to split the project into 4 libs?

If you are willing to do that, then I can certainly help, as I would only be re-arranging existing code - i.e. I wouldn't have to change any existing code, nor add anything new (just shift the existing stuff around). If, on the other hand, you're not willing to do that, and want a truly separate library, then it's not clear to me how to achieve this w/o duplicating a bunch of code (which then needs to be maintained and synced between two projects).

Anyway, I don't want to get too deep into suggestions/solutions etc just yet. My main point is that I am more than happy to help, as long as you provide some sort of vision and/or guidance. I hope this makes sense...

Thanks again :)

huahaiy commented 6 months ago

If you mean to produce multiple jar artifacts from the same code base, we can certainly do that.

jimpil commented 6 months ago

Cool, so I think the first step towards that direction would be to impose a folder/package structure, as per my previous comment. Basically, something like this:

This will help us identify whether this

  1. is possible
  2. is worth doing (if core ends up being huge, what's the point)

I can have a go at this over the xmas holidays (hopefully). However, if this is indeed possible and worth doing, these 'modules' will (eventually) have to live in their own repo, have their own project.clj file etc, right? Or is there tooling that can help with releasing these separately?

huahaiy commented 6 months ago

I wouldn't want to have separate repos for these.

jimpil commented 6 months ago

Sorry, I meant in its own project/folder. So it would be a monorepo of a parent-project, containing 3 child-projects. This way you will be able to build each project separately (including the parent). It could look like this (w/o using any lein plugins):

At this point, the development flow hasn't really changed (it's still kind of a monolith). The same goes for releasing something that looks like today's artifact (i.e. with everything in). The difference is that you can now go into modules/client, and build/release it separately (same for server). I can't actually think of a use-case/reason for releasing core on its own, but it will be possible too.

Finally, I just want to say that there are various related lein-plugins (lein-monolith, lein-parent etc) that can do/help with this. I haven't had to use any of them personally, but maybe you have, and can think of an easier route?

Any thoughts?

huahaiy commented 6 months ago

We are already doing that for the native version of Datalevin.

On the other hand, I am not convinced this is something worth doing. Putting these into separate artifacts has marginal benefits at best, but also increases mental overhead on the part of users.

The intention of my design is to give users a seamless experience switching between embedded use and server/client use, by simply changing the dir argument. With separate artifacts, they would have to require different libraries to switch mode, which increases the chances of making mistakes. Not to mention the complexity introduced to properly test and release the artifacts. All this is just to save a few kb in the jar? I don't think it makes too much sense to be honest.

Datalevin is a simple project, and I don't want to give people the impression that it is complicated and has many moving parts.
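To make the `dir` argument concrete, here is a minimal sketch of the switch described above. It assumes the `datalevin.core` API as shown in the project README (`get-conn`, `transact!`, `q`, `db`, `close`). The schema, paths, credentials and the helper name `names-in-db!` are made up for illustration, and the remote call needs a running Datalevin server.

```clojure
(require '[datalevin.core :as d])

;; Illustrative schema, not taken from the thread.
(def schema {:name {:db/valueType :db.type/string}})

(defn names-in-db!
  "Opens a connection at dir, adds one entity, and returns all names.
  Hypothetical helper: nothing in it depends on whether dir is local or remote."
  [dir]
  (let [conn (d/get-conn dir schema)]
    (try
      (d/transact! conn [{:name "Ada"}])
      (d/q '[:find ?n :where [_ :name ?n]] (d/db conn))
      (finally
        (d/close conn)))))

;; Embedded use: dir is a local directory.
(names-in-db! "/tmp/datalevin-embedded-demo")

;; Client/server use: dir is a dtlv:// URI (placeholder credentials and host).
;; (names-in-db! "dtlv://myname:mypasswd@localhost/demo")
```

Only the string passed as `dir` differs between the two modes, which is the ergonomic property being defended here. It is also why a client-only consumer still pulls in the embedded code today.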

huahaiy commented 6 months ago

With that said, I am still open to releasing the client as a small library as an additional artifact, for people who will only use Datalevin in a client/server setting.

jimpil commented 6 months ago

> The intention of my design is to give users a seamless experience switching between embedded use and server/client use, by simply changing the dir argument. With separate artifacts, they would have to require different libraries to switch mode, which increases the chances of making mistakes. Not to mention the complexity introduced to properly test and release the artifacts.

I could argue that switching between embedded VS network mode should not be so seamless for anything. There is a reason literally all other DBs (that can work in network mode), make that distinction. But even if you disagree with that, I actually never suggested that you abandon the 'embedded' variant - just build it up from better separated pieces.

> All this is just to save a few kb in the jar? I don't think it makes too much sense to be honest.

Well, it's not really about the kilobytes in absolute numbers. You probably know this better than me, but I've had a skim through the code, and by my rough estimate the 'server' stuff is close to 70% of the code-base. I can't say how many kb that is, but I do know that I don't want unnecessary stuff on my classpath. I would like to have that 30% VS 70% where they respectively belong - you can understand that, right?

> Datalevin is a simple project, and I don't want to give people the impression that it is complicated and has many moving parts.

I honestly don't think that the server VS client distinction would make people think datalevin has many moving parts. In fact, the very reason we are here discussing this, is because I found it bizarre that it doesn't make that distinction! It is perfectly normal for a piece of software that can be used in various scenarios/contexts, to be decomposed into the natural smaller pieces for those scenarios/contexts.

> With that said, I am still open to releasing the client as a small library as an additional artifact, for people who will only use Datalevin in a client/server setting.

Ultimately, this is what I'm interested in. I still do not understand why the server has to have any client-related namespaces, but that's a much smaller deal than the opposite, so at this point, I have to ask... From a technical perspective, how do you see this working? How can this be achieved w/o splitting the code-base?

Thanks again...

huahaiy commented 6 months ago

You obviously have vastly overestimated how much "server" related code there is. Looking at the "index" branch that I am currently working on that will be released as 0.9.0, "datalevin.server" is a single namespace that contains 2305 lines of Clojure code, whereas we have 19848 lines of Clojure code and 28091 lines of Java code in the code base. So an estimate of 70% of server-related code is totally out of proportion.

The overarching goal of Datalevin is to achieve better ergonomics than the current generation of databases. Whatever other databases do or not do is not part of my concerns. I want my database to be simple to use, having a seamless experience without the need to change code when I decide to switch between embedded and client/server use. In the future, we will add distributed mode and the experience will be the same.

I don't want the database to be a big deal that demands lots of ceremony. I am a human-computer interaction researcher at heart and my concerns are human factors. Whatever specializations exist in computer science are but means to the end of a better user experience. Whatever other things people care about is not my concern. I hope this much is clear.
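For reference, the line counts quoted in this comment can be turned into percentages with a few lines of Clojure. The var and function names below are illustrative; the three numbers are the ones given above.

```clojure
;; Line counts as quoted in the comment above ("index" branch).
(def server-clj-lines 2305)   ; datalevin.server namespace
(def total-clj-lines 19848)   ; all Clojure code
(def total-java-lines 28091)  ; all Java code

(defn percent
  "Returns part as a percentage of whole."
  [part whole]
  (* 100.0 (/ part whole)))

;; Share of the Clojure code only: a little under 12 percent.
(println (format "%.1f%% of the Clojure code"
                 (percent server-clj-lines total-clj-lines)))

;; Share of Clojure plus Java: a little under 5 percent.
(println (format "%.1f%% of all code"
                 (percent server-clj-lines
                          (+ total-clj-lines total-java-lines))))
```

These figures cover the `datalevin.server` namespace alone and say nothing about the namespaces it requires.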

jimpil commented 6 months ago

> You obviously have vastly overestimated how much "server" related code there is. Looking at the "index" branch that I am currently working on that will be released as 0.9.0, "datalevin.server" is a single namespace that contains 2305 lines of Clojure code, whereas we have 19848 lines of Clojure code and 28091 lines of Java code in the code base. So an estimate of 70% of server-related code is totally out of proportion.

Well, ok the datalevin.server namespace might be about 2k lines, but it also depends on another 9 datalevin.* namespaces, so to imply that the server-related code is only ~10% of the codebase, is not very accurate either. In any case, it's not worth arguing over this - the only way of calculating this reliably would be to actually split the codebase, produce two separate artifacts, and finally compare their sizes. I personally cannot think of a single project that has the server VS client distinction, where the client is the bigger artifact. The meat & potatoes is usually (always?) the server.

> The overarching goal of Datalevin is to achieve better ergonomics than the current generation of databases. Whatever other databases do or not do is not part of my concerns. I want my database to be simple to use, having a seamless experience without the need to change code when I decide to switch between embedded and client/server use. In the future, we will add distributed mode and the experience will be the same.

How is it better ergonomics, or simpler to use, when every single client API needs to depend on a giant JAR, that includes the server stuff (+ timbre etc)? Even if the split is 50-50, I don't see how this is simpler, more productive, or even a good practice. Of course this is your project, and you can do whatever you want with it, but I do want to understand where you're coming from.

> I don't want the database to be a big deal that demands lots of ceremony. I am a human-computer interaction researcher at heart and my concerns are human factors. Whatever specializations exist in computer science are but means to the end of a better user experience. Whatever other things people care about is not my concern. I hope this much is clear.

I think we are using the word ceremony in slightly different ways. You seem to think that choosing between a client VS server artifact to import, is a big deal, whereas I think that importing a single (big) artifact that includes everything is a big deal. It is not my intention to try to convince you, or anything like that, but you did say in a previous message that you're still open to having an additional client library, so I'm asking once again - how do you see this working? What kind of work would you want to see towards that goal?

den1k commented 6 months ago

@jimpil, @huahaiy is by far the main contributor to datalevin. Yes, your point is not unsound but it’s kind of like boiling the ocean. There are more than a dozen nice-to-haves for datalevin in the realm of fewer deps, module-style code splitting, etc., so the question is not what would be better but what’s essential. Unless someone steps up to contribute at a significant level, the direction of the project is largely defined by @huahaiy.

Splitting deps is a menial, largely language-level refactoring task, likely not very specific to the datalevin codebase. So if you want client and server to be split up, fork the repo and give it a shot.

jimpil commented 6 months ago

@den1k

> @huahaiy is by far the main contributor to datalevin.

I understand, and respect that.

> ... Unless someone steps up to contribute at a significant level, the direction of the project is largely defined by @huahaiy.

First of all, the direction/vision of the project still lies with @huahaiy, even if someone steps up to contribute at a significant level, as you put it. Secondly, I did offer to contribute, as long as there was a clear vision/guidance.

> Splitting deps is a menial, largely language-level refactoring task, likely not very specific to the datalevin codebase. So if you want client and server to be split up, fork the repo and give it a shot.

I am not going to fork anything, and do any work that will likely be rejected when I open the PR. I prefer to coordinate with the author/maintainer on an agreed design/solution first, and that's precisely what I'm trying to do here. I did offer a sketch of a potential approach of splitting the codebase, with the ultimate goal of releasing smaller/more-focused artifacts, but it wasn't met with enthusiasm, so at this point I'm literally out of ideas, and seeking guidance (i.e. some sort of high-level approach)...

huahaiy commented 5 months ago

Requiring a single library is a lot less ceremony than having to figure out which namespaces should I require for my specific use case. That requires a user to understand how the project is structured and so on. The mental effort required is more than what is necessary. I always hate it when a Clojure library requires multiple namespaces to use. A user cares about his own project, he doesn't and shouldn't care about the details of the dependent project. We are not writing Java, after all. If one likes modularity and many, many files so much, she or he should probably stick with Java. Again, my concerns are human factors.

As I said, I am open to producing an additional "client" artifact from the code base when releasing it, with necessary tests on the client artifact. The work will be mostly a minor refactoring of the code and adding a project.clj file. Basically, the client artifact probably needs to include client, remote, protocol, lmdb, and the protocols portion of the storage namespace, as far as I can tell. Maybe even the core namespace. The rest can probably be ignored. This will also be a JVM library only, so we don't need to worry about the native part of the code.

bzg commented 5 months ago

Having followed this conversation, and chiming in as a simple user, I very much agree with @huahaiy on this:

> Requiring a single library is a lot less ceremony than having to figure out which namespaces should I require for my specific use case.

huahaiy commented 5 months ago

BTW, the server namespace is in fact all that there is for the server portion of the code. The other namespaces required by server will be there even if we don't have client/server mode; the only exception is protocol, which is shared code between server and client. So really, the claim of "server part is too heavy" is completely false.

The real argument, and the reason I allow an additional "client" artifact, is that for people only using client/server mode, there's no need to require all that code for embedded use. The code to support embedded use is the backbone of the code base as it implements the database proper, but it is not needed if one only needs to use a Datalevin client to talk with a remote Datalevin server.

jimpil commented 5 months ago

> If one likes modularity and many, many files so much, she or he should probably stick with Java.

The whole reason I opened this issue in the first place is so that I could have fewer files on my API classpaths.

> As I said, I am open to producing an additional "client" artifact from the code base when releasing it, with necessary tests on the client artifact. The work will be mostly a minor refactoring of the code and adding a project.clj file. Basically, the client artifact probably needs to include client, remote, protocol, lmdb, and the protocols portion of the storage namespace, as far as I can tell. Maybe even the core namespace. The rest can probably be ignored. This will also be a JVM library only, so we don't need to worry about the native part of the code.

Ok, we seem to be getting somewhere... How much work do you think that might be (roughly), and perhaps more importantly, do you need help with any of it?

> BTW, the server namespace is in fact all that there is for the server portion of the code. The other namespaces required by server will be there even if we don't have client/server mode; the only exception is protocol, which is shared code between server and client. So really, the claim of "server part is too heavy" is completely false.

My claim was not that it is too heavy, but rather, unnecessary for clients. Also, at first glance, one namespace depends on 9, whereas the other depends on 4 - it only makes sense to think/guess that one is 'heavier' than the other (not to mention that servers in general are 'heavier' than clients).

> The real argument, and the reason I allow an additional "client" artifact, is that for people only using client/server mode, there's no need to require all that code for embedded use. The code to support embedded use is the backbone of the code base as it implements the database proper, but it is not needed if one only needs to use a Datalevin client to talk with a remote Datalevin server.

Ok, that is very interesting - I hadn't thought of it this way. The target remains the same (avoid unnecessary code for client/server consumers), but I wasn't aware that embedded usage was 'a different thing' (for lack of a better term). I was under the impression that embedded usage was just a 'consequence' of bundling the server/client together. In any case, although an interesting/surprising turn of events, this doesn't change much, as far as goals go.

huahaiy commented 5 months ago

This is a ticket low on my priority list, so I would appreciate it if someone steps up and does the work.

jimpil commented 5 months ago

Ok, I'll have a go at it... Just so we're both on the same page - given what you said in a previous message, here is my initial plan:

  1. create a top-level folder containing a single file (i.e. project.clj) - this can be called something like client-artifact? The :source-paths defined in that project.clj will be something like ["../src/datalevin/common" "../src/datalevin/client"].
  2. restructure the datalevin namespaces so that the source-paths mentioned above exist - i.e. create the common/client packages, and put the appropriate namespaces in, hopefully w/o having to shuffle any functions around (i.e. no code-changes).

Does that make sense, do you see it working, and perhaps more importantly, is that what you had in mind? Finally, do you want me to work off master, or some other branch?
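A rough sketch of what step 1 of this plan could look like as a Leiningen project file. Everything in it is hypothetical: the artifact name, the version, the dependency list and the `modules/` layout are placeholders, not anything that exists in the Datalevin repository.

```clojure
;; client-artifact/project.clj (hypothetical; not part of the Datalevin repo)
(defproject datalevin-client-sketch "0.0.0-SNAPSHOT"   ; placeholder coordinates
  :description "Sketch of a client-only artifact built from the shared code base"
  ;; Placeholder: the real list would be whatever the client namespaces need.
  :dependencies [[org.clojure/clojure "1.11.1"]]
  ;; Each entry is a classpath root, so datalevin.client would have to live at
  ;; ../modules/client/src/datalevin/client.clj (assumed layout).
  :source-paths ["../modules/common/src"
                 "../modules/client/src"])
```

One detail worth noting: entries in `:source-paths` are classpath roots, and Clojure resolves a namespace such as `datalevin.client` relative to a root. Pointing a source path at a folder inside `src/datalevin/` would therefore not load the existing namespaces under their current names, which is why the sketch assumes one root per module instead.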

huahaiy commented 5 months ago

I would do something similar to native, which refers to the main project as the parent project. You probably will need a new datalevin.client namespace that delegates calls to datalevin.core to have full functionality, in addition to its server management functionality.

Also, there's a release.clj file that does all the testing, packaging and releasing. I would add the client piece there.

Working off master is fine.

jimpil commented 5 months ago

Apologies, but I'm not sure I fully follow...

> I would do something similar to native, which refers to the main project as the parent project.

Yes, I see that, but I can also see that its :source-paths bring in everything from the parent project. What kind of :source-paths do you see for client?

> You probably will need a new datalevin.client namespace that delegates calls to datalevin.core to have full functionality, in addition to its server management functionality.

You mean in this new 'child' project? Won't that conflict with datalevin.client proper?

> Also, there's a release.clj file that does all the testing, packaging and releasing. I would add the client piece there.

This is low-hanging fruit at this point (i.e. priority-wise) - I'd like to understand how building might/will work, before considering releasing.

If the project had a common/server/client structure, a simple :client-uberjar profile with :jar-exclusions would suffice, no?
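As an illustration of the profile idea in the previous sentence, a fragment along these lines is one possibility. The profile name and the regex are hypothetical, and the fragment assumes the server-only code has first been moved under a `datalevin/server/` folder, which is not the current layout.

```clojure
;; project.clj fragment (hypothetical profile name and regex)
(defproject datalevin-sketch "0.0.0-SNAPSHOT"
  :source-paths ["src"]
  :profiles
  {:client-jar
   ;; Files whose path matches a pattern are left out of the jar.
   ;; Assumes server-only code lives under src/datalevin/server/.
   {:jar-exclusions [#"datalevin/server/"]}})
```

Under these assumptions the jar would be built with `lein with-profile +client-jar jar`. Leiningen has a separate `:uberjar-exclusions` key for uberjars, and neither key changes `:dependencies`, so libraries such as timbre would still be declared by the resulting artifact unless the profile also overrides the dependency list.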

huahaiy commented 5 months ago

You need to figure out a way to bring in only things you need. Rearranging directory structure is fine, just don't go overboard. It will be fine as long as they are internal changes not affecting users.

jimpil commented 5 months ago

I am really sorry, but I am unable to progress with this. I've had a couple of attempts, but I can't figure out a way of doing this simply, and w/o breakage. The project is actually more complex than I anticipated, so I am going to leave this to someone more familiar with it...

kind regards