Standard DB interfaces - Githubissues

kmadathil commented 6 years ago

Quoth @vvasuki in #9 " Taking the flexibility preference to a slightly higher level - it is a good idea not to be "married" to any database technology. Access it via an interface (such as DbInterface and ClientInterface here - PS: you don't have to implement every method). Switching to a different database tool should be as simple as calling a different class's constructor - one shouldn't have to go messing about anywhere else."

codito commented 6 years ago

peewee is a good lightweight ORM that may help. It abstracts the interaction with underlying db (sqlite, pgsql etc.)

vvasuki commented 6 years ago

Albeit, we want to use no-sql databases to store json docs. https://github.com/coleifer/peewee/issues/434 suggests that peewee doesnt support such.

codito commented 6 years ago

Yes, it is sql only.

avinashvarna commented 6 years ago

The sanskrit_util module makes use of sqlite and ORM for its data. Would switching over to use that fully be an acceptable solution? The data we are currently using on the sanskrit_util branch includes an old version of the INRIA data. I have filed an issue to update it to include the latest INRIA data.

Given that it is SQL, will that work, or do we continue to think about nosql solutions?

vvasuki commented 6 years ago

If you're using SQL and intend to continue to use SQL (which I think is a bad choice given that nosql dbs are a simpler design than sql + orm) - yes. Otherwise, no.

codito commented 6 years ago

(just my 2 cents :)) I'd prefer sqllite for following reasons:

Data access patterns in this case are purely local and read heavy (i.e. a python module accesses the data locally stored)
Data may be versioned, sql has well established servicing patterns
Why sqllite? Smaller size, zero dependency, bundled/redist, comes as a built-in python module
I don't think we use transactions, locks or other heavy weight ACID capabilities (data is mostly immutable for us). We may want faster lookups - indexes will help, we may want joins too
ORM layer keeps us future proof: say the data is humongous and we prefer to host it independently. We could write an adapter to query the web interface of DB and produce plain objects. As long as rest of the code uses objects, we require minimal change and nothing breaks. We can deal with scale problem if it comes later in future

vvasuki commented 6 years ago

I should point out that the following makes zero difference in the sql-nosql choice, and should be disregarded:

Why sqllite? Smaller size, zero dependency, bundled/redist, comes as a built-in python module (my note: compare https://pypi.python.org/pypi/tinydb )
Data access patterns in this case are purely local and read heavy (i.e. a python module accesses the data locally stored)
I don't think we use transactions, locks or other heavy weight ACID capabilities (data is mostly immutable for us). We may want faster lookups - indexes will help, we may want joins too

And the following actually favors nosql:

ORM layer keeps us future proof: say the data is humongous and we prefer to host it independently. We could write an adapter to query the web interface of DB and produce plain objects. (my note: you are postponing the problem to the future, and commit to keep up the ORM layer to do translation for eternity.)

From my ancient sql experience, I did not understand what was meant with "Data may be versioned, sql has well established servicing patterns" ..

A good choice would have to consider factors mentioned in https://www.couchbase.com/resources/why-nosql - mainly flexibility and non-ugliness (ie why one finds json naturally more intuitive than data produced by a few joins).

codito commented 6 years ago

Why sqllite?...

Initially I thought about a dependency on mongodb or the likes. Installed size comes to 136M on a linux box :) But tinydb or the likes are probably better for embedded use. We can disregard this point since they are both smaller.

Data access patterns... I don't think we use transactions...

In my opinion, these tradeoffs are probably one way to make a sql vs nosql choice. Have used both in production in different cases (the recent one is a sql paas that stores TBs). The ugliness factor is limited to only database layer, often the customer doesn't see it :) They do notice the performance and reliability. Downside of sql: ACID provides guarantees, costs performance. Arstechnica does cover this well in first part of this article.

ORM...

Both SQL and NoSQL cases may use ORM. This is probably good design, rather than a pro/cons for any db technology choice. It's just simple layered architecture where the database concepts are limited to one layer, all the other components deal with plain old objects. Objects stay decoupled and forever.

postponing the problem...

Sorry if it appeared so, my intention isn't that. I am only attempting to evaluate quantitatively based on a possible use case. As I mentioned earlier, having used nosql, I am not against it at all :)

Data may be versioned and servicing patterns

We face an interesting problem in production with n instances of application tier, m instances of data tier, p instances of connected clients. Servicing refers to updating these pieces in an order so that upgrade is live and provides 99.99 availability. More often this fails hence it is essential to rollback to ith version of the schema. (not saying this is not possible with nosql, probably sql patterns have stayed for few more decades longer)

Coming back to our use case. Are these assumptions correct? How do they map to the future vision of this library?

Mostly the data will be around dhatu forms, or the INRIA data?
Will this data be mutated by customer input? E.g. new dhatu forms, or a new data source
Are these related data or fairly isolated from each other? E.g a morpho engine could use multiple sources based a common identity e.g. the dhatu?
What's the typical size? I am assuming we will not hold these in-memory since they will likely go into tens of MBs? So how will the look up be - e.g. on a key; looking at the DhatuWrapper, this is probably DAtuH field.
What are key tradeoffs we're willing to make?
- Performance: what's our threshold for a lexical analysis for example - a second, sub-second, minute?
- Reliability/consistency: same answer for same inputs forever :)
- Data storage format: json, binary, etc.. This is how the technology stores the data, not how we show it to the user or another library.
- Security probably doesn't apply since local transactions

vvasuki commented 6 years ago

Well, we just have to focus on the main thing matters to us - programmers who are contributors and users of this package. As you rightly said: "Data storage format: json, binary, etc.. This is how the technology stores the data, not how we show it to the user or another library." Speed, ACID and such concerns fade away into the background in comparison.

sanskrit_parser will interact with the rest of the world mostly through json (or something like that, - say a protocol buffer). This is simplest if it used, produced and consumed json natively. We should not have to spend our time mucking around with sql (it is the library's headache how it stores this stuff internally) : we should be able to say "give me details of this pada or dhAtu or sentence" and get such detail in the most convenient (json-like) form, which can then be mechanically deserialized into python objects (using jsonpickle or a wrapper thereof, which I suppose is subsumed by "ORM"?). You rightly say: "The ugliness factor is limited to only database layer, often the customer doesn't see it", to which I say - even we shouldn't need to deal with it.

My experience is that sanskrit data gets mutated a lot, and flexibility is important - with json this becomes as simple as adding or moving a sub-object or a field; while in sql, you'd add a new table, define a join and then write a module to make a json object out of it.

I've looked at versioning in the context of mongodb, which does not natively support versions - the solutions seemed quite simple (basically have a seprate version db). In our case, I think that any release of sanskrit_parser code will expect a certain specific version of some data - else it will prompt an upgrade - no real need to go back and forth on versioned data. (This has been the same in case of https://github.com/sanskrit-coders/stardict-sanskrit/ as well)

kmadathil commented 6 years ago

sanskrit_parser will interact with the rest of the world mostly through json (or something like that, - say a protocol buffer).

@vvasuki : Are you talking of the API mode, or do you think it's better to have a JSON wrapper for programmatic python access as well?

Coming back to our use case. Are these assumptions correct? How do they map to the future vision of this library? @codito: Mostly the data will be around dhatu forms, or the INRIA data?

Data will be dhatu related (dhatupAtha etc.) and form related (INRIA, sanskrit_utils, the neural net L0 that we've plans for ...)

Will this data be mutated by customer input? E.g. new dhatu forms, or a new data source

Can't see how at the moment

Are these related data or fairly isolated from each other? E.g a morpho engine could use multiple sources based a common identity e.g. the dhatu?

Yes - I could see multiple sources being used. We already use dhAtupAtha for some dhatu information, and another db for forms of the same dhAtu.

What's the typical size? I am assuming we will not hold these in-memory since they will likely go into tens of MBs? So how will the look up be - e.g. on a key; looking at the DhatuWrapper, this is probably DAtuH field.

Take a look at the current ~/.sanskrit_parser/data directory, I expect filesizes to stay similar.

What are key tradeoffs we're willing to make? Performance: what's our threshold for a lexical analysis for example - a second, sub-second, minute? Reliability/consistency: same answer for same inputs forever :) Data storage format: json, binary, etc.. This is how the technology stores the data, not how we show it to the user or another library. Security probably doesn't apply since local transactions

My expectation would be reliability, followed by performance.

@vvasuki has more experience in the interactions between various such projects

vvasuki commented 6 years ago

@vvasuki : Are you talking of the API mode, or do you think it's better to have a JSON wrapper for programmatic python access as well?

API mode should definitely be JSON-faced.
Programmatic python access should be json-serializable python objects (more or less every python object if you use jsonpickle type library).

kmadathil / sanskrit_parser

Standard DB interfaces #61