Review and Feedback - Githubissues

tomschr commented 8 years ago

@sknorr: This is probably interesting for you. ;-) We talked about this last week.

This is the very, very first draft of a "terminology system". The bare bones of the idea are there, but it still needs to be filled in with the real implementation. I've created two terminology files, both as .fodt (Flat LibreOffice XML) and .csv. Feel free to change, correct, or amend them.

Of course, test cases, documentation, and the Python infrastructure need to be created too.

Some questions about this crazy idea:

1. Does it have any value?
2. Should we follow this path?
3. Naming?
4. If all of the above is ok, should we introduce it to our team?
5. Better separate the CLI tool into an admin tool (create, import, etc. some terminology data) and a user tool (just the query)?

Some further ideas to discuss:

Currently it is designed as a CLI tool, but maybe it would be better in the long term to offer this as a Web service
A CLI tool should still be there to access the service from the command line
Anything else?

ghost commented 8 years ago

Hi Tom,

this looks very interesting... (edited your post, so your questions are numbered)

1. Yes, I think it does.
2. I think it is promising, but I can't speak for everyone.
3. As long as you don't call it "you-shall-not-pass"...
4. It does not seem to work right now, so maybe we should work to close this functional gap.
5. Yes. For the most part, one or two people will enter stuff, and most people will just want to query.

Terminology List/Database Format

I am not sure about either the FODS or the CSV you provided as the source (though the CSV seems better). Having worked with an FODT on my thesis, I can tell you that LibreOffice is emphatically not git-compatible. It does not do minimal diffs, it always renames the styles in your document source, and occasionally you get superfluous span-like elements. If you want useful diffs (e.g. to look up when an item changed) from a LibreOffice-compatible file, the CSV seems much better. At the same time, CSVs can't store any markup and are a bit hard to edit by hand (counting columns is not so fun), especially when they get long. I guess they could be made to work, though, with enough tooling for editing and validation.

In any case, here are some ideas of what kinds of things the ideal format should support (wishful thinking, of course):

cross references (references between entries, ideally also from running text)
if we have cross references, we probably also need IDs (and the title is not enough -- the same entry might be in there multiple times in multiple contexts, once accepted, once rejected)
ideally, a simple graphical editor should already be available (should allow editing as a table and/or on an individual page; support for making the table head sticky; in fields that only allow certain choices, it should allow selecting only those values)
should be validatable
should be easy to transform to DocBook (for the style guide) and into a queryable format for the tool
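To make the wish list above a bit more concrete, here is a minimal sketch of what an entry with an ID, a status, and cross references could look like, together with a small validation pass. All names (`Entry`, `Status`, `see_also`, `validate`) are hypothetical and only illustrate the idea; nothing here is an agreed format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Assumed set of allowed states for an entry."""
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Entry:
    entry_id: str                 # unique ID; the term alone is not unique
    term: str
    status: Status
    context: str = ""
    see_also: list[str] = field(default_factory=list)  # IDs of other entries


def validate(entries):
    """Return a list of problems; an empty list means the data is valid."""
    problems = []
    seen = set()
    # First pass: every ID may occur only once.
    for entry in entries:
        if entry.entry_id in seen:
            problems.append(f"duplicate ID: {entry.entry_id}")
        seen.add(entry.entry_id)
    # Second pass: every cross reference must point at a known ID.
    for entry in entries:
        for ref in entry.see_also:
            if ref not in seen:
                problems.append(f"{entry.entry_id}: unknown cross reference {ref}")
    return problems
```

With IDs in place, the same term can appear once as accepted and once as rejected without the cross references becoming ambiguous.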

Tool functionality

In terms of functionality, what I think would be necessary (could be separated among multiple executables):

User tool: Query the database(s) [in the style guide, we already have "general vocab" and "terminology"], load/connect to a database [maybe just being able to point the tool at URLs which it would then cache would be good]
Admin tool: Very much up in the air depending on the chosen input format. The one thing it would definitely need to do is convert input format to DocBook.
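As a rough illustration of the user tool, the sketch below queries several already-loaded databases at once and reports which one each hit came from. The function name `query`, the dictionary layout, and the `term`/`status` keys are assumptions for this example; loading from a URL and caching are left out.

```python
def query(databases, word):
    """Return (database name, entry) pairs whose term equals word, ignoring case.

    `databases` maps a database name to a list of entry dictionaries.
    """
    needle = word.casefold()
    hits = []
    for name, entries in databases.items():
        for entry in entries:
            if entry["term"].casefold() == needle:
                hits.append((name, entry))
    return hits


# Two example databases, named after the two lists in the style guide.
databases = {
    "general vocab": [{"term": "Backup", "status": "accepted"}],
    "terminology": [{"term": "back-up", "status": "rejected"}],
}
hits = query(databases, "backup")
```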

tomschr commented 8 years ago

Thanks for the feedback, great! :+1:

> I am not sure about either the FODS or the CSV you provided as the source (though the CSV seems better).

Right, the .fods was just added as a format for "easier" editing, although this could also be done with CSV, if possible.

[...]

I like all of your comments. Especially the separation of concerns (separate tools for our audience) is a good idea.

Should we define some workflows first? Or design the entries in our database? Maybe I'll start with some workflows, as they could also influence our database design.

The following describes the typical admin workflows. Not sure if it all makes sense; it is definitely not carved in stone. Feel free to correct it. :smiley:

Admin Workflows

For the time being, we define "database" as a file somewhere in the filesystem. It could be a pickle, shelve, or SQLite file. Of course, the code should be easily extensible to anything else, like MariaDB, PostgreSQL, or whatever database is out there, as long as there are Python bindings. :smirk:

Create a database

1. Call the termadmin script to create the database (file). The script expects a location, a name, and a type (pickle, shelve, or sqlite). The name is used to differentiate between different terminology databases.
2. Save the name, type, and location from step 1 into the config file ~/.config/suse/terminology.conf.
3. Create the empty database.
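The three steps above could be sketched roughly as follows, here only for the sqlite type. The function name `create_database`, the one-section-per-database config layout, and the `terms` table schema are assumptions made for this example; the config path is passed in as a parameter so the sketch does not touch ~/.config/suse/terminology.conf unless asked to.

```python
import configparser
import sqlite3
from pathlib import Path


def create_database(name, location, config_path, db_type="sqlite"):
    """Register a terminology database in the config file and create it empty."""
    if db_type != "sqlite":
        # pickle and shelve are part of the idea but not sketched here.
        raise NotImplementedError(f"backend not sketched here: {db_type}")

    # Step 2: save name, type, and location; one config section per database.
    config_path = Path(config_path).expanduser()
    config = configparser.ConfigParser()
    config.read(config_path)  # a missing file is silently ignored
    config[name] = {"type": db_type, "location": str(location)}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with open(config_path, "w") as fh:
        config.write(fh)

    # Step 3: create the empty database (hypothetical schema).
    conn = sqlite3.connect(str(location))
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS terms ("
            "id INTEGER PRIMARY KEY, term TEXT NOT NULL, status TEXT)"
        )
    conn.close()
```

A call could look like `create_database("general", "/tmp/general.sqlite", "~/.config/suse/terminology.conf")`.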

Import into a New Database

1. Call the termadmin script to import a CSV file. The script expects the name of the terminology database and the CSV file.
2. Read in type and location from the config file.
3. Open the CSV file and iterate through each row. This should be done with the csv module.
4. Add each row into the database. (Question: should IDs be autogenerated somehow? Or should the CSV file contain them?)
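A minimal sketch of steps 3 and 4, assuming an SQLite backend and a CSV file with the columns `term` and `status` (both assumptions, as is the function name `import_csv`). In this variant the IDs are autogenerated by SQLite, which is only one possible answer to the open question above.

```python
import csv
import io
import sqlite3


def import_csv(conn, csv_file):
    """Insert every CSV row into the terms table; return the number of rows added."""
    reader = csv.DictReader(csv_file)  # the first CSV line is the header
    rows = [(row["term"], row["status"]) for row in reader]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO terms (term, status) VALUES (?, ?)", rows)
    return len(rows)


# Example run against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT, status TEXT)")
sample = io.StringIO("term,status\nbackup,accepted\nback-up,rejected\n")
added = import_csv(conn, sample)
```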

Import into an Existing Database

1. Call the termadmin script to import a CSV file. The script expects the name of the terminology database and the CSV file.
2. Read in type and location from the config file.
3. Open the CSV file and iterate through each row. Skip existing entries.
4. Add each remaining row into the database.
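Skipping existing entries could look roughly like the sketch below. What counts as "existing" is not defined yet; this example assumes an entry exists if a row with the same term and status is already stored. The function name, the column names, and the schema are hypothetical.

```python
import csv
import io
import sqlite3


def import_skipping_existing(conn, csv_file):
    """Add CSV rows that are not in the terms table yet; return (added, skipped)."""
    added = skipped = 0
    with conn:  # one transaction for the whole import
        for row in csv.DictReader(csv_file):
            key = (row["term"], row["status"])
            exists = conn.execute(
                "SELECT 1 FROM terms WHERE term = ? AND status = ?", key
            ).fetchone()
            if exists:
                skipped += 1
            else:
                conn.execute("INSERT INTO terms (term, status) VALUES (?, ?)", key)
                added += 1
    return added, skipped


# Example: one row is already present, one is new.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT, status TEXT)")
conn.execute("INSERT INTO terms (term, status) VALUES ('backup', 'accepted')")
sample = io.StringIO("term,status\nbackup,accepted\nback-up,rejected\n")
result = import_skipping_existing(conn, sample)
```

Because each row is checked against the database right before it is inserted, duplicates within the CSV file itself are skipped as well.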

Edit the Database

(Question: Not entirely sure about this)

1. Call the termadmin script. The script expects the name of the terminology database and some parameters(?)
2. Search for the entry in the database. If it doesn't exist, raise an error.
3. If the entry exists in the database, overwrite it with the parameters from step 1.
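Since this workflow is still an open question, the following is only one possible reading of it: look the entry up by ID, raise an error if it is missing, and otherwise overwrite the given fields. The function name `update_entry`, the `EDITABLE` column list, and the table schema are assumptions for this sketch.

```python
import sqlite3

# Columns that may be overwritten; also guards the column names used in the SQL.
EDITABLE = ("term", "status")


def update_entry(conn, entry_id, **changes):
    """Overwrite fields of an existing entry; raise LookupError if it is missing."""
    unknown = set(changes) - set(EDITABLE)
    if unknown:
        raise ValueError(f"not editable: {sorted(unknown)}")
    found = conn.execute("SELECT 1 FROM terms WHERE id = ?", (entry_id,)).fetchone()
    if found is None:
        raise LookupError(f"no entry with ID {entry_id}")
    if not changes:
        return  # nothing to overwrite
    assignments = ", ".join(f"{column} = ?" for column in changes)
    with conn:
        conn.execute(
            f"UPDATE terms SET {assignments} WHERE id = ?",
            (*changes.values(), entry_id),
        )
```

A call such as `update_entry(conn, 1, status="rejected")` would then change only the status of entry 1.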

tomschr / termquery

Review and Feedback #1