Design the synchronization of words to central storage

gustafl commented 8 years ago

Alert: This is a work in progress.

Why have a centralized storage?

The fundamental idea of Lexeme is to allow the user to pick his own material, and to build his own mental lexicon using this material. Each user's mental lexicon needs to be stored somewhere. While we plan to make use of local storage for caching words and documents, we also need a centralized storage, for the following reasons:

Users expect to be able to access their mental lexicon from various devices.
Local storage is fragile and may be erased for various reasons. A centralized storage acts as a backup.
A centralized storage allows us to compare users and measure their performance for statistical and gamification purposes.
There may also be monetary reasons for a centralized storage. If it's all stored locally, we can't charge subscription fees.

A collective lexicon

Users will probably want some form of verification that the words they put in their mental lexicons are words you can find in dictionaries, and that they are spelled correctly. And if we wish to compare users based on how many lexemes they have in their mental lexicons (or how many they accumulated during the last day, week or month), we need a way to validate those lexemes. This means we need something to validate against – a real lexicon. How do we get hold of a real lexicon?

One idea would be to generate a collective lexicon based on everybody's mental lexicon, and then use this as the normative source to validate against. This could be done at regular intervals, like once a day. For a word to be added to the collective lexicon, or at least suggested as an addition to a human moderator, it would have to be in the mental lexicons of multiple users (say 10). With time and enough users, the collective lexicon would become increasingly complete. However, the first pioneering users would be in a situation where almost no word is valid.

Perhaps we still need some human intervention before words are added permanently to the collective lexicon. The backend may generate a list of candidate words to add (e.g. words that now occur in the mental lexicon of 10 users), which is then processed by a human language moderator. This would also remove other potential problems, such as previously valid words becoming invalid, and the risk of exploits and trolling in the collective lexicon.
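
The candidate list could be generated along these lines. This is only a minimal sketch: the function name `candidate_words`, the shape of the input (a dict mapping a user id to that user's words) and the default threshold are assumptions made for illustration, not an existing part of Lexeme.

```python
from collections import Counter


def candidate_words(mental_lexicons, min_users=10):
    """Return words that occur in at least `min_users` mental lexicons.

    `mental_lexicons` is a hypothetical mapping of user id -> iterable of words.
    """
    counts = Counter()
    for words in mental_lexicons.values():
        # Use a set so each user counts at most once per word.
        counts.update(set(words))
    return sorted(word for word, n in counts.items() if n >= min_users)


# Example: "hund" is known by 10 users, "katt" by 12.
lexicons = {
    f"user{i}": ["hund", "katt"] if i < 10 else ["katt"]
    for i in range(12)
}
candidates = candidate_words(lexicons)
```

The result would then go to a human moderator rather than straight into the collective lexicon.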

I quite like the idea of a collective lexicon. It's interesting for statistical and research purposes. Even if we could get hold of good dictionaries somewhere, and the license fees wouldn't be too high (which is unlikely), I'd still want to try the collective lexicon.

The website

Another reason for having a centralized storage is that it can provide users with a web application, where the user can browse through words she has registered, practice them, look at fun statistics, diagrams, perhaps scoreboards, gamification badges etc.

Practice

The practice should be made as effective as possible. One of the greatest time-wasters with current solutions is that there is no "don't ask me this again" feature. Words marked as mastered should not turn up in the practice unless asked for, and there must always be an option to say "I know this, don't ask me again". The most effective exercise is typing translations from the native to the foreign (new) language. Clicking on written words (like in Memrise) may feel more effective, but it doesn't actually test your skill.

Khan Academy's way of proving mastery is a good solution (i.e. answering correctly 5 times in a row, on 5 different days).
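
One way to read that rule is sketched below. This is my interpretation of "5 times in a row, on 5 different days", not Khan Academy's actual algorithm; the function name `is_mastered` and the attempt format (a chronological list of `(day, correct)` pairs) are assumptions.

```python
from datetime import date


def is_mastered(attempts, required_days=5):
    """Return True if the current streak of correct answers spans
    at least `required_days` different days.

    `attempts` is a hypothetical chronological list of (date, bool) pairs.
    """
    streak_days = set()
    # Walk backwards from the latest attempt until the first wrong answer.
    for day, correct in reversed(attempts):
        if not correct:
            break
        streak_days.add(day)
    return len(streak_days) >= required_days


five_days = [(date(2016, 1, d), True) for d in range(1, 6)]
broken = five_days[:2] + [(date(2016, 1, 3), False)] + five_days[3:]
same_day = [(date(2016, 1, 1), True)] * 5
```

With this reading, a single wrong answer resets the streak, and several correct answers on the same day count only once.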

Synonyms

One question is how to handle synonyms in vocabulary practice. The user will get errors on many valid answers, unless he has registered those answers as valid translations. This may seem like a total show-stopper, but it's actually in line with the leading idea of Lexeme – to let the user build up his mental lexicon. Building up a mental lexicon includes the task of discovering synonymous translations to a word.
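
In code, checking an answer then only involves the translations the user has registered. A minimal sketch, with an assumed function name `check_answer`; ignoring case and surrounding whitespace is also an assumption, and may be wrong for languages where capitalization is significant.

```python
def check_answer(answer, registered_translations):
    """Return True if `answer` matches one of the user's registered translations.

    Comparison ignores case and surrounding whitespace (an assumption).
    """
    normalized = answer.strip().casefold()
    return normalized in {t.strip().casefold() for t in registered_translations}


# A valid synonym is rejected until the user registers it.
translations = ["sofa"]
before = check_answer("couch", translations)
translations.append("couch")
after = check_answer("couch", translations)
```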

gustafl commented 8 years ago

Coming back to this project after a long pause, I find that my thinking on a collective lexicon has changed a great deal. The key to solving all the problems of mapping words to a collective lexicon is to admit that we don't need one. The words in the document can be assumed to be correctly spelled. In practice, there will be almost no exceptions as long as users use reliable source material. And even when a misspelt word occurs

So, there is no need for a normative source to validate words against; no need for regular validation events to compare or synchronize mental and collective lexicons; no need to compare mental lexicons to find out which words are candidates for the collective lexicon; no need for human language moderators. Each user has her own mental lexicon, and that's it.

Bringing a normative lexicon into the application would also make the task of describing words in the form feel more like wasted effort. Letting each user create their own lexicon, with no links or mappings to external sources, will be a more enlightening and rewarding experience. Users will learn more from registering each word by themselves, using the form, instead of simply looking them up.

So, are there any drawbacks? Yes, but I believe they are acceptable:

Users may store bad data. Non-standard spellings, capitalizations, hyphenations and so on may be found in source texts and propagate to mental lexicons. In practice, this will be rare, as long as good sources are used.
There will be large overlaps between the mental lexicons of different users. This is more wasteful, of course, but it makes for a true representation of language learning.
If we're going to make a high score list of the users with the most registered words, there will be no way to spot cheaters. The application will simply count registered words, without checking whether they are actual dictionary words.
Some users may feel it's anarchy to let each person make her own lexicon with no way of validating it. The idea is unique among CALL applications, and some users may have a hard time accepting or understanding it.

gustafl commented 8 years ago

In summary, we need a central storage for user lexicons, but we don't need a collective lexicon. This issue can be closed, but we still need to define how words are saved to, updated in, and loaded from the database.
