themoonsheep / moonsheep

Moonsheep digitizes huge, messy paper and PDF archives through crowdsourcing and cutting edge technology.
http://moonsheep.org
GNU Affero General Public License v3.0

As a developer, I want to define validation rules that need to pass before a task is removed from the task pool. #84

Closed: georgeslabreche closed this issue 6 years ago

georgeslabreche commented 7 years ago

PyBossa's built-in validation is purely redundancy-based; we need to implement support for custom validations, for instance the ability to indicate that some fields need to have equal values entered at least n times.
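
For illustration, such an "at least n equal entries" rule could be expressed as a small predicate. The names below are made up for this sketch and are not an existing Moonsheep or PyBossa API.

```python
from collections import Counter

def min_equal_votes(values, n=2):
    """Hypothetical rule: pass once at least `n` contributors entered
    the same value for a field."""
    if not values:
        return False
    _, votes = Counter(values).most_common(1)[0]
    return votes >= n

# e.g. a "budget_total" field is accepted after 3 matching entries
assert min_equal_votes(["100", "100", "105", "100"], n=3)
```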

TODO:

TECHNICAL NOTE: I only see two ways of implementing this:

  1. Validation is made every time data is persisted. To achieve this we need a trigger mechanism that will run custom validation functions. Where do we want to support this callback mechanism? Do we define a validation interface that needs to be implemented in the back-end for every model & task importer plugin, or do we leverage PostgreSQL's trigger mechanism and PL/pgSQL? It feels like this would introduce a performance bottleneck. (A rough sketch of the callback idea follows after this list.)
  2. We have a cron job that checks all data. Not sure how scalable this is or how we could frame it within strict implementation guidelines. Perhaps define cron job interfaces? PyBossa actually has a Jobs module for running background tasks in the PyBossa server (e.g. jobs include updating the stats page and sending out e-mails). Check out sched.py and jobs.py for the technical implementation of PyBossa's Jobs module. The con of this approach is that task processing completion won't depend only on how many people are submitting tasks but also on how often the cron job runs.
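
A very rough sketch of how the option 1 callback could be wired in; `TaskValidator`, `on_task_run_saved` and `remove_task_from_pool` are hypothetical names, not existing Moonsheep or PyBossa code.

```python
class TaskValidator:
    """Interface each model / task importer plugin would implement (hypothetical)."""

    def is_complete(self, task, task_runs):
        """Return True once the collected task runs satisfy the custom rules."""
        raise NotImplementedError


def remove_task_from_pool(task):
    # placeholder for whatever marks the task as done in the task pool
    print(f"task {task} removed from pool")


def on_task_run_saved(task, task_runs, validator):
    # hook called by the persistence layer every time a task run is stored
    if validator.is_complete(task, task_runs):
        remove_task_from_pool(task)
```
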
KrzysztofMadejski commented 7 years ago

List out all the validations we want to support.

https://github.com/transparencee/moonsheep/issues/73

Figure out how to reconcile this with PyBossa's redundancy mechanism. Maybe the redundancy is just the maximum number of entries that tells the system when to give up on validating with the rules.

That's a neat idea. Such fuzzy tasks could be marked for inspection by a moderator. Nevertheless, it would be good to have a default redundancy for verified fields, so that number isn't repeated in every validation rule. I would keep PyBossa's redundancy limit as this default value for the number of verified entries and introduce a new limit for the maximum number of entries.

Thinking about redundancy for specific model fields: it may be too hard on performance. Maybe defining a custom redundancy limit per task would be enough?
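
As a sketch of the two limits discussed above (names are purely illustrative): `redundancy` reuses PyBossa's existing limit as the default number of matching entries a field needs, while a new `max_entries` cap hands the task to a moderator instead of collecting entries forever.

```python
from collections import Counter

def decide(entries, redundancy=3, max_entries=10):
    if entries:
        value, votes = Counter(entries).most_common(1)[0]
        if votes >= redundancy:
            return ("verified", value)
    if len(entries) >= max_entries:
        return ("needs_moderation", None)   # fuzzy task, flag for a moderator
    return ("keep_collecting", None)        # put the task back in the pool
```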

Validation is made every time data is persisted.

I'd go for that option, but queue such requests in a cron/ongoing job so as not to kill the database.

Do we define a validation interface that needs to be implemented in the back-end for every model & task importer plugin?

I'd do it like this. See https://github.com/transparencee/moonsheep/issues/73#issuecomment-308158913: Have a pluggable interface for verification rules
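
That pluggable interface is still to be designed in #73; as a sketch only, it might look like a small registry of per-field verification functions:

```python
VERIFIERS = {}

def verifier(field_name):
    """Decorator registering a verification rule for a model field (hypothetical)."""
    def register(func):
        VERIFIERS[field_name] = func
        return func
    return register

@verifier("document.total_amount")
def amounts_match(values, n=3):
    # accept the field once any single value has been entered at least `n` times
    return any(values.count(v) >= n for v in set(values))
```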

The algorithm for completing a task run would be:

Verification loop handling to_be_verified queue:

The con of this approach [cron] is that task processing completion won't depend only on how many people are submitting tasks but also on how often the cron job runs.

You can make the cron run quite often [5 mins? 1 min?], in one thread, and not quit until it has processed the whole queue, to get a sort of continuous job. The question is: do we want to parallelize validation? I guess not in an MVP.
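
In other words, something like the loop below, scheduled every minute or so; `pop_task`, `verify` and `mark_done` are hypothetical helpers standing in for the real queue and validation code.

```python
def verification_job(pop_task, verify, mark_done):
    """Single-threaded job: drain the to_be_verified queue, then exit."""
    while True:
        task = pop_task()    # returns None when the queue is empty
        if task is None:
            break            # quit; the next scheduled run picks up new work
        if verify(task):
            mark_done(task)  # remove the task from the pool
```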

georgeslabreche commented 7 years ago

You and I need to experiment with PyBossa's cron architecture. I had a really hard time with it when I tried to extend it, plus it didn't seem to like short-period crons. I eventually abandoned that quest, so it concerns me to have to revisit it. Off-the-shelf crons have always been problematic for me, requiring restarts from time to time.

KrzysztofMadejski commented 7 years ago

CKAN (Python-based) uses Celery tasks with Redis storage, with continuous work ensured by supervisor. It works like a charm in a Polish instance. See this wiki for installation plus configuration templates.
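
For reference, a minimal Celery setup with a Redis broker looks roughly like this (a generic sketch, not CKAN's or Moonsheep's actual configuration):

```python
from celery import Celery

app = Celery("moonsheep", broker="redis://localhost:6379/0")

@app.task
def verify_task(task_id):
    # run the custom validation rules for one task here
    ...
```

A supervisor-managed `celery worker` process then keeps consuming these tasks continuously.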

I can revisit PyBossa crons, just create a task and assign it to me.

georgeslabreche commented 7 years ago

Will do. In the meantime you can snoop around the sched.py and jobs.py files I linked in the issue description. PyBossa also uses Redis for caching.

KrzysztofMadejski commented 6 years ago

To note: I've heard people recommending RabbitMQ as a message broker and Python RQ as an asynchronous task queue instead of Celery.
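
For comparison, enqueueing such a job with RQ is a one-liner (note that RQ itself is Redis-only, so RabbitMQ would need a different client such as pika, or Celery with a RabbitMQ broker); the dotted path below is hypothetical:

```python
from redis import Redis
from rq import Queue

queue = Queue("verification", connection=Redis())
queue.enqueue("moonsheep.verification.verify_task", 42)
```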