As a developer, I want a mechanism that will allow me to persist a task as multiple tasks.

georgeslabreche commented 7 years ago

Currently, each task is stored as a record in a tasks table. Each input for that task is saved as a record in the task_runs table which references the task record in the tasks table.

The tasks table has a column called "status" which determines whether it is ongoing or completed. This value is updated from ongoing to completed when the number of task run input redundancy is met.

When importing tasks, we need a mechanism that will split that tasks into n amount of sub task records in the task table. This splitting rule could be defined as annotations in the model? When we run the importer it will have to interpret those rules? How is the connection made between the task importer and the model? Maybe it's all one importer/model plugin?

Maybe we need developers to implement their own importers for every project? In which case we would implement the importer interface as its own plugin? http://docs.pybossa.com/en/latest/importers.html

georgeslabreche commented 7 years ago

If a task is split and persisted as n tasks records in the tasks table, then it seems like we might have to introduce a new column in that table in order to keep track of subtasks that form the one task they form.

If introducing a new column is too disruptive, this relationship could also be managed via a new table.

KrzysztofMadejski commented 7 years ago

Each input for that task is saved as a record in the task_runs table which references the task record in the tasks table.

Should we move that into mongo? or just the task run's data?

This value is updated from ongoing to completed when the number of task run input redundancy is met.

Nuance to clarify: task should have redundancy number of verified runs to be marked as complete, not total runs.

Maybe we need developers to implement their own importers for every project?

I think we should distinguish importing documents into Moonsheep (which is handled now by importers for various sources; BTW it would be good to port importers for maps from Amnesty and add ftp/http listings) from defining tasks that will be executed on those documents. Importers can be easily reused between projects. Tasks, in our complex scenarios, will probably have to be written from scratch.

When importing tasks, we need a mechanism that will split that tasks into n amount of sub task records in the task table. This splitting rule could be defined as annotations in the model? When we run the importer it will have to interpret those rules?

I would say it's optional. Either while importing we can create multiple tasks (ie. for asset declarations it could be: transcribe section A, transcribe section B,... ) or create one task per imported documents and let division happen after such task (name all sections in the declaration) is verified.

Splitting while importing could be annotated at the model level and document importer could have an option to choose model that the document corresponds to.

How is the connection made between the task importer and the model? Maybe it's all one importer/model plugin?

In my opinion: Various document importers as defined above should go to PyBossa core, but custom tasks dedicated to a model can be easily packed in one plugin (ie. support for hungarian asset declarations).

If a task is split and persisted as n tasks records in the tasks table, then it seems like we might have to introduce a new column in that table in order to keep track of subtasks that form the one task they form.

I've referenced to that in https://github.com/TransparenCEE/moonsheep/issues/82#issuecomment-316029187

Re merging: should we push data to structured storage as soon as they are verified (I'm leaning towards that) should we push data only after the whole "imported document"(do we want to keep such notion?) is verified [then we should keep child-parent relationships between tasks]

georgeslabreche commented 7 years ago

On moving to mongo: I have a lot of conflicting feelings about this. I want to do this but I feel that moving current task loading and serving logic into mongo is a project in itself! So much current mechanism is coupled with the relational schema. I definitely think we need to move to NoSQL with this but the work load involved may make this too ambitious if we also want to implement the new features we want. It seems we have to choose between a data model refactoring project or a new feature project on top of the current data model.

task should have redundancy number of verified runs to be marked as complete, not total runs.

This may require too much refactoring. Would we be OK with MVP having redundancy as just a limit of how many times we are willing to try to have the task verified?

KrzysztofMadejski commented 7 years ago

It seems we have to choose between a data model refactoring project or a new feature project on top of the current data model.

Let's go with new features then and save "Refactoring to Mongo" as a new separate epic in the icebox.

Would we be OK with MVP having redundancy as just a limit of how many times we are willing to try to have the task verified?

Let's go for it and save "task should have redundancy number of verified runs to be marked as complete" as a separate issue.

georgeslabreche commented 7 years ago

Ok!

themoonsheep / moonsheep

As a developer, I want a mechanism that will allow me to persist a task as multiple tasks. #83