inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io
Apache License 2.0
593 stars, 151 forks

Project Management #2117

Closed david-waterworth closed 2 years ago

david-waterworth commented 3 years ago

This is more a question of "best practice" than a feature request.

I'm annotating the sensor network of a large set of buildings. Each building has a large number of devices, and each device has a reasonable number (usually 10 or 20) of associated points. I'm labelling the points, and I've created a single XMI document per device: a few hundred per building, and I have ~200 buildings.

One way of managing this is one project per building. But that has the apparent downside that you end up with many clones of the same project (layers etc.), which will presumably get out of sync over time. On the other hand, loading all the documents into the same project is a bit challenging to manage.

I thought I saw an XML element indicating something about document collections in an XMI file. I'm not sure that it is used in INCEpTION, but that could be a way of grouping a set of documents into a common task.

Is there something I'm missing which could make this easier?

reckart commented 3 years ago

The present recommendation would indeed be to use multiple projects. That said, we are looking into making it more feasible to work with larger amounts of documents per project. What aspects do you find challenging to manage?

The element you saw probably comes from the DKPro Core metadata information which has different fields to store where a document came from/belongs to. It is not strongly related to INCEpTION, although INCEpTION may at times store information in that element to get DKPro Core writers to do what they should.

reckart commented 3 years ago

So in total... how many documents (devices) do you have? At least, which order of magnitude? 10s of 1000s, 100s of 1000s, millions, more?

david-waterworth commented 3 years ago

Probably tens of thousands. I think multiple projects is probably manageable when I think about it more. I'm not totally sure about the recommenders. I guess the internal recommenders won't share models between projects even if they have the same name? It looks like the external ones shouldn't either (the train endpoint accepts a list of documents and a user_id/project_id), but the fuzzy string matcher at least doesn't use the project_id at the moment, so it will keep overwriting the model if you switch from one project to another.
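
To illustrate the overwriting problem, here is a minimal sketch of an external recommender backend that keeps one model per project rather than a single shared one. The class name, method names, and the (project_id, layer, feature) key are all hypothetical and are not taken from the actual external recommender code; the "model" is just a lookup of the most frequent label per mention, standing in for a string matcher.

```python
from collections import Counter, defaultdict


class ProjectScopedModelStore:
    """Hypothetical store: one model per (project_id, layer, feature) key."""

    def __init__(self):
        self._models = {}

    def train(self, project_id, layer, feature, examples):
        # examples: iterable of (mention_text, label) pairs from one project
        counts = defaultdict(Counter)
        for text, label in examples:
            counts[text.lower()][label] += 1
        # Keep the most frequent label for each mention. Because the key
        # includes project_id, training project B does not replace the
        # model that was trained for project A.
        self._models[(project_id, layer, feature)] = {
            text: labels.most_common(1)[0][0] for text, labels in counts.items()
        }

    def predict(self, project_id, layer, feature, text):
        # Returns None if this project has no model or the mention is unknown.
        model = self._models.get((project_id, layer, feature), {})
        return model.get(text.lower())
```

If the key were only (layer, feature), a second call to train for another project would replace the first model, which is the behaviour described above.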

My biggest challenge, when I think about it, is ensuring that any external models use the same tokenisation as the XMI file.
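
One way to sidestep the mismatch is for the external model to reuse the token offsets already stored in the XMI instead of tokenising again. The sketch below does this with only the Python standard library. It rests on several assumptions: the function name is made up, the document has a single view (one Sofa element carrying the text in a sofaString attribute), and tokens are elements whose local name is Token with begin and end attributes. A real setup would more likely load the CAS together with its type system through a dedicated library.

```python
import xml.etree.ElementTree as ET


def read_tokens(xmi_text):
    """Return (begin, end, covered_text) for every Token element in an XMI string."""
    root = ET.fromstring(xmi_text)
    sofa_text = None
    spans = []
    for elem in root.iter():
        # Compare local names only, so the namespace prefix does not matter.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "Sofa":
            # Assumption: a single view whose text is in sofaString.
            sofa_text = elem.get("sofaString")
        elif local_name == "Token":
            spans.append((int(elem.get("begin")), int(elem.get("end"))))
    if sofa_text is None:
        raise ValueError("no Sofa element with a sofaString attribute found")
    spans.sort()
    # Caveat: UIMA offsets count UTF-16 code units, while Python indexes by
    # code point, so text with characters outside the BMP needs extra care.
    return [(begin, end, sofa_text[begin:end]) for begin, end in spans]
```

The external model can then predict one label per returned span, so its output lines up with the tokens in the annotation tool by construction.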