As we keep adding more documents for new projects, the repository is likely to
get too big.
To mitigate this, we can periodically scrub un-annotated documents for projects
that are not active anymore. For some projects 200 documents are added to the
repo but only a handful are annotated, so the others could be removed. The
repo's history would have to be rewritten as well for this to actually reduce
the repository size. Also, we probably want to aim for a few active projects
with clear goals rather than a constellation of little projects with very few
annotations each.
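As a rough illustration of the scrubbing step, selecting the documents to remove
could look like the sketch below. It assumes each document has a unique
identifier and that the annotated identifiers are known for the project. The
function name documents_to_scrub and the identifiers are hypothetical and not
part of labelrepo.

```python
def documents_to_scrub(all_doc_ids, annotated_doc_ids, project_is_active):
    """Return the ids of documents that could be removed from a project.

    Nothing is proposed for removal while the project is still active.
    """
    if project_is_active:
        return []
    annotated = set(annotated_doc_ids)
    # Keep the original order so the result is easy to review.
    return [doc_id for doc_id in all_doc_ids if doc_id not in annotated]


# Hypothetical inactive project: 200 documents, 3 of them annotated.
all_docs = [f"doc_{i}" for i in range(200)]
annotated = ["doc_4", "doc_17", "doc_42"]
to_remove = documents_to_scrub(all_docs, annotated, project_is_active=False)
print(len(to_remove))  # 197
```

Deleting these documents in a new commit would not shrink the repository by
itself, since the old versions stay in the history.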
If this is not sufficient to keep the repository size reasonable, we could, as
discussed IRL with @Remi-Gau, use git submodules and store each project in a
separate repository. Each repository would be small, and annotators could clone
only the repository containing the project they are working on.
The downside is that users or contributors who want the parts of the
repository that are independent from any project, such as the labelrepo package
or the code and data to build the (jupyterbook) documentation, would need to
use the git submodule commands, which adds some friction.
Having the full repository (as it is now) is necessary for running analyses on
the annotations, and even for annotating in the case of the
participant_demographics project, because in that project annotating is made
much easier by the watch_participants.py script in the /scripts/ directory,
which relies on labelrepo. labelrepo could be distributed from PyPI rather
than with the annotations, but it is useless without the annotations repo.
Keeping it here means it is always in sync with the rest of the repo, and
installing it in editable mode provides a convenient way to find the location of
the repo in the filesystem without the user having to pass it on the command
line, export an environment variable, or run the scripts from a specific working
directory.
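To illustrate the editable-mode point, here is a minimal sketch of how a package
whose source files live inside the checkout could locate the repository. It
assumes the repository root can be recognized by a marker such as a .git entry.
The function name find_repo_root is hypothetical, and this is not necessarily
how labelrepo does it.

```python
from pathlib import Path


def find_repo_root(start, marker=".git"):
    """Walk up from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        # exists() matches both a .git directory and a .git file.
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker!r} found above {start}")


# Inside an editable-installed package, __file__ points into the checkout,
# so a call such as find_repo_root(__file__) needs no input from the user.
```

With a regular (non-editable) install the package files would be copied into
site-packages, so the walk up from __file__ would not reach the repository.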