litmining / labelbuddy-annotations

Annotations of academic papers. The papers were likely gathered using pubget, and the annotations made with labelbuddy
https://litmining.github.io/labelbuddy-annotations/
MIT License

Managing repository size #30

Open jeromedockes opened 1 year ago

jeromedockes commented 1 year ago

As we keep adding more documents for new projects, the repository is likely to get too big.

To mitigate this, we can periodically scrub un-annotated documents for projects that are no longer active. For some projects, 200 documents are added to the repo but only a handful are annotated, so the rest could be removed. The repo's history would also have to be rewritten for this to actually reduce the repository size. Also, we probably want to aim for a few active projects with clear goals rather than a constellation of little projects with very few annotations each.
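As a rough illustration of the scrubbing step, here is a minimal Python sketch that separates annotated documents from removable ones. It assumes each document is a dict with a `text` field and that annotated documents can be identified by the MD5 checksum of their UTF-8 text. Both are assumptions about the file layout, and `text_checksum` and `split_by_annotation` are hypothetical helpers, not part of labelrepo.

```python
import hashlib


def text_checksum(text: str) -> str:
    """Return the MD5 hex digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def split_by_annotation(documents, annotated_checksums):
    """Split documents into (keep, removable).

    A document is kept when the checksum of its text (assumed key: "text")
    appears in `annotated_checksums`; otherwise it is a removal candidate.
    """
    keep, removable = [], []
    for doc in documents:
        if text_checksum(doc["text"]) in annotated_checksums:
            keep.append(doc)
        else:
            removable.append(doc)
    return keep, removable
```

This only selects candidates in the working tree. Deleting them would not shrink the repository until the history is rewritten as well.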

If this is not sufficient to keep the repository size reasonable, as discussed IRL with @Remi-Gau, we could use git submodules and store each project in a separate repository. Each repository would be small, and annotators could clone only the repository containing the project they are working on.

The downside is that users or contributors who want the parts of the repository that are independent of any project, such as the labelrepo package or the code and data to build the (jupyterbook) documentation, would need to use the git submodule commands, which adds some friction.

Having the full repository (as it is now) is necessary for running analyses on the annotations, and even for annotating in the case of the participant_demographics project: in that project, annotating is made much easier by the watch_participants.py script in the /scripts/ directory, which relies on labelrepo. labelrepo could be distributed from PyPI rather than with the annotations, but it is useless without the annotations repo. Keeping it here means it is always in sync with the rest of the repo, and installing it in editable mode provides a convenient way to find the location of the repo in the filesystem without the user having to pass it on the command line, export an environment variable, or run the scripts from a specific working directory.
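To illustrate why an editable install makes the repo easy to locate: the package's source files then live inside the checkout, so a module can walk up from its own path until it reaches the repository root. The following is a minimal sketch of that idea, not labelrepo's actual implementation. The function name `find_repo_root` and the use of a `.git` entry as the root marker are assumptions.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = ".git") -> Path:
    """Walk up from `start` and return the first directory containing `marker`.

    `marker` defaults to ".git", an assumed way of recognising the root.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no directory containing {marker!r} above {start}")
```

Inside a package installed in editable mode, a call such as `find_repo_root(Path(__file__))` would return the checkout directory regardless of the current working directory. With a regular (non-editable) install the module would live in site-packages, and the lookup would fail.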