Once the full text-reuse clusters have been generated and everything works as intended with passim v1, it would be worth running the same detection with the new Python version (v2), because:
The Python version (v2) is more recent and actively maintained, while v1 is old and no longer maintained.
Staying within Python, rather than adding dependencies on Java, Spark, Scala, etc., is simpler for the project's long-term sustainability.
The Python version does not require the initial boilerplate-detection step, which could make the process considerably faster.
Depending on the results, it may therefore make sense to switch to the Python version in the medium to long term.
The action points are:
[ ] Recompute the text reuse with the updated version (v2)
[ ] Compute statistics & visualizations to compare the results with those of the old version on the new data
[ ] Optionally, change the approach for future text-reuse processing and/or look into how to match the Scala version's performance with the Python version
[ ] Document the process for using the Python version
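
As a starting point for the comparison statistics, here is a minimal sketch. It assumes the output of each version has already been loaded into a plain mapping of passage id to cluster id; the function names and toy data are hypothetical and do not reflect either version's actual output format.

```python
from collections import Counter
from statistics import mean, median


def cluster_summary(assignments):
    """Summarise a mapping of passage id -> cluster id (assumed input shape)."""
    # Count how many passages fall into each cluster.
    sizes = Counter(assignments.values())
    return {
        "n_passages": len(assignments),
        "n_clusters": len(sizes),
        "mean_size": mean(sizes.values()) if sizes else 0.0,
        "median_size": median(sizes.values()) if sizes else 0.0,
        "max_size": max(sizes.values(), default=0),
    }


def passage_overlap(old, new):
    """Jaccard overlap between the sets of passages each version clustered."""
    old_ids, new_ids = set(old), set(new)
    union = old_ids | new_ids
    return len(old_ids & new_ids) / len(union) if union else 1.0


# Toy data standing in for the two runs (hypothetical ids).
v1 = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
v2 = {"a": 10, "b": 10, "c": 11, "d": 11, "f": 12}

if __name__ == "__main__":
    print("v1:", cluster_summary(v1))
    print("v2:", cluster_summary(v2))
    print("passage overlap:", round(passage_overlap(v1, v2), 3))
```

Cluster ids are not expected to match across versions, so this only compares size distributions and which passages were clustered at all; a pairwise co-clustering comparison would be a natural next step.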