It is not clear exactly which command should have been used for the actual detection. Two text-reuse detections were launched, with different configurations:
1st configuration (similar to the one for boilerplate):
```bash
SPARK_SUBMIT_ARGS='--master local[25] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim -w 4 --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --fields 'date(date) as day' --pairwise --output-format json --filterpairs 'day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output/"
```
2nd configuration launched (matches the one from 2019 currently on S3):
```bash
SPARK_SUBMIT_ARGS='--master local[30] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --output-format json --filterpairs 'gid < gid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output_conf2/"
```
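For context, the two `--filterpairs` predicates select quite different candidate pairs: the first one only aligns documents within the same group (`gid = gid2`) published less than 32 days apart, while the second compares documents across different groups only (`gid < gid2`), with no time window. A plain-Python rendering of the two predicates, for illustration only (passim evaluates them as Spark SQL expressions over the pair columns):

```python
from datetime import date

def config1_keep(day: date, day2: date, gid: str, gid2: str,
                 uid: str, uid2: str) -> bool:
    """1st configuration: same group, different documents, second
    issue published less than 32 days after the first."""
    return (day < day2 and (day2 - day).days < 32
            and gid == gid2 and uid != uid2)

def config2_keep(gid: str, gid2: str) -> bool:
    """2nd configuration: different groups only, ordered so each
    pair is considered once; no time window."""
    return gid < gid2
```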
The results of both runs will be post-processed and compared to determine which best matches the previous results and/or produces the better output.
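As a minimal sketch of what such a comparison could look like, the snippet below compares two runs by the document pairs they co-cluster. It assumes each run's output directory contains uncompressed JSON-lines part files under `out.json/` with `id` and `cluster` fields; the directory layout and field names are assumptions to be checked against the actual passim output (the first run, launched with `--pairwise`, additionally writes pairwise alignments not used here):

```python
import json
import os
from collections import defaultdict
from glob import glob

def load_clusters(output_dir):
    """Map passage id -> cluster id from a passim output directory.
    Assumes uncompressed JSON-lines part files under out.json/."""
    clusters = {}
    for path in glob(os.path.join(output_dir, "out.json", "part-*")):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                clusters[rec["id"]] = rec["cluster"]
    return clusters

def pair_jaccard(run_a, run_b):
    """Jaccard overlap of the co-clustered pairs produced by two runs."""
    def co_clustered_pairs(mapping):
        by_cluster = defaultdict(set)
        for doc, cl in mapping.items():
            by_cluster[cl].add(doc)
        # All unordered pairs within each cluster; fine for a sketch,
        # but quadratic in cluster size.
        return {(x, y) for docs in by_cluster.values()
                for x in docs for y in docs if x < y}
    pa, pb = co_clustered_pairs(run_a), co_clustered_pairs(run_b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 1.0

conf1 = load_clusters("/scratch/piconti/impresso/text_reuse/passim_output")
conf2 = load_clusters("/scratch/piconti/impresso/text_reuse/passim_output_conf2")
print(f"pair-level Jaccard: {pair_jaccard(conf1, conf2):.3f}")
```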
The Python version has not been tried yet. This could potentially be done for the next release, where another approach will probably need to be devised anyway, given the very large amount of new data coming in.
Text-reuse detection with Passim version 1 takes place in multiple steps.
Some of these steps have already been completed; others are in progress. An important one is adapting much of the code that was previously run with dask-kubernetes so that it runs on runai. This means creating a Docker image and scripts that perform the various steps.
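As a rough illustration of what one of those step scripts could look like inside the image, here is a hypothetical sketch; the paths, resource settings, and the `run_passim` helper are all invented for illustration and would need to match the actual setup:

```python
#!/usr/bin/env python3
"""Hypothetical entry point for the passim step of the pipeline,
meant to run inside a Docker container on runai. All paths and
resource settings below are placeholders."""
import os
import subprocess

def run_passim(input_glob: str, output_dir: str, schema_path: str,
               cores: int = 30, memory: str = "200G") -> None:
    # Pass Spark settings to passim the same way as in the commands above.
    env = dict(os.environ)
    env["SPARK_SUBMIT_ARGS"] = (
        f"--master local[{cores}] "
        f"--driver-memory {memory} --executor-memory {memory} "
        "--conf spark.local.dir=/scratch/spark-tmp/"
    )
    subprocess.run(
        ["passim",
         f"--schema-path={schema_path}",
         "--output-format", "json",
         "--filterpairs", "gid < gid2",
         input_glob,
         output_dir],
        env=env,
        check=True,  # fail the job if passim exits non-zero
    )

if __name__ == "__main__":
    run_passim(
        "/scratch/impresso/text_reuse/rebuilt_data/*.jsonl.bz2",
        "/scratch/impresso/text_reuse/passim_output/",
        "/opt/impresso-passim/passim.schema",
    )
```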