impresso / impresso-passim

This repository contains code and sample data related to running the impresso corpus through the text reuse detection software passim.

Prepare and launch text-reuse detection with Passim #5

Open piconti opened 6 months ago

piconti commented 6 months ago

Text-reuse detection with Passim version 1 takes place in multiple steps.

Some of these steps are already done, others are in progress. An important one is adapting a large part of the code, which was previously run with dask-kubernetes, to run on runai. This means creating a Docker image and scripts that perform the various steps; a sketch of what such a script could look like follows.
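As an illustration only, here is a minimal Python sketch of a driver script that wraps the passim invocation. The environment-variable names (PASSIM_INPUT, PASSIM_OUTPUT, PASSIM_SCHEMA, SPARK_LOCAL_DIR) and their defaults are hypothetical; the Spark settings and passim flags mirror the commands quoted later in this issue.

```python
"""Hypothetical driver script for a containerized passim run.

Environment-variable names and defaults are illustrative assumptions;
the Spark and passim options mirror the commands recorded in this issue.
"""
import os
import subprocess


def run_passim(input_glob: str, output_dir: str, workers: int = 25) -> None:
    # passim picks up Spark settings from SPARK_SUBMIT_ARGS.
    env = dict(os.environ)
    env["SPARK_SUBMIT_ARGS"] = (
        f"--master local[{workers}] "
        "--driver-memory 200G --executor-memory 200G "
        f"--conf spark.local.dir={env.get('SPARK_LOCAL_DIR', '/tmp/spark-tmp')}"
    )
    subprocess.run(
        [
            "passim",
            "--schema-path", env.get("PASSIM_SCHEMA", "passim.schema"),
            "--output-format", "json",
            input_glob,
            output_dir,
        ],
        env=env,
        check=True,
    )


if __name__ == "__main__":
    run_passim(os.environ["PASSIM_INPUT"], os.environ["PASSIM_OUTPUT"])
```

Keeping the Spark parameters in the job environment rather than hard-coded would let the same image serve differently sized runai jobs.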

piconti commented 5 months ago

It is not clear exactly which command was used for the original detection. Two text-reuse detection runs were therefore launched, with different configurations:

1st configuration (similar to the one for boilerplate):

```
SPARK_SUBMIT_ARGS='--master local[25] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim -w 4 --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --fields 'date(date) as day' --pairwise --output-format json --filterpairs 'day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output/"
```

2nd configuration launched (matches the one from 2019 currently on S3):

```
SPARK_SUBMIT_ARGS='--master local[30] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --output-format json --filterpairs 'gid < gid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output_conf2/"
```
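For intuition, the two `--filterpairs` predicates restrict which candidate document pairs passim aligns: the first keeps pairs from the same group (`gid`, presumably the newspaper) published within roughly a month of each other, while the second keeps only cross-group pairs. A rough Python equivalent of the two SQL predicates, where the field names come from the commands above and the function signatures are purely illustrative:

```python
from datetime import date


def keep_pair_conf1(uid1: str, gid1: str, day1: date,
                    uid2: str, gid2: str, day2: date) -> bool:
    """Config 1: same group (gid), distinct documents (uid), and the
    second document published between 1 and 31 days after the first."""
    return (day1 < day2
            and (day2 - day1).days < 32
            and gid1 == gid2
            and uid1 != uid2)


def keep_pair_conf2(gid1: str, gid2: str) -> bool:
    """Config 2 (2019-style): only pairs across different groups; the
    strict ordering also ensures each unordered pair is considered once."""
    return gid1 < gid2
```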

The results of both runs will be post-processed and compared, to see which matches the previous results most closely and/or performs best.
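A minimal sketch of such a comparison, assuming both runs produce passim's clustered JSON output under an `out.json/` directory with `cluster` and `id` fields per record (the directory layout and field names are assumptions; the first run's `--pairwise` alignments could be compared in the same spirit):

```python
import glob
import json
from collections import defaultdict


def load_clusters(output_dir: str) -> dict:
    """Map each cluster id to the set of document ids it contains.
    The out.json/*.json layout and the 'cluster'/'id' field names are
    assumptions about passim's JSON cluster output."""
    clusters = defaultdict(set)
    for path in glob.glob(f"{output_dir}/out.json/*.json"):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                clusters[rec["cluster"]].add(rec["id"])
    return clusters


conf1 = load_clusters("/scratch/piconti/impresso/text_reuse/passim_output")
conf2 = load_clusters("/scratch/piconti/impresso/text_reuse/passim_output_conf2")

# Coarse comparison: number of clusters, and how many documents are
# covered by at least one cluster in each run.
docs1 = set().union(*conf1.values()) if conf1 else set()
docs2 = set().union(*conf2.values()) if conf2 else set()
print(f"conf1: {len(conf1)} clusters, {len(docs1)} clustered documents")
print(f"conf2: {len(conf2)} clusters, {len(docs2)} clustered documents")
print(f"documents clustered in both runs: {len(docs1 & docs2)}")
```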

piconti commented 3 days ago

The Python version of passim has not been tried yet. It could potentially be used for the next release, where another approach will probably need to be devised anyway, given the very large amount of new data coming in.
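Purely as an illustration of what "another approach" could look like (a hypothetical design, not a decision), the corpus could be sharded into overlapping time windows so that each passim run stays bounded as new data arrives:

```python
# Hypothetical sharding scheme, not part of the current pipeline.
def year_windows(first: int, last: int, span: int = 5, overlap: int = 1):
    """Yield (start_year, end_year) windows that overlap by `overlap`
    years, so reuse spanning a window boundary is not missed entirely."""
    start = first
    while start <= last:
        yield start, min(start + span - 1, last)
        start += span - overlap


# Illustrative usage: one passim run per window over the rebuilt data.
for start, end in year_windows(1800, 2000):
    print(f"would launch passim on issues from {start}-{end}")
```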