impresso / impresso-passim

This repository contains code and sample data related to running the impresso corpus through the text reuse detection software passim.

Prepare and launch text-reuse detection with Passim #5

Open piconti opened 6 months ago

piconti commented 6 months ago

Text-reuse detection with Passim version 1 takes place in multiple steps.

Some of these steps are already done, others are in progress. An important one is adapting a large part of the code, which was previously run with dask-kubernetes, to run on runai. This means creating a Docker image and scripts that perform the various steps; a sketch of what such a script could look like follows.
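As an illustration only, here is a minimal Python sketch of a driver script that wraps the passim invocation. The environment-variable names (PASSIM_INPUT, PASSIM_OUTPUT, PASSIM_SCHEMA, SPARK_LOCAL_DIR) and their defaults are hypothetical; the Spark settings and passim flags mirror the commands quoted later in this issue.

```python
"""Hypothetical driver script for a containerized passim run.

Environment-variable names and defaults are illustrative assumptions;
the Spark and passim options mirror the commands recorded in this issue.
"""
import os
import subprocess


def run_passim(input_glob: str, output_dir: str, workers: int = 25) -> None:
    # passim picks up Spark settings from SPARK_SUBMIT_ARGS.
    env = dict(os.environ)
    env["SPARK_SUBMIT_ARGS"] = (
        f"--master local[{workers}] "
        "--driver-memory 200G --executor-memory 200G "
        f"--conf spark.local.dir={env.get('SPARK_LOCAL_DIR', '/tmp/spark-tmp')}"
    )
    subprocess.run(
        [
            "passim",
            "--schema-path", env.get("PASSIM_SCHEMA", "passim.schema"),
            "--output-format", "json",
            input_glob,
            output_dir,
        ],
        env=env,
        check=True,
    )


if __name__ == "__main__":
    run_passim(os.environ["PASSIM_INPUT"], os.environ["PASSIM_OUTPUT"])
```

Keeping the Spark parameters in the job environment rather than hard-coded would let the same image serve differently sized runai jobs.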

piconti commented 5 months ago

It is not clear exactly which command was used for the original detection. Two text-reuse detection runs were therefore launched, with different configurations:

1st configuration (similar to the one for boilerplate):

```
SPARK_SUBMIT_ARGS='--master local[25] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim -w 4 --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --fields 'date(date) as day' --pairwise --output-format json --filterpairs 'day < day2 AND datediff(day2,day) < 32 AND gid = gid2 AND uid <> uid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output/"
```

2nd configuration launched (matches the one from 2019 currently on S3):

```
SPARK_SUBMIT_ARGS='--master local[30] --driver-memory 200G --executor-memory 200G --conf spark.local.dir=/scratch/piconti/impresso/spark-tmp/' passim --schema-path="/home/piconti/impresso-passim/sample_data/passim.schema" --output-format json --filterpairs 'gid < gid2' "/scratch/piconti/impresso/text_reuse/rebuilt_data/*.jsonl.bz2" "/scratch/piconti/impresso/text_reuse/passim_output_conf2/"
```
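For intuition, the two `--filterpairs` predicates restrict which candidate document pairs passim aligns: the first keeps pairs from the same group (`gid`, presumably the newspaper) published within roughly a month of each other, while the second keeps only cross-group pairs. A rough Python equivalent of the two SQL predicates, where the field names come from the commands above and the function signatures are purely illustrative:

```python
from datetime import date


def keep_pair_conf1(uid1: str, gid1: str, day1: date,
                    uid2: str, gid2: str, day2: date) -> bool:
    """Config 1: same group (gid), distinct documents (uid), and the
    second document published between 1 and 31 days after the first."""
    return (day1 < day2
            and (day2 - day1).days < 32
            and gid1 == gid2
            and uid1 != uid2)


def keep_pair_conf2(gid1: str, gid2: str) -> bool:
    """Config 2 (2019-style): only pairs across different groups; the
    strict ordering also ensures each unordered pair is considered once."""
    return gid1 < gid2
```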

The results of both runs will be post-processed and compared, to see which matches the previous results most closely and/or performs best.
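A minimal sketch of such a comparison, assuming both runs produce passim's clustered JSON output under an `out.json/` directory with `cluster` and `id` fields per record (the directory layout and field names are assumptions; the first run's `--pairwise` alignments could be compared in the same spirit):

```python
import glob
import json
from collections import defaultdict


def load_clusters(output_dir: str) -> dict:
    """Map each cluster id to the set of document ids it contains.
    The out.json/*.json layout and the 'cluster'/'id' field names are
    assumptions about passim's JSON cluster output."""
    clusters = defaultdict(set)
    for path in glob.glob(f"{output_dir}/out.json/*.json"):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                clusters[rec["cluster"]].add(rec["id"])
    return clusters


conf1 = load_clusters("/scratch/piconti/impresso/text_reuse/passim_output")
conf2 = load_clusters("/scratch/piconti/impresso/text_reuse/passim_output_conf2")

# Coarse comparison: number of clusters, and how many documents are
# covered by at least one cluster in each run.
docs1 = set().union(*conf1.values()) if conf1 else set()
docs2 = set().union(*conf2.values()) if conf2 else set()
print(f"conf1: {len(conf1)} clusters, {len(docs1)} clustered documents")
print(f"conf2: {len(conf2)} clusters, {len(docs2)} clustered documents")
print(f"documents clustered in both runs: {len(docs1 & docs2)}")
```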

piconti commented 3 days ago

The Python version of passim has not been tried yet. It could potentially be used for the next release, where another approach will probably need to be devised anyway, given the very large amount of new data coming in.
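Purely as an illustration of what "another approach" could look like (a hypothetical design, not a decision), the corpus could be sharded into overlapping time windows so that each passim run stays bounded as new data arrives:

```python
# Hypothetical sharding scheme, not part of the current pipeline.
def year_windows(first: int, last: int, span: int = 5, overlap: int = 1):
    """Yield (start_year, end_year) windows that overlap by `overlap`
    years, so reuse spanning a window boundary is not missed entirely."""
    start = first
    while start <= last:
        yield start, min(start + span - 1, last)
        start += span - overlap


# Illustrative usage: one passim run per window over the rebuilt data.
for start, end in year_windows(1800, 2000):
    print(f"would launch passim on issues from {start}-{end}")
```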