rom1504 closed this 1 year ago
300k samples per shard before dedup, which means 1.5T samples before dedup.
Out of 130k WAT files: 40B samples before dedup, 8B after dedup.
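A quick sanity check of the figures above, as a small arithmetic sketch. It only uses the numbers quoted in this comment; the variable names are illustrative, and the "implied shard count" is simply what the two pre-dedup figures work out to, not a number taken from the pipeline itself.

```python
# Figures quoted in the comment (pre-dedup unless noted).
samples_per_shard = 300_000
total_prededup = 1_500_000_000_000      # 1.5T
wat_files = 130_000
before_dedup = 40_000_000_000           # 40B
after_dedup = 8_000_000_000             # 8B

# Shard count implied by 1.5T total at 300k samples per shard.
implied_shards = total_prededup // samples_per_shard

# Average pre-dedup samples per WAT file in the 130k run.
avg_per_wat = before_dedup / wat_files

# Fraction of samples surviving dedup, and the reduction factor.
kept_fraction = after_dedup / before_dedup
reduction_factor = before_dedup / after_dedup

print(f"implied shards: {implied_shards:,}")
print(f"avg samples per WAT: {avg_per_wat:,.0f}")
print(f"kept after dedup: {kept_fraction:.0%} ({reduction_factor:.0f}x reduction)")
```

So the 130k run averages roughly 300k samples per WAT file, and dedup keeps about one sample in five.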
10% of CC: 500 WAT in 5 parts: 6B per part, 1TB; total 20B, 3.2TB.
Took 6h for processing and 3h for dedup. Dedup speed could be improved (see #14).
Will add more numbers to the README later.
Running at full scale requires 95TB and produces 5M files,
which then need dedup and repartitioning.
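For illustration only, here is a minimal in-memory sketch of what "dedup and repartitioning" could look like. The comment does not say how the dedup is actually done, so everything here is an assumption: the `(url, text)` record shape, the hypothetical function name `dedup_and_repartition`, and the choice of hashing the whole record. A real run at this scale would need a distributed engine rather than Python dicts.

```python
import hashlib
from collections import defaultdict


def dedup_and_repartition(shards, num_partitions):
    """Drop exact duplicate records and spread the rest over num_partitions.

    shards: iterable of iterables of (url, text) tuples (assumed record shape).
    Returns a list of num_partitions lists of unique records.
    """
    # One dict per output partition, keyed by record hash.
    partitions = defaultdict(dict)
    for shard in shards:
        for url, text in shard:
            digest = hashlib.md5(f"{url}\t{text}".encode("utf-8")).digest()
            # Identical records hash to the same partition, so duplicates
            # always meet in one place and can be dropped locally.
            idx = int.from_bytes(digest[:8], "big") % num_partitions
            partitions[idx].setdefault(digest, (url, text))
    return [list(partitions[i].values()) for i in range(num_partitions)]


if __name__ == "__main__":
    shards = [
        [("http://a", "x"), ("http://b", "y")],
        [("http://a", "x"), ("http://c", "z")],
    ]
    out = dedup_and_repartition(shards, num_partitions=4)
    print(sum(len(p) for p in out), "unique records in", len(out), "partitions")
```

Partitioning by the same hash used for dedup means each partition can be deduplicated independently, which is the property that makes this kind of step parallelizable.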