Closed virendrakabra14 closed 5 months ago
Hi, I ran these scripts but did not get the .jsonl files I expected. What does your output look like after processing? Is the output data stored somewhere else?
Is the output data stored somewhere else?
The clusters are stored in .clusters.parquet files.
I know that. I want to know the location of the text data after fuzzy deduplication. I don't see a clear result yet.
I want to know the location of the text data after fuzzy-deduplicating
I think the expected use is to filter out those duplicate doc IDs from the original set.
So I need extra code to get the deduplicated text data. How do you proceed with your data cleaning after running these scripts?
The files output by this script contain IDs of docs that are part of some cluster. So ignore these docs (except when doc ID == cluster ID, so you retain one doc per cluster).
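A minimal sketch of that filtering step, for illustration only. It assumes the rows of the .clusters.parquet files have already been loaded into dicts (for example via pandas.read_parquet), and that each row has a document ID column named "id" next to "cluster_id"; the "id" column name and both helper functions are hypothetical, so check them against your own files.

```python
import json


def select_ids_to_drop(cluster_rows):
    """IDs of docs that belong to a cluster but are not its representative.

    Assumes each row is a dict with "id" and "cluster_id" keys
    (the "id" column name is an assumption, not confirmed by the script).
    """
    return {row["id"] for row in cluster_rows if row["id"] != row["cluster_id"]}


def filter_jsonl(lines, drop_ids, id_key="id"):
    """Yield only the JSONL lines whose document ID is not in drop_ids."""
    for line in lines:
        if json.loads(line)[id_key] not in drop_ids:
            yield line


# Toy data: docs a, b, c form one cluster whose cluster ID is "a".
cluster_rows = [
    {"id": "a", "cluster_id": "a"},
    {"id": "b", "cluster_id": "a"},
    {"id": "c", "cluster_id": "a"},
]
docs = [json.dumps({"id": i, "text": f"doc {i}"}) for i in ["a", "b", "c", "d"]]

drop_ids = select_ids_to_drop(cluster_rows)
kept_ids = [json.loads(line)["id"] for line in filter_jsonl(docs, drop_ids)]
```

Doc "d" never appears in a cluster file, so it is kept as-is; from the cluster only "a" survives.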
This is exactly right: once you have run the LSH step, you end up with files that contain the cluster IDs. You can then use those to retain only one sample from each cluster. One way is to just keep the document with doc ID == cluster ID, but in principle you can also retain a different document, e.g. the one with the best quality, using the quality signals in RPv2 to select the best document within each cluster.
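A rough sketch of the quality-based alternative, under the same assumptions as before: cluster rows are dicts with "id" and "cluster_id" keys (the "id" name is assumed). The `quality` mapping is hypothetical; which RPv2 quality signal goes into it, and how it is computed, is up to you.

```python
def pick_best_per_cluster(cluster_rows, quality):
    """Return (keep, drop) ID sets, keeping the highest-scoring doc per cluster.

    `quality` maps doc ID -> score (higher is better). Docs without a score
    are treated as worst. Ties keep the first doc seen.
    """
    best = {}  # cluster_id -> (score, doc_id)
    for row in cluster_rows:
        cid, doc_id = row["cluster_id"], row["id"]
        score = quality.get(doc_id, float("-inf"))
        if cid not in best or score > best[cid][0]:
            best[cid] = (score, doc_id)
    keep = {doc_id for _, doc_id in best.values()}
    drop = {row["id"] for row in cluster_rows} - keep
    return keep, drop


# Toy data: two clusters with two docs each.
rows = [
    {"id": "a", "cluster_id": "a"},
    {"id": "b", "cluster_id": "a"},
    {"id": "c", "cluster_id": "c"},
    {"id": "d", "cluster_id": "c"},
]
scores = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.1}

keep, drop = pick_best_per_cluster(rows, scores)
```

Here "b" replaces "a" as the survivor of the first cluster because it scores higher, while "c" stays the survivor of the second.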
Thanks, I resolved my problem.
I was going over the code and ran it to dedupe 2 shards.
The output parquet files have far fewer rows than the input min-hash files (e.g. 26k -> 300). What are these rows? They don't look like just the duplicates within/across the 2 shards, as some cluster_ids occur only once across the 2 shard-cluster files that the LSH script output.