togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

What is the output of `run_lsh.py`? #96

Closed virendrakabra14 closed 5 months ago

virendrakabra14 commented 5 months ago

I was going over the code and ran it to dedupe 2 shards.

The output parquet files have far fewer rows than the input min-hash files (e.g., 26k -> 300). What are these rows? They don't appear to be just the duplicates within/across the 2 shards, since some cluster_ids occurred only once across the 2 shard clusters that the LSH script output.

Jdemon233 commented 4 months ago

Hi, I have run this script, but I don't get the .jsonl files I expected. I want to ask about your results after processing. Is the output data stored somewhere else?

virendrakabra14 commented 4 months ago

> Is the output data stored somewhere else?

The clusters are stored in .clusters.parquet files.

Jdemon233 commented 4 months ago

> Is the output data stored somewhere else?
>
> The clusters are stored in .clusters.parquet files.

I know that. I want to know the location of the text data after fuzzy deduplication. I don't see a clear result yet.

virendrakabra14 commented 4 months ago

> I want to know the location of the text data after fuzzy deduplication

I think the expected use is to filter out those duplicate doc IDs from the original set.

Jdemon233 commented 4 months ago

> I want to know the location of the text data after fuzzy deduplication
>
> I think the expected use is to filter out those duplicate doc IDs from the original set.

So I need extra code to get the deduplicated text data. How do you proceed with data cleaning after running this script?

virendrakabra14 commented 4 months ago

The files output by this script contain IDs of docs that are part of some cluster. So ignore these docs (except when doc ID == cluster ID, so you retain one doc per cluster).
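A minimal sketch of that filtering step, assuming (hypothetically) that the `.clusters.parquet` rows map a `doc_id` to a `cluster_id` and that the original corpus is a list of records keyed by `doc_id`; the actual column names in the script's output may differ:

```python
import json

# Hypothetical cluster records as they might be read from a
# .clusters.parquet file: each row maps a doc_id to its cluster_id.
clusters = [
    {"doc_id": "a", "cluster_id": "a"},  # representative (doc ID == cluster ID)
    {"doc_id": "b", "cluster_id": "a"},  # duplicate of "a"
    {"doc_id": "c", "cluster_id": "c"},  # representative of its own cluster
]

# IDs to drop: every clustered doc except the one whose ID equals
# the cluster ID, so exactly one doc per cluster is retained.
drop_ids = {r["doc_id"] for r in clusters if r["doc_id"] != r["cluster_id"]}

# Toy corpus; "d" never appeared in any cluster, so it is kept as-is.
docs = [{"doc_id": d, "text": f"text-{d}"} for d in ["a", "b", "c", "d"]]
deduped = [d for d in docs if d["doc_id"] not in drop_ids]
print([d["doc_id"] for d in deduped])  # ['a', 'c', 'd']
```

In practice you would read the parquet files with e.g. pyarrow and stream the original shards, but the keep/drop logic is the same.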

mauriceweber commented 4 months ago

> The files output by this script contain IDs of docs that are part of some cluster. So ignore these docs (except when doc ID == cluster ID, so you retain one doc per cluster).

This is exactly right -- once you run the LSH step, you end up with files that contain the cluster IDs. You can then use those to retain only one sample from each cluster. One way is to just keep the document with doc ID == cluster ID, but in principle you can also retain the document with, e.g., the best quality, using the quality signals in RPv2 to select the best document within each cluster.
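The quality-based variant could be sketched like this; the `quality` scores here are hypothetical placeholders standing in for whatever RPv2 quality signal you choose to rank by:

```python
# Hypothetical rows joining cluster assignments with a per-document
# quality score (e.g., derived from the RPv2 quality signals).
rows = [
    {"doc_id": "a", "cluster_id": "a", "quality": 0.2},
    {"doc_id": "b", "cluster_id": "a", "quality": 0.9},
    {"doc_id": "c", "cluster_id": "c", "quality": 0.5},
]

# Within each cluster, keep the document with the highest quality score
# instead of defaulting to the one with doc ID == cluster ID.
best = {}
for r in rows:
    cur = best.get(r["cluster_id"])
    if cur is None or r["quality"] > cur["quality"]:
        best[r["cluster_id"]] = r

keep_ids = {r["doc_id"] for r in best.values()}
print(sorted(keep_ids))  # ['b', 'c']
```

Note that with this strategy the retained doc ID no longer matches the cluster ID, so the downstream filter must keep `keep_ids` rather than checking doc ID == cluster ID.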

Jdemon233 commented 4 months ago

> The files output by this script contain IDs of docs that are part of some cluster. So ignore these docs (except when doc ID == cluster ID, so you retain one doc per cluster).

Thanks, that resolved my problem.