Open nkandpa2 opened 4 months ago
For most of the other sources I'm aware of, there are typically about 1/2 as many tokens as there are on-disk bytes of gzipped jsonl. In your case there are about 2x as many. Any idea where this discrepancy comes from?
This PR closes #32. The data from 2000-2023 for the agencies listed below has been collected and is on HF here. The total amount of data is around 2B tokens and 1.3GB on disk.
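As a rough sanity check on those totals, the tokens-per-byte ratio can be computed directly. This is only a sketch: it assumes 1 GB = 1e9 bytes and that the 1.3GB figure is the size of the gzipped jsonl, and it uses the ~0.5 tokens-per-byte rule of thumb quoted in the review comment as the baseline. The function and variable names are illustrative, not part of this PR.

```python
def tokens_per_byte(num_tokens: float, num_bytes: float) -> float:
    """Return the ratio of tokens to on-disk bytes."""
    return num_tokens / num_bytes


# Totals as reported in the PR description (approximate figures).
reported_tokens = 2e9   # "around 2B tokens"
reported_bytes = 1.3e9  # "1.3GB on disk", assuming 1 GB = 1e9 bytes

# Rule of thumb from the review comment: ~1/2 token per gzipped byte.
typical_ratio = 0.5

ratio = tokens_per_byte(reported_tokens, reported_bytes)
print(f"tokens per byte: {ratio:.2f}")                       # 1.54
print(f"relative to typical: {ratio / typical_ratio:.1f}x")  # 3.1x
```

Under these assumptions the dataset comes out at roughly 1.5 tokens per byte, about three times the usual ratio, so either the on-disk size or the token count is worth double-checking.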
Agencies: