shaigue / pmi_masking

This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.
MIT License

Create PMI masking vocabulary for RedPajama #11

Open shaigue opened 1 year ago

shaigue commented 1 year ago

https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

This will be the real scalability test :)

I estimate that ~5 TB of disk space could be enough.
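
For context on why disk space is the concern here: a PMI vocabulary is built from n-gram counts, and the count tables grow with the corpus. Below is a minimal sketch of plain bigram PMI on an in-memory token list. It is an illustration only, not this repository's implementation: the function name `bigram_pmi` and the toy sentence are hypothetical, and a real run over RedPajama would need longer n-grams, frequency thresholds, and on-disk counting.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent token pair in `tokens`.

    Hypothetical helper for illustration; uses plain bigram PMI:
    log( p(w1, w2) / (p(w1) * p(w2)) ).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus (an assumption for the example, not real data).
tokens = "new york is a city and new york is big".split()
scores = bigram_pmi(tokens)

# Pairs sorted from highest to lowest PMI.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Both `Counter` objects here live in memory; at the scale of RedPajama the same counts would have to be sharded or spilled to disk, which is what the disk estimate above is about.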