smutt / icann-dl

A script to find and download PDF publications from www.icann.org
GNU General Public License v3.0

Proposal: two scripts, one to generate text files from the PDFs and another to create a TSV file, compatible with https://cloud.google.com/storage-transfer/docs/create-url-list, from all or a subset of the files. #9

Open kakooch opened 1 year ago

kakooch commented 1 year ago

I have used these scripts to fine-tune BLOOM and have also generated ada embeddings for all of the documents, linked to all RFCs. I am working on a query interface as we speak. Later I will add scripts that scrape the mailing list archives and embed and index them properly. I will share the rest ASAP.
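
For readers unfamiliar with the URL list format mentioned in the title, here is a minimal sketch of how such a TSV file could be produced. It is not the proposed script itself. The function name `build_url_list`, the `pdf_dir` and `base_url` parameters, and the assumption that the PDFs are already on disk and publicly served under `base_url` are all hypothetical. The layout follows the documented format: a `TsvHttpData-1.0` header line, then one tab-separated row per file with the URL, the size in bytes, and the base64-encoded MD5, with rows sorted by URL.

```python
import base64
import hashlib
from pathlib import Path
from urllib.parse import quote


def build_url_list(pdf_dir, base_url):
    """Return TsvHttpData-1.0 text covering every *.pdf under pdf_dir.

    pdf_dir and base_url are illustrative parameters: the files are assumed
    to be reachable at base_url + their path relative to pdf_dir.
    """
    rows = []
    for path in Path(pdf_dir).rglob("*.pdf"):
        data = path.read_bytes()
        # Storage Transfer Service expects the MD5 digest base64-encoded.
        md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
        rel = path.relative_to(pdf_dir).as_posix()
        # Percent-encode the relative path; "/" is left untouched.
        url = base_url.rstrip("/") + "/" + quote(rel)
        rows.append((url, len(data), md5_b64))

    # Rows are sorted by URL so the list is in lexicographical order.
    rows.sort()

    lines = ["TsvHttpData-1.0"]
    lines += [f"{url}\t{size}\t{md5}" for url, size, md5 in rows]
    return "\n".join(lines) + "\n"
```

To cover only a subset of the files, the same function could be pointed at a subdirectory, or the `rglob` pattern could be narrowed. The returned text can then be written to a file and hosted wherever the transfer job can read it.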