Closed by nkandpa2 5 months ago
This is cool, but I'm not sure the ruling that posting transcripts is fair use implies that training on the transcripts is fair use or that the data itself is permissively licensed.
Since we are speculating about the license, let's close this for now and revisit if we find out the data has a satisfactory license.
Every quarter, publicly traded companies hold an earnings call where they discuss the financial results of the quarter. According to this ruling, publication of earnings call transcripts is considered fair use since there's little copyrightable material in an earnings call.
However, the transcripts do seem to be a good source of high-quality text. For instance, see Apple's 2024 Q1 earnings call. A transcript contains quite a bit of factual information about a company, real-world events that impacted the company, financial information, and also contains Q&A dialogue.
Some napkin math: transcripts seem to average about 10K tokens (which checks out, since a call is usually a 30-60 minute phone call), companies hold them 4 times per year, and there are around 5k-10k publicly traded companies with transcripts on the popular finance sites. Collecting 10 years' worth of these transcripts would give 2-4B tokens of high-quality text.
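For anyone who wants to rerun the estimate with different assumptions, here is a minimal sketch of the arithmetic above. The function name `estimate_tokens` is made up for illustration, and the inputs are just the rough figures from this comment, not measured values.

```python
def estimate_tokens(tokens_per_transcript, calls_per_year, num_companies, years):
    """Back-of-the-envelope token count: one transcript per call per company."""
    return tokens_per_transcript * calls_per_year * num_companies * years


# Rough inputs from the estimate above (assumptions, not measurements).
TOKENS_PER_TRANSCRIPT = 10_000  # ~10K tokens for a 30-60 minute call
CALLS_PER_YEAR = 4              # one earnings call per quarter
YEARS = 10

low = estimate_tokens(TOKENS_PER_TRANSCRIPT, CALLS_PER_YEAR, 5_000, YEARS)
high = estimate_tokens(TOKENS_PER_TRANSCRIPT, CALLS_PER_YEAR, 10_000, YEARS)

print(f"{low / 1e9:.0f}B to {high / 1e9:.0f}B tokens")
```

The estimate scales linearly in every input, so if the real average turns out to be 5K tokens per transcript, the range halves to 1-2B.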
Work would be needed to figure out the best way to collect this data. Also, it looks like some of these transcripts will be in the SEC data (#31), but I have no clue whether that covers all of them or just some one-off transcripts.