Great work! Any plans to upload the full set of 10-Q textual data to Hugging Face, just like what you did with the 10-K filings? If so, that would be great! Thanks!
Hi @hohoCode. To be honest, no. The 10-K corpus was originally collected (that is actually why this open-source project started) so we could build NLP embeddings and unlabeled datasets for some other models at the time.
May I ask what your end goal would be if you had a big 10-Q corpus? If you have a research direction in mind, I could maybe offer some advice; I have a few ideas that I cannot chase myself right now.
Just a simple idea: have both 10-K and 10-Q text available for large-model finetuning/pretraining, similar to your embedding idea but for LLMs, if possible. Would love to hear your ideas too, if any. Thanks!
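For anyone landing here with the same idea, a minimal sketch of what that finetuning data prep could look like, assuming the filings were published as a Hugging Face dataset; the dataset id `user/sec-filings-corpus` and the `text` column below are placeholders, not an actual release:

```python
# Sketch: preparing SEC filing text for causal-LM finetuning.
# NOTE: the dataset id and the "text" column are placeholders; swap in
# the real identifiers once a 10-K/10-Q corpus is actually published.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stream so the full set of filings never has to fit in memory.
filings = load_dataset("user/sec-filings-corpus", split="train", streaming=True)

def tokenize(batch):
    # Truncate each filing to the model's context window.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = filings.map(tokenize, batched=True)
```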
Ah, yeah. While this is not exactly what I had in mind, sure, I could chat with you. Feel free to message me on LinkedIn. I'm closing this issue for now!