nlpaueb / edgar-crawler

The only open-source toolkit that can download EDGAR financial reports and extract textual data from specific item sections into nice and clean JSON files.
GNU General Public License v3.0
298 stars 83 forks

Great work! Any plans to upload a complete set of 10-Q textual data to Hugging Face (as you did for 10-K)? #26

Closed hohoCode closed 21 hours ago

hohoCode commented 23 hours ago

Great work! Any plans to upload a complete set of 10-Q textual data to Hugging Face (as you did for 10-K)?

If so, that would be great! Thanks!

eloukas commented 23 hours ago

Hi @hohoCode. To be honest, no. The 10-K corpus was originally collected (which is also why this open-source project started) to build NLP embeddings and unlabeled datasets for use in other models at the time.

May I ask what your end goal would be if you had a big 10-Q corpus? (If you have a research direction in mind, I could maybe offer some advice. I have some ideas that I cannot pursue myself right now.)

hohoCode commented 22 hours ago

Just a simple idea: use both 10-K and 10-Q filings for fine-tuning/pretraining large models, similar to your embedding idea but for LLMs, if possible. Would love to hear your ideas too, if any. Thanks!

eloukas commented 21 hours ago

Ah, yeah. While this is not exactly what I had in mind, sure, I could chat with you. Feel free to message me on LinkedIn. I'm closing this issue for now!