
john-friedman / datamule-python

A package to work with SEC data. Incorporates datamule endpoints.

MIT License

72 stars, 7 forks

html #17

Open firmai opened 2 days ago

firmai commented 2 days ago

I was wondering whether there is a way to keep the HTML instead of wiping all of it during extraction. For 10-Ks, for example, it would be nice to know which parts are tables, lists, headings, etc. That would preserve the HTML tag information and probably some information about hierarchical relationships.

There is probably also some benefit in returning the row_id: if the text ends up in a vector database, which is what most of the use cases are for, one would like to point back to where in the filing the text came from.

It would be awesome to get more hierarchy out of the dataset somehow, à la these guys: https://github.com/alphanome-ai/sec-parser
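To make the request concrete, here is a minimal sketch of the kind of output being asked for: each text block keeps its HTML tag, a sequential `row_id`, and the path of headings above it. This is not datamule's API and not how sec-parser works. `TaggedTextExtractor`, `extract_rows`, and the tag sets are hypothetical names chosen for illustration, and the sketch uses only Python's standard-library `html.parser`. It assumes headings are marked up as `h1` to `h6`, which real filings often do not do, and it ignores text outside the listed block tags.

```python
from html.parser import HTMLParser

# Hypothetical tag sets for illustration; real filings need many more cases.
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = HEADINGS | {"p", "li", "td", "th"}


class TaggedTextExtractor(HTMLParser):
    """Collect text blocks with their tag, a row_id, and the heading path."""

    def __init__(self):
        super().__init__()
        self.rows = []       # output records, one per non-empty text block
        self._open = []      # stack of currently open block tags
        self._buf = []       # text pieces seen since the last flush
        self._headings = []  # stack of (level, text) for enclosing headings

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKS:
            self._flush()  # emit any text belonging to the parent block
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._flush()
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        if self._open:  # text outside any block tag is ignored in this sketch
            self._buf.append(data)

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        self._buf = []
        if not text or not self._open:
            return
        tag = self._open[-1]
        if tag in HEADINGS:
            level = int(tag[1])
            # A new heading closes any headings at the same or deeper level.
            while self._headings and self._headings[-1][0] >= level:
                self._headings.pop()
        path = [heading for _, heading in self._headings]
        if tag in HEADINGS:
            self._headings.append((level, text))
        self.rows.append(
            {"row_id": len(self.rows), "tag": tag, "text": text, "path": path}
        )


def extract_rows(html):
    """Return a list of dicts with keys row_id, tag, text, and path."""
    parser = TaggedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Stored alongside each embedding, the `row_id` and `path` fields would let a retrieval hit be traced back to its position and section in the filing.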

john-friedman commented 2 days ago

Hi @firmai, I'll be implementing table parsing in a future advanced parser.

It's a bit frustrating because I've known and prototyped how to extract nested hierarchy + tables from html, pdf, and txt filings since September, but I can't justify the time expenditure (~2 months) without a revenue/funding source.

firmai commented 1 day ago

> Hi @firmai, I'll be implementing table parsing in a future advanced parser.
>
> It's a bit frustrating because I've known and prototyped how to extract nested hierarchy + tables from html, pdf, and txt filings since September, but I can't justify the time expenditure (~2 months) without a revenue/funding source.

Why don't you call for sponsorship on LinkedIn? I will share it as far as I can, and can also contribute by January. Also, if you need some help on this, I'll be happy to set up MS Teams to collaborate. I think this would be a really good addition to your open source package.

john-friedman commented 1 day ago

Hey @firmai - that's a great suggestion! Let me think about it and get back to you.