Open firmai opened 2 days ago
Hi @firmai, I'll be implementing table parsing in a future advanced parser.
It's a bit frustrating because I've known and prototyped how to extract a nested hierarchy plus tables from HTML, PDF, and TXT filings since September, but I can't justify the time expenditure (~2 months) without a revenue or funding source.
Why don't you put out a call for sponsorship on LinkedIn? I will share it as widely as I can, and I can also contribute by January. Also, if you need some help on this, I'd be happy to set up an MS Teams call to collaborate. I think this would be a really good addition to your open source package.
Hey @firmai - that's a great suggestion! Let me think about it and get back to you.
I was wondering whether there is a way to keep some of the HTML instead of wiping all of it during extraction. For 10-Ks, for example, it would be nice to know which parts are tables, lists, headings, etc. That would preserve the HTML tag information and probably some information about hierarchical relationships.
There is probably also some benefit in exposing a row_id. If the text ends up in a vector database, which is what most of the use cases are for, one would want to point back to where in the filing the text came from.
It would be awesome to get more hierarchy out of the dataset somehow, à la these guys: https://github.com/alphanome-ai/sec-parser
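
To make the request a bit more concrete, here is a rough sketch of the kind of output I have in mind: one record per text block, with the originating tag, a coarse element type, a running row_id, and the nearest preceding heading as a crude hierarchy hint. This is not this package's API or sec-parser's. It only uses Python's standard `html.parser`, and `TaggedTextExtractor`, `BLOCK_TAGS`, `extract_rows`, and the field names are all made up for illustration. Real filings have nested and messy markup that this does not handle.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to coarse element types.
BLOCK_TAGS = {
    "h1": "heading", "h2": "heading", "h3": "heading",
    "p": "paragraph", "li": "list_item",
    "th": "table_cell", "td": "table_cell",
}


class TaggedTextExtractor(HTMLParser):
    """Collects text blocks and remembers which tag each one came from."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per extracted text block
        self._stack = []      # block tags that are currently open
        self._buffer = []     # text pieces for the innermost open block
        self._heading = None  # text of the most recent heading seen

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(tag)
            self._buffer = []

    def handle_data(self, data):
        # Text outside any tracked block tag is ignored in this sketch.
        if self._stack:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if not self._stack or self._stack[-1] != tag:
            return
        self._stack.pop()
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        self._buffer = []
        if not text:
            return
        kind = BLOCK_TAGS[tag]
        self.rows.append({
            "row_id": len(self.rows),  # stable pointer back into the filing
            "tag": tag,
            "type": kind,
            "parent_heading": None if kind == "heading" else self._heading,
            "text": text,
        })
        if kind == "heading":
            self._heading = text


def extract_rows(html):
    """Parse an HTML string and return the list of tagged rows."""
    parser = TaggedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Tiny made-up fragment, not taken from a real filing.
sample = """
<h1>Item 1. Business</h1>
<p>We sell widgets.</p>
<ul><li>Widget A</li><li>Widget B</li></ul>
<table><tr><th>Year</th><td>2023</td></tr></table>
"""

rows = extract_rows(sample)
for row in rows:
    print(row)
```

Each row could then be stored in a vector database with `row_id`, `tag`, and `parent_heading` as metadata, so a retrieved chunk can be traced back to its position and section in the filing.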