Closed DerekGeng closed 6 years ago
Derek -
Thanks for pointing this out! Unfortunately, this is a bit of a larger issue that may take a bit longer to fix.
The problem is that the filing you pointed out dates from the transition period from text-only filings to HTML. As a result, while it appears to be an HTML file, it is really a pure text filing underneath, lacking the structure of more recent filings that edgarWebR uses for parsing and detection.
Making this particular filing parse successfully requires a few new features -
None of this is hard; it will just take some time. On the upside, it will mean the ability to parse filings in text, something that has been on the todo list.
I'll do some exploration later this week and see when I think it could be done.
Thanks and sorry this won't be a quick fix!
Hi, Micah
Thx for pointing out the issue and giving some useful insights. I also think parsing a pure text file should not be difficult, as your package already has the logic behind it. The only tricky thing, as you mentioned, is to detect the html-wrapped text filing and convert it back to plain text. Good luck with the fix, and looking forward to an updated version!
Thx in advance!
Regards Derek
Some implementation notes for future reference...
Looks like detection can be based on the contents of either '//body' or '//text'. Html-wrapped text only has 'pre', 'hr' and 'title' nodes. There are text nodes too, but they contain only line returns and spaces (they may need to be removed first).
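The detection rule in the note above can be sketched roughly as follows. This is only an illustration in Python using the standard library's `html.parser`, not edgarWebR's actual R implementation; the names `WrappedTextDetector` and `is_wrapped_text` are hypothetical, and ignoring nested `html`/`head`/`body` wrappers is an assumption made for the sketch.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"pre", "hr", "title"}   # node types named in the note above
CONTAINERS = {"body", "text"}           # stands in for '//body' or '//text'
STRUCTURAL = {"html", "head", "body", "text"}  # assumption: wrappers are ignored


class WrappedTextDetector(HTMLParser):
    """Records element names and stray text found inside the container."""

    def __init__(self):
        super().__init__()
        self.container = None     # container tag we are currently inside
        self.skip_depth = 0       # > 0 while inside <pre> or <title>
        self.tags = set()         # element names seen inside the container
        self.stray_text = False   # non-whitespace text outside <pre>/<title>

    def handle_starttag(self, tag, attrs):
        if self.container is None:
            if tag in CONTAINERS:
                self.container = tag
            return
        if tag in STRUCTURAL:
            return
        self.tags.add(tag)
        if tag in ("pre", "title"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.container is None:
            return
        if tag == self.container:
            self.container = None
        elif tag in ("pre", "title") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Whitespace-only text nodes are expected and ignored.
        if self.container and not self.skip_depth and data.strip():
            self.stray_text = True


def is_wrapped_text(html: str) -> bool:
    """True if the container holds only pre/hr/title nodes plus whitespace."""
    detector = WrappedTextDetector()
    detector.feed(html)
    detector.close()
    return ("pre" in detector.tags
            and detector.tags <= ALLOWED_TAGS
            and not detector.stray_text)
```

Real text filings may carry other tag-like markup inside the 'pre' block, which this sketch would count as extra nodes, so a production check would likely need to be more forgiving.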
This issue has layers...
I've got the parsing working, but now the table of contents is throwing things off.
Work continues...
Thx for the update, good luck!
Cheers Derek
On Tue, 12 Dec 2017 at 23:04, Micah J Waldstein wrote: "This issue has layers... I've got the parsing working, but now the table of contents is throwing things off. Work continues..."
I believe the newest version in git works as expected with this filing. Try it out and let me know if it is working for you.
This is in version 0.3.0, just pushed to CRAN.
Thx for the effort and update!
I will let you know.
Cheers Derek
From: Micah J Waldstein. Sent: Friday, 22 December 2017 21:39. To: mwaldstein/edgarWebR. Cc: DerekGeng; Author. Subject: Re: [mwaldstein/edgarWebR] Make parse_filing function support html-wrapped text filings (#3)
In version 0.3.0 just pushed to CRAN
Hi, Micah. I found another issue. As I understand it, the parse_filing function splits the content mainly on parent nodes such as , but it cannot parse child nodes such as , so the item and part cannot be recognized correctly. The ideal solution would be to parse all nodes (including child nodes) and make the parse function as loose as possible; otherwise we could miss quite a lot of information.
Here is the example: https://www.sec.gov/Archives/edgar/data/1424844/000092290708000774/form10k_122308.htm
thanks in advance!
Regards Derek
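One way to picture the "parse all nodes" suggestion above: gather text from every node at any depth, join text split across inline children, and only then look for item and part headings. The sketch below is a Python illustration using the standard library, not how parse_filing works; `LooseTextCollector`, `find_headings`, the `BLOCK_TAGS` set and the heading pattern are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags end a line of text; everything else is inline.
BLOCK_TAGS = {"p", "div", "table", "tr", "td", "th", "li", "br", "hr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: headings look like "Part II" or "Item 1A".
HEADING = re.compile(r"^(part\s+[ivx]+|item\s+\d+[a-z]?)\b", re.IGNORECASE)


class LooseTextCollector(HTMLParser):
    """Collects one text line per block, however deeply the text is nested."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = []

    def _flush(self):
        # Join buffered fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Text from child nodes (font, b, span, ...) lands in the same buffer.
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def find_headings(html: str) -> list:
    """Return the collected lines that start with a part or item label."""
    collector = LooseTextCollector()
    collector.feed(html)
    collector.close()
    return [line for line in collector.lines if HEADING.match(line)]
```

Because fragments are merged before matching, a heading written as `<b>Item</b> 1.` inside a table cell is still seen as a single line.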