Make parse_filing function support html-wrapped text filings

Hi Micah, I found another issue in the parse_filing function. As I understand it, it splits the content mainly based on parent nodes; however, it cannot parse child nodes, so the item and part cannot be recognized correctly. The ideal solution would be to parse all nodes (including child nodes) to make the parse function as loose as possible; otherwise we could miss quite a lot of information.

Here is the example: https://www.sec.gov/Archives/edgar/data/1424844/000092290708000774/form10k_122308.htm

thanks in advance!

Regards Derek

Derek -

Thanks for pointing this out! Unfortunately, this is a bit of a larger issue that may take a bit longer to fix.

The problem is that the filing you pointed out is from the transition period from text-only filings to html. As a result, while it appears to be an html file, it is really a pure text filing underneath, lacking the structure of more recent filings that edgarWebR uses for parsing and detection.

Making this particular filing parse successfully requires a few new features -

Parsing of text filings - The code was architected to support this, but it hasn't been implemented yet.
Detection of html-wrapped text filings - I'm not sure how common this particular format of filing is, so it requires some method of detection. I think I can check for the 'pre' tag and have it be successful.
Conversion of html-wrapped text to plain text - So it can be handed to the appropriate parser.
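As a rough illustration of the third item (conversion of html-wrapped text to plain text), here is a minimal sketch in Python using only the standard library. It is not edgarWebR's code (the package is written in R); the names `PreTextExtractor` and `wrapped_html_to_text` are hypothetical, and it assumes the filing text lives entirely inside 'pre' elements.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <pre> elements."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._pre_depth = 0   # number of <pre> elements currently open
        self.chunks = []      # text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._pre_depth > 0:
            self._pre_depth -= 1

    def handle_data(self, data):
        # Keep only text inside <pre>; titles and stray whitespace are skipped
        if self._pre_depth > 0:
            self.chunks.append(data)


def wrapped_html_to_text(html):
    """Return the concatenated <pre> contents of an html-wrapped text filing."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

The resulting string could then be handed to whatever plain-text parser is available, which is the point of the conversion step described above.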

None of this is hard; it will just take some time. On the upside, it will mean the ability to parse filings in text, something that has been on the todo list.

I'll do some exploration and see about when I think it could be done later this week.

Thanks and sorry this won't be a quick fix!

Hi, Micah

Thx for explaining the issue and giving some useful insights. I also think parsing pure text files should not be difficult, as your package already has the logic behind it. The only tricky thing, as you mentioned, is to detect the html-wrapped text filing and convert it back to plain text. Good luck with the fix, and looking forward to an updated version!

Thx in advance!

Regards Derek

Some implementation notes for future reference...

Looks like detection can be based on looking at contents of either '//body' or '//text'. html-wrapped text only has 'pre', 'hr' and 'title' nodes. There are text nodes, but they are all line returns/spaces (may need to be removed first).
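To make the detection idea concrete, here is a minimal sketch in Python (standard library only). It is an illustration under assumptions, not the edgarWebR implementation: the names `WrappedTextDetector` and `is_html_wrapped_text` are hypothetical, it only looks inside 'body' (not '//text'), and it ignores any markup nested inside 'pre'.

```python
from html.parser import HTMLParser

ALLOWED = {"pre", "hr", "title"}   # node types named in the notes above
TEXT_HOLDERS = {"pre", "title"}    # elements that are expected to hold text


class WrappedTextDetector(HTMLParser):
    """Hypothetical helper: checks that <body> holds only pre/hr/title nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.holder_depth = 0      # open <pre>/<title> elements inside body
        self.saw_pre = False
        self.looks_wrapped = True

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if not self.in_body:
            return
        # Any other element directly in the body means a "real" html filing
        if self.holder_depth == 0 and tag not in ALLOWED:
            self.looks_wrapped = False
        if tag in TEXT_HOLDERS:
            self.holder_depth += 1
            self.saw_pre = self.saw_pre or tag == "pre"

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in TEXT_HOLDERS and self.holder_depth > 0:
            self.holder_depth -= 1

    def handle_data(self, data):
        # Text nodes outside pre/title must be only line returns or spaces
        if self.in_body and self.holder_depth == 0 and data.strip():
            self.looks_wrapped = False


def is_html_wrapped_text(html):
    detector = WrappedTextDetector()
    detector.feed(html)
    detector.close()
    return detector.looks_wrapped and detector.saw_pre
```

Whitespace-only text nodes are simply skipped here rather than removed first, which has the same effect for the purpose of the check.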

This issue has layers...

I've got the parsing working, but now the table of contents is throwing things off.

Work continues...

Thx for the update, good luck!

Cheers Derek

I believe the newest version in git works as expected with this filing. Try it out and let me know if it is working for you.

This is in version 0.3.0, just pushed to CRAN.

Thx for the effort and update!

I will let you know.

Cheers Derek

mwaldstein / edgarWebR

Make parse_filing function support html-wrapped text filings #3