Importing from weird text formats

ropensci / unconf16

rOpenSci's San Francisco hackathon/unconf 2016

http://unconf16.ropensci.org


Importing from weird text formats #20

Open leeper opened 8 years ago

leeper commented 8 years ago

A substantial proportion of questions on StackOverflow are about how to read in data from weird text formats that aren't covered by the usual functionality. Sometimes these are just fixed-width files that users aren't familiar with or a slightly malformed TSV, but other times they're things like one of various flavors of markdown table, MediaWiki tables, or something else.

Lots of data is stored in these kinds of formats (e.g., on Wikipedia) but is locked up by the difficult-to-parse format.

Can we invent some functionality for parsing these formats and turning them into a data.frame?
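For the simplest case, a Markdown pipe table, something along these lines might already be enough. This is only a rough base-R sketch: `read_pipe_table()` is a hypothetical helper, not a function from any existing package, and it assumes a well-formed table with a header row, no escaped pipes inside cells, and the same number of cells in every row.

```r
# Hypothetical helper (not from any package): parse a Markdown pipe table,
# given as a character vector of lines, into a data.frame.
read_pipe_table <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # Drop the delimiter row (e.g. |-----|----:|): only pipes, dashes,
  # colons and spaces, with at least one dash.
  is_rule <- grepl("^\\|?[ :|-]+\\|?$", lines) & grepl("-", lines, fixed = TRUE)
  lines <- lines[!is_rule]

  # Strip the outer pipes, then split the cells on the remaining pipes.
  lines <- sub("^\\|", "", sub("\\|$", "", lines))
  cells <- lapply(strsplit(lines, "|", fixed = TRUE), trimws)

  # First remaining row is the header; the rest is the body.
  body <- do.call(rbind, cells[-1])
  df <- as.data.frame(body, stringsAsFactors = FALSE)
  names(df) <- cells[[1]]

  # Guess column types (numbers become numeric, text stays character).
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

md <- c(
  "| name  | score |",
  "|-------|------:|",
  "| alice |    90 |",
  "| bob   |    85 |"
)
tbl <- read_pipe_table(md)
str(tbl)
```

Anything less regular (grid tables, MediaWiki markup, cells containing pipes) would need a real parser rather than string splitting.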

noamross commented 8 years ago

Most of these formats are handled by pandoc. A relatively easy way to handle this task may be to use a pandoc wrapper to convert the document to pandoc's native JSON format, and then write a function to import and convert that JSON. It may be even less work to convert to HTML and use rvest or something similar to import the HTML tables.
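The HTML route could look roughly like the sketch below. It assumes a `pandoc` binary on the PATH and the rvest package installed; `read_tables_via_pandoc()` is a hypothetical name used only for illustration, and the `from` argument is passed straight to pandoc's `-f` flag.

```r
library(rvest)  # assumed installed; a `pandoc` binary is assumed to be on the PATH

# Hypothetical helper: convert a document to HTML with pandoc, then let
# rvest extract every <table> element as a data frame.
read_tables_via_pandoc <- function(path, from = "markdown") {
  html_file <- tempfile(fileext = ".html")
  on.exit(unlink(html_file))

  status <- system2(
    "pandoc",
    c("-f", from, "-t", "html", shQuote(path), "-o", shQuote(html_file))
  )
  if (status != 0) stop("pandoc failed with exit status ", status)

  # html_table() on a whole document returns a list, one element per table.
  html_table(read_html(html_file))
}

# Small self-contained input to try it on.
src <- tempfile(fileext = ".md")
writeLines(c(
  "| name  | score |",
  "|-------|------:|",
  "| alice |    90 |",
  "| bob   |    85 |"
), src)

tables <- read_tables_via_pandoc(src)
tables[[1]]
```

Swapping `from` for another pandoc input format (for example `"mediawiki"` or `"org"`) is the only change needed to try other sources, though how good the resulting tables are depends on pandoc's reader for that format.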

richfitz commented 8 years ago

Jeroen has wrapped a commonmark parser, which would avoid the (potentially lossy) HTML transition for Markdown tables: https://github.com/jeroenooms/commonmark

Possibly one can get something like a parse tree out of pandoc too, I don't know.

sckott commented 8 years ago

@leeper Since there are so many different formats, would it make sense to have a suite of recipes (scripts) for parsing weird/odd formats, some using pkg X and others pkg Y + Z, rather than a single pkg? A pkg could be spread very thin b/c of its many different dependencies (some possibly heavy, like pandoc).

leeper commented 8 years ago

@sckott I think that's a great idea! If we had one go-to place to show strategies for reading in data, that could be really useful. Maybe it's even just creating some StackOverflow r-faqs with clear and somewhat general tutorials.

noamross commented 8 years ago

I needed this, so I tried my suggestion above and built a package that wraps pandoc to convert formats to HTML and then imports the result via rvest::html_table: https://github.com/noamross/texttable

Unfortunately, pandoc's table readers are wonky for a lot of formats, so you don't get good HTML tables in the output at the moment. But it works for markdown, docx, org-mode, textile, and some others.

@daattali You might be interested in this, too.

daattali commented 8 years ago

Thanks @noamross