sebastian-c / overflow

An R package to assist people answering R questions on Stack Overflow

Sister function to readSO to allow reading of fixed width data posted at SO #17

Open mrdwab opened 11 years ago

mrdwab commented 11 years ago

Sometimes data is pasted to Stack Overflow in a fixed width format like the following:

NYC1       1/1/2013      2/1/2013      Open
NYC1       2/2/2013      2/3/2013      Closed for Inspection
Boston1    1/1/2013      2/5/2013      Open

readSO cannot correctly read such data because of the spaces in the fourth column.
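
A quick way to see the problem, as a minimal sketch in base R: splitting each line on runs of whitespace gives a different number of tokens per row, so a whitespace-delimited reader has no consistent column count to work with. The variable names are illustrative only, and the check does not call readSO itself.

```r
# The three sample rows from above, as they would be pasted from a question
lines <- c(
  "NYC1       1/1/2013      2/1/2013      Open",
  "NYC1       2/2/2013      2/3/2013      Closed for Inspection",
  "Boston1    1/1/2013      2/5/2013      Open"
)

# Count whitespace-separated tokens in each row
n_fields <- lengths(strsplit(trimws(lines), "\\s+"))
print(n_fields)
```

The second row yields six tokens rather than four, because "Closed for Inspection" is split at its internal spaces.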

readSOfwf() (a Gist of the rough concept is linked here) attempts to address this problem in a somewhat crude way.

This won't work, though, with cases like the data shared in this question. In that example, it will simply return a single column.

sebastian-c commented 11 years ago

I can't quite see the purpose of the dropFirst argument. More precisely, I can't see why you'd want to drop that line specifically. I like the idea of the function, and it looks to me like it could be integrated into the regular readSO function with a type parameter (which must override sep, I suppose).

mrdwab commented 11 years ago

The dropFirst argument is to drop the first column. This is useful if the first column is just the row names. The way this function is currently written (that is, because of how it tries to identify where the columns are), row names (which are usually just the row numbers) are seen as another column. From the test cases, you would be able to see its effect on the third and fourth cases (which are more appropriate to readSO anyway).

I could definitely integrate this into the regular readSO with a little bit of work, adding a type argument where fixed or fwf can be specified.

Actually... perhaps sep = "fwf" can be an option?

sebastian-c commented 11 years ago

Sure, something like:

if (sep == "fwf") {
    # placeholder for the fixed-width reader
    readfwf()
} else {
    # placeholder for the existing readSO behaviour
    readasnormal()
}

mrdwab commented 11 years ago

Looking at the function again, its performance is not up to par. There are a lot of cases where printed data includes only one space between columns, so the current approach doesn't work very reliably.

I'll try something with maybe a dummy matrix of where there are spaces, and any "columns" of zeroes for spaces that match the number of rows in the dataset can be assumed to be the position for a column break. It'll probably be at least a few days before I revisit this though.

sebastian-c commented 11 years ago

Where are the performance hits? It looks to me like you could condense a lot of the regexes and, given that many of them can match in only one spot at the beginning of the string, replace gsub with sub (although, if I remember correctly, that's not as much of a performance improvement as the one between gregexpr and regexpr).
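
A small, hedged illustration of the sub/gsub point (the pattern and input are made up for this example and are not taken from readSOfwf): when a pattern is anchored so that it can match in only one place, sub and gsub return the same result, so the single-match function is sufficient.

```r
x <- c("   1 2000 a", "  10 2000 b")

# Pattern anchored at the start: at most one match per string
trimmed_sub  <- sub("^\\s+", "", x)
trimmed_gsub <- gsub("^\\s+", "", x)
print(identical(trimmed_sub, trimmed_gsub))

# Unanchored pattern: the two functions differ
first_only <- sub("\\s+", "_", "a  b  c")   # replaces the first run only
every_run  <- gsub("\\s+", "_", "a  b  c")  # replaces every run
print(c(first_only, every_run))
```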

mrdwab commented 11 years ago

Sorry, I didn't communicate the problem clearly. "Performance" here isn't about efficiency, particularly since these functions (readSO, readSOfwf) are, somewhat by design, only meant to tackle small problems. By poor "performance" I mean "not getting the desired result".

Take the following example (which is not unreasonable for R printout):

0 2000 val ues
1 2000 valuess
2 2000       b
3 2000       d

The way readSOfwf is currently written, this will all be seen as a single column.
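
The space-matrix idea floated earlier in the thread can be sketched in base R as below. This illustrates the concept and is not the code in readSOfwf: the function name guess_fwf and its min_gap argument are hypothetical. Each line is padded to a common width, a logical matrix marks where the spaces are, and any position that is blank in every row is treated as a candidate column break.

```r
# Hypothetical helper: guess fixed-width column breaks from positions
# that contain a space in every row. min_gap is an assumed tuning knob:
# blank runs narrower than this are not treated as column breaks.
guess_fwf <- function(lines, min_gap = 1) {
  padded <- format(lines)                        # pad to a common width
  chars <- do.call(rbind, strsplit(padded, ""))  # rows x width matrix
  blank_col <- colSums(chars == " ") == nrow(chars)

  # Ignore blank runs that are narrower than min_gap
  runs <- rle(blank_col)
  runs$values[runs$values & runs$lengths < min_gap] <- FALSE
  is_gap <- inverse.rle(runs)

  # Each run of non-gap positions is one field
  fields <- rle(!is_gap)
  ends <- cumsum(fields$lengths)
  starts <- ends - fields$lengths + 1
  keep <- fields$values

  out <- mapply(function(s, e) trimws(substring(padded, s, e)),
                starts[keep], ends[keep], SIMPLIFY = FALSE)
  names(out) <- paste0("V", seq_along(out))
  as.data.frame(out, stringsAsFactors = FALSE)
}

# The example above
ex <- c("0 2000 val ues",
        "1 2000 valuess",
        "2 2000       b",
        "3 2000       d")
res <- guess_fwf(ex)
print(res)

# The sample from the start of the issue
nyc <- c("NYC1       1/1/2013      2/1/2013      Open",
         "NYC1       2/2/2013      2/3/2013      Closed for Inspection",
         "Boston1    1/1/2013      2/5/2013      Open")
res_nyc <- guess_fwf(nyc, min_gap = 2)
print(res_nyc)
```

On the example above, the default min_gap = 1 yields three columns and keeps "val ues" intact, because the position of its space is occupied by a letter in the next row. The heuristic is not foolproof: on the NYC/Boston sample, single-space breaks would also split "Closed for Inspection", since the shorter rows are blank at those positions, so min_gap = 2 is needed there. All columns come back as character; converting types is left out of this sketch.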

sebastian-c commented 11 years ago

You could try inferring the number of fields from the header. In your case, I think it's not a good idea to have spaces in the header, so I'd edit the question to remove all spaces from headers.
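
One way to read the header suggestion, sketched in base R under stated assumptions: take the number of fields from a space-free header and let any surplus tokens in a row fold into the last column. The function name split_by_header, the header line in the usage example, and the assumption that only the final column contains spaces are all hypothetical and not part of the package.

```r
# Hypothetical helper: the header fixes the number of fields; extra
# tokens in a data row are assumed to belong to the last column.
# Assumes every data row has at least as many tokens as the header.
split_by_header <- function(lines) {
  header <- strsplit(trimws(lines[1]), "\\s+")[[1]]
  n <- length(header)
  rows <- lapply(lines[-1], function(line) {
    tokens <- strsplit(trimws(line), "\\s+")[[1]]
    if (length(tokens) > n) {
      # Fold surplus tokens back into the last field
      tokens <- c(tokens[seq_len(n - 1)],
                  paste(tokens[n:length(tokens)], collapse = " "))
    }
    tokens
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- header
  out
}

# The header line is invented for illustration; the data rows are the
# sample from the start of the issue.
txt <- c("Site       Start         End           Status",
         "NYC1       1/1/2013      2/1/2013      Open",
         "NYC1       2/2/2013      2/3/2013      Closed for Inspection",
         "Boston1    1/1/2013      2/5/2013      Open")
parsed <- split_by_header(txt)
print(parsed)
```

This only helps when the column containing spaces is the last one, and it collapses any repeated spaces inside that column to a single space.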