Open mrdwab opened 11 years ago
I can't quite see the purpose of the dropFirst argument. More precisely, I can't see why you'd want to drop that line specifically.
I like the idea of the function, and it looks to me like it could be integrated into the regular readSO function with a type parameter (which must override sep, I suppose).
The dropFirst argument is to drop the first column. This is useful if the first column is just the row names. The way this function is currently written (that is, because of how it tries to identify where the columns are), row names (which are usually just the row numbers) are seen as another column. From the test cases, you would be able to see its effect on the third and fourth cases (which are more appropriate to readSO anyway).
I could definitely integrate this into the regular readSO with a little bit of work, adding a type argument where fixed or fwf can be specified.
Actually... perhaps sep = "fwf" can be an option?
Sure, something like:
if (sep == "fwf") {
  readfwf()
} else {
  readasnormal()
}
Looking at the function again, its performance is not up to par. There are a lot of cases where printed data includes only one space between columns, so the current approach doesn't work very reliably.
I'll try something with maybe a dummy matrix of where there are spaces, and any "columns" of zeroes for spaces that match the number of rows in the dataset can be assumed to be the position for a column break. It'll probably be at least a few days before I revisit this though.
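To make the idea concrete, here is a minimal sketch of the "matrix of spaces" approach. The function name split_on_blank_columns is made up for illustration and is not part of the package; it assumes the input is a character vector of data lines with no header, and it treats any character position that is a space in every line as a column break.

```r
# Hypothetical helper (not in the package): split lines at character
# positions that are blank in every row.
split_on_blank_columns <- function(lines) {
  width <- max(nchar(lines))
  # Pad short lines with spaces so every line has the same width.
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  # One row per line, one column per character position.
  chars <- do.call(rbind, strsplit(padded, ""))
  # TRUE for positions where every row has a space.
  blank <- colSums(chars == " ") == nrow(chars)
  # Runs of non-blank positions are the fields.
  runs <- rle(blank)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  keep <- !runs$values
  fields <- mapply(function(s, e) trimws(substring(padded, s, e)),
                   starts[keep], ends[keep], SIMPLIFY = FALSE)
  names(fields) <- paste0("V", seq_along(fields))
  as.data.frame(fields, stringsAsFactors = FALSE)
}

lines <- c("0 2000 val ues",
           "1 2000 valuess",
           "2 2000 b",
           "3 2000 d")
parsed <- split_on_blank_columns(lines)
parsed
```

All columns come back as character here; type conversion (for example with type.convert) would be a separate step. The approach also still fails if every row happens to have a space at the same position inside a value.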
Where are the performance hits? It looks to me like you could condense a lot of the regexes and, given that many of them can only match once at the beginning of the string, replace gsub with sub (although, if I remember correctly, that's not as much of a performance improvement as the one between gregexpr and regexpr).
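A small illustration of the gsub/sub point, using a made-up pattern and input rather than the regexes actually used in the function: when a pattern is anchored at the start of the string it can match at most once, so both calls give the same result.

```r
# Made-up input lines with leading padding (not taken from the package tests).
x <- c("  1 2000 a", "   2 2000 b")

# The pattern is anchored with ^, so it can match at most once per string.
with_gsub <- gsub("^\\s+", "", x)
with_sub  <- sub("^\\s+", "", x)

# Both calls strip the same leading whitespace.
identical(with_gsub, with_sub)
```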
Sorry, I didn't communicate the problem clearly. "Performance" here isn't about efficiency, particularly since these functions (readSO, readSOfwf) (somewhat by design) are only meant to tackle small problems. "Performance" here means "not getting the desired result".
Take the following example (which is not unreasonable for R printout):
0 2000 val ues
1 2000 valuess
2 2000 b
3 2000 d
The way readSOfwf is currently written, this will all be seen as a single column.
You could try assuming the number of fields from the header. In your case, I think it's not a good idea to have spaces in the header so I'd edit the question to remove all spaces from headers.
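A rough sketch of the header-based idea, under some loud assumptions: the header line below is invented (the example above has none), the row numbers are counted as one extra field, and only the last field is allowed to contain spaces. split_n is a hypothetical helper, not something in the package.

```r
# Hypothetical helper: split a line into exactly n fields on whitespace,
# letting the last field absorb whatever remains (including spaces).
split_n <- function(line, n) {
  out <- rep(NA_character_, n)
  rest <- trimws(line)
  for (i in seq_len(n)) {
    m <- regexpr("[[:space:]]+", rest)
    if (i == n || m == -1) {
      out[i] <- rest
      break
    }
    out[i] <- substr(rest, 1, m - 1)
    rest <- substr(rest, m + attr(m, "match.length"), nchar(rest))
  }
  out
}

header <- "year label"  # assumed header, not from the original example
# Header fields plus one for the row-number column.
n_fields <- length(strsplit(trimws(header), "[[:space:]]+")[[1]]) + 1

lines <- c("0 2000 val ues",
           "1 2000 valuess",
           "2 2000 b",
           "3 2000 d")
parsed <- do.call(rbind, lapply(lines, split_n, n = n_fields))
parsed
```

This only helps when the header itself has no spaces, which is why editing the question to remove them would be needed first.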
Sometimes data is pasted to Stack Overflow in a fixed width format like the following:
readSO cannot correctly read such data because of the spaces in the fourth column.
readSOfwf() (Gist of rough concept here) attempts to address this problem in a somewhat crude way:
This won't work, though, with cases like the data shared in this question. In that example, it will simply return a single column.