go-gota / gota

Gota: DataFrames and data wrangling in Go (Golang)

Would this package work with inconsistently delimited files? #113

Closed MostHated closed 4 years ago

MostHated commented 4 years ago

Hey there. I have a folder full of files, and unfortunately the application that generates them doesn't keep their formatting consistent. Examples are below. From what I can tell, the first three lines are always comments. Next comes a section starting with HCONTEXT; there might be just one, or there might be several. There are not always additional comments before the sets of data, but in the second example there are. The data rows are always laid out the same way, in four columns: an application symbol, a label, a description, and a list of 0 to N single keys or key combinations (Alt+Z, Ctrl+T, etc.) delimited by spaces.

The main issue is that the delimiter between the four columns is not consistent. The layout of the data is always the same, but the columns might be separated by a single tab (\t), two tabs, three tabs, a single tab and a space (\s), two spaces and a tab (\s\t\s or \s\s\t), etc.

If someone could let me know whether this library is able to help with this, I would greatly appreciate it. If not, does anyone happen to know of one that might? I was not sure what search terms to use when looking; I tried "parse text", "csv", "multiple delimiters", and various other things. I may end up needing multiple libraries and separate steps, but I am hoping to keep it as performant as possible at runtime.

Thanks! -MH

//
// Desktop manager (separate app)
//

HCONTEXT deskmgr "Desktop Manager" "These keys are used in the Desktop Manager dialog."

deskmgr.new     "New"       "Create a new desktop"      Alt+N N
deskmgr.add     "Add"       "Add a desktop"         Alt+D D
deskmgr.apply       "Apply"     "Apply current changes"
deskmgr.accept      "Accept"    "Accept current changes"
deskmgr.discard     "Discard"   "Discard current changes"
deskmgr.reload      "Reload"    "Reload the desktops"
deskmgr.refresh     "Refresh"   "Refresh the desktops"
deskmgr.save        "Save"      "Save current changes"      Alt+S S
deskmgr.cancel      "Cancel"    "Cancel current changes"    Esc
//
// Gplay hotkeys
//

HCONTEXT gplay "GPLAY Geometry Viewer" "These keys apply to the Geometry Viewer application."

// File menu
gplay.open      "Open"          "Open"          Alt+O Ctrl+O
gplay.quit      "Quit"          "Quit"          Alt+Q Ctrl+Q

// Display menu
gplay.display_info  "Geometry Info"     "Geometry Info"     Alt+I
gplay.unpack        "Unpack Geometry"   "Unpack Geometry"   Alt+U
gplay.display_ssheet    "Geometry Spreadsheet"  "Geometry Speadsheet"   Alt+S
gplay.flipbook      "Flipbook Current Viewport" "Flipbook the currently selected viewport"  Alt+F
gplay.display_prefs "Preferences"       "Preferences"       

// Help menu
gplay.help_menu     "Help Menu"     "Help Menu"     Alt+H

// Commands not in menus
gplay.quick_quit    "Quick Quit"        "Quick Quit"        Q
gplay.next_geo      "Next Geometry"     "Next Geometry"     N
gplay.prev_geo      "Previous Geometry" "Previous Geometry" P
gplay.stop_play     "Stop Play"     "Stop Play"     Space

kniren commented 4 years ago

No matter what you do, I don't think any package will be able to load this as a data frame right out of the box.

My suggestion is that you clean up your data first using regular expression substitution to remove comments, newlines and non-tabular data. With this, you could also write a regex to prepare all the files to use the same delimiter.

I guess you could do this in Go directly or using sed, awk and UNIX pipes.

Good luck!
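
A minimal sketch of the cleanup suggested above, done in Go with the standard `regexp` package. The `Normalize` helper and the `rowRe` pattern are hypothetical names, not part of gota. The pattern assumes every data row looks like the samples in the question: an unquoted symbol, two double-quoted fields, then an optional key list. Double quotes inside a label or description are not handled.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// rowRe matches one data row: symbol, quoted label, quoted description,
// then an optional key list. Any mix of tabs and spaces may separate columns.
var rowRe = regexp.MustCompile(`^(\S+)[ \t]+"([^"]*)"[ \t]+"([^"]*)"[ \t]*(.*)$`)

// Normalize (hypothetical helper) drops comments, blank lines and HCONTEXT
// lines, and rewrites every data row as exactly four tab-separated columns.
// Rows without keys get an empty fourth column.
func Normalize(input string) []string {
	var rows []string
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "//") || strings.HasPrefix(line, "HCONTEXT") {
			continue
		}
		m := rowRe.FindStringSubmatch(line)
		if m == nil {
			continue // not a data row in the assumed layout
		}
		// Collapse whatever whitespace separates the keys into single spaces.
		keys := strings.Join(strings.Fields(m[4]), " ")
		rows = append(rows, strings.Join([]string{m[1], m[2], m[3], keys}, "\t"))
	}
	return rows
}

func main() {
	sample := "// Desktop manager\n" +
		"HCONTEXT deskmgr \"Desktop Manager\" \"Keys for the dialog.\"\n\n" +
		"deskmgr.new \t\"New\"\t\t\"Create a new desktop\"  \tAlt+N N\n" +
		"deskmgr.apply\t\t\"Apply\" \t\"Apply current changes\"\n"
	for _, row := range Normalize(sample) {
		fmt.Printf("%q\n", row)
	}
}
```

Matching each row as a whole, rather than substituting delimiters globally, keeps the spaces inside the quoted label and description intact.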

MostHated commented 4 years ago

Thanks for the suggestion. I was able to do just that.
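
For anyone landing here later: once the files have been cleaned to a uniform layout, a plain delimited-text reader can take over. The sketch below is an illustration and not the code actually used in this thread. It assumes each row has already been reduced to exactly four tab-separated columns, with the key list in the last one. `Hotkey` and `ParseClean` are hypothetical names. It uses only the standard `encoding/csv` package.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Hotkey is a hypothetical record type for one cleaned row.
type Hotkey struct {
	Symbol      string
	Label       string
	Description string
	Keys        []string // zero or more keys or key combinations
}

// ParseClean (hypothetical helper) reads rows that were already normalised
// to four tab-separated columns and splits the key list on whitespace.
func ParseClean(clean string) ([]Hotkey, error) {
	r := csv.NewReader(strings.NewReader(clean))
	r.Comma = '\t'        // single tab as the only delimiter
	r.FieldsPerRecord = 4 // reject rows that are not four columns wide
	recs, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	out := make([]Hotkey, 0, len(recs))
	for _, rec := range recs {
		out = append(out, Hotkey{
			Symbol:      rec[0],
			Label:       rec[1],
			Description: rec[2],
			Keys:        strings.Fields(rec[3]),
		})
	}
	return out, nil
}

func main() {
	clean := "deskmgr.new\tNew\tCreate a new desktop\tAlt+N N\n" +
		"deskmgr.apply\tApply\tApply current changes\t\n"
	hotkeys, err := ParseClean(clean)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, h := range hotkeys {
		fmt.Printf("%s has %d key binding(s)\n", h.Symbol, len(h.Keys))
	}
}
```

The same cleaned text should also be usable with any CSV-style loader that accepts a custom delimiter.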