Increase input files extensions

pesap commented 6 years ago

Hey @josiahjohnston

I was wondering if we could add more extensions for the input files such as .csv, .tsv, etc. I think this will give more flexibility for some users of switch. I can do the pull request for this. It is an easy feature implementation.

bmaluenda commented 6 years ago

Hi

I imagine that by "adding more extensions" you mean not only to accept different extensions in the filename, but also to correctly parse these other data file formats, such as comma-separated values. If that is the case and you are willing to implement it, I say go ahead :) (I would try to review it)

It would be a nice addition, especially considering that most people are used to working with .csv and not with tab-separated values.

pesap commented 6 years ago

Yes. You are right. I meant to parse different file formats. I will do the pull request :)

-- Pedro Andrés Sánchez Pérez

josiahjohnston commented 6 years ago

Thanks Benjamin for being a better communicator than me! :)

The use of .tab files was one of the more established paths with the pyomo codebase, and is a hold-over from their initial desire to match the AMPL file formats, although the formats are increasingly diverging. The pyomo DataPortal interface seemed most flexible for our purposes, but it didn't have everything we wanted and we've already written a wrapper around it to provide more features. It can be slow at parsing and assembling data, so it's not ideal, but it's also code we don't have to maintain. If you can write support for csv input files, that seems dandy; there's a chance DataPortal already supports it in an undocumented way, but I haven't looked into that yet.

As far as .tsv files go, I think they have the same conventions as a .tab file, but a different extension. If I'm correct, then you won't have to write any new code to support tsv; just pass a different file name to DataPortal.load()

If you have to write new code to implement support in general, I'd suggest using pandas for reading files from disk, then stuffing the data into the DataPortal dictionary. Pandas works efficiently with a wide variety of file formats, is fairly well known, and is maintained and expanded by a broad community. We'll need a little bit of glue code to link pandas to DataPortal, but that shouldn't be too lengthy or difficult to maintain. If you look under the hood, a DataPortal object has a massive nested dictionary that stores everything it has read in, and I've read some Pyomo documentation saying you can add to that dictionary directly as long as you follow their conventions. I think some of the code for parsing partial load heat rates already manipulates that dictionary directly.
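As a rough illustration of the glue code described above, here is a minimal sketch that reads a table with pandas and reshapes it into per-parameter dictionaries keyed by index. The helper name `table_to_param_dicts` and the column names are hypothetical, and the exact nested layout that DataPortal expects is not shown here; that would need to be checked against the Pyomo documentation.

```python
import io

import pandas as pd


def table_to_param_dicts(df, index_cols, param_cols):
    """Reshape a table into {param_name: {index: value}} dictionaries.

    Hypothetical helper: missing cells are dropped so that optional
    values are simply absent from the resulting dictionary.
    """
    indexed = df.set_index(index_cols)
    return {col: indexed[col].dropna().to_dict() for col in param_cols}


# Illustrative data only; these column names are assumptions.
csv_text = (
    "GENERATION_PROJECT,gen_capacity_limit_mw,gen_max_age\n"
    "wind_1,500,25\n"
    "solar_1,,20\n"
)
df = pd.read_csv(io.StringIO(csv_text))
params = table_to_param_dicts(
    df,
    index_cols=["GENERATION_PROJECT"],
    param_cols=["gen_capacity_limit_mw", "gen_max_age"],
)
print(params)
```

Dropping missing cells before building each dictionary is one way to handle optional values: `solar_1` has no capacity limit in the sample data, so it does not appear under `gen_capacity_limit_mw`.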

Best of luck.

https://software.sandia.gov/downloads/pub/pyomo/PyomoOnlineDocs.html#_data_input

mfripp commented 5 years ago

I think we could pretty easily support any file format that Pyomo allows. I'm not sure if .tsv is on that list. But I would actually be more in favor of just standardizing on .csv for both input and output. A few reasons for this:

- the standard Switch modules are going to read and create files with some extension (currently .tab, maybe eventually .csv), so users must have a workflow for creating these anyway (unless the idea is to search for generation_projects_info.*, in which case we need more discussion)
- .tab files are not generally recognized by file viewers, .tsv is marginally better, but .csv support is quite widespread (e.g., double-click to open in Excel; use Mac quick viewers; use pandas.read_csv() with no extra arguments, etc.)
- most programming text editors have a very hard time editing .tsv files or .tab files with tab separators (i.e., the tab key inserts spaces). This makes it hard to whip up a little demo model.
- there is no way to have spaces within values in .tab files (quoting doesn't work). So if we use .tab files we can't have natural timestamp labels like "2040-12-03 01:00" (which can be easily interpreted by Excel, Pandas, etc.). These are fine in .csv files.

As an amendment to my first point: I'm working on code to allow users to specify aliases for any input file from the command line (or in scenarios.txt or options.txt), e.g., --alias gen_build_costs.tab=gen_build_costs_low.tab. This would be useful for running different scenarios, e.g., swap between high and low equipment or fuel prices. I haven't added it to the main codebase yet because it doesn't really extend our functionality, just reduces storage requirements (you can already run these other scenarios by creating input directories for each interesting permutation). But this could also be a way to support use of other file formats even in the standard modules, e.g., --alias gen_build_costs.tab=gen_build_costs_low.xlsx. So we could do both -- use .csv files by default, but allow users to specify other formats via an alias, and allow users to load data from any pyomo-supported file format in their own custom modules. We'd also need some command-line argument to specify the output format to use.
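To make the alias idea concrete, here is a minimal sketch of how NAME=REPLACEMENT pairs from the command line could be turned into a lookup table and applied when locating an input file. The function names `parse_aliases` and `resolve_input` and the `inputs` directory are hypothetical; this is not the actual Switch implementation.

```python
import argparse
import os


def parse_aliases(pairs):
    """Turn ['a.tab=b.tab', ...] into {'a.tab': 'b.tab', ...}."""
    aliases = {}
    for pair in pairs:
        standard, sep, replacement = pair.partition("=")
        if not sep or not standard or not replacement:
            raise ValueError(f"expected NAME=REPLACEMENT, got {pair!r}")
        aliases[standard] = replacement
    return aliases


def resolve_input(inputs_dir, standard_name, aliases):
    """Return the path to read for a standard input file name."""
    return os.path.join(inputs_dir, aliases.get(standard_name, standard_name))


parser = argparse.ArgumentParser()
# The flag may be repeated; each use appends one NAME=REPLACEMENT pair.
parser.add_argument("--alias", action="append", default=[],
                    metavar="NAME=REPLACEMENT")
args = parser.parse_args(
    ["--alias", "gen_build_costs.tab=gen_build_costs_low.tab"]
)

aliases = parse_aliases(args.alias)
path = resolve_input("inputs", "gen_build_costs.tab", aliases)
print(path)
```

Files without an alias fall through to their standard name, so a scenario only needs to list the inputs that differ from the base case.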

josiahjohnston commented 5 years ago

@mfripp I wrote a quick patch to support csv files (in addition to tab) (02aa13d509a08c5d937869e3bec9d1b53d9e4a3d), but didn't change the rest of the code. Not sure if that commit would be better off in the 2.0.1 branch or master. That idea could be extended to support xlsx; it just needs to customize the header parsing code. Supporting other file formats (or direct DB connections) would require more thought for how to allow optional columns.

Good points; a few comments:

- Re: no space support. Yeah, Pyomo's tab parser sucks and uses a fragile home-brewed solution of tokens = re.split("[\t ]+",line) instead of python's standard file parser. Their csv parser at least uses the python csv.reader which supports quotes & spaces. This might be the single biggest reason to switch to CSV.
- One major drawback with CSV is they can be a hassle in countries that use a comma instead of a dot for decimal points.
- Good text editors provide an option of whether or not to expand tabs into spaces. Although, many users may not know how to easily access that option, so your general point on ease-of-use still stands.
- Yeah, TAB is an unusual variant of TSV that isn't recognized by OS's. TSV is fully supported on my machine (Mac 10.14.1 "Mojave"), but Pyomo gives errors on TSV extensions.
- Reducing disk requirements through aliasing seems like a nice feature. Have you looked into using namespaces to help manage multiple scenarios loaded at the same time in a python process? Seems like namespaces could be useful (and low-RAM usage) for some of your applications.
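The space-handling difference from the first point can be reproduced with the standard library alone. This sketch applies the regular expression quoted above to a tab-separated line and compares it with `csv.reader` on the equivalent comma-separated line; the sample values are made up for illustration.

```python
import csv
import io
import re

# Splitting on runs of tabs OR spaces breaks a timestamp into two tokens.
tab_line = "1\t2040-12-03 01:00"
tokens = re.split("[\t ]+", tab_line)

# Quoting does not help, because the split ignores quote characters.
quoted_tab_line = '1\t"2040-12-03 01:00"'
quoted_tokens = re.split("[\t ]+", quoted_tab_line)

# csv.reader splits only on the delimiter and honors quotes.
csv_line = '1,"2040-12-03 01:00"'
row = next(csv.reader(io.StringIO(csv_line)))

print(tokens)
print(quoted_tokens)
print(row)
```

The first two results contain three tokens each, while `csv.reader` returns two fields with the timestamp intact.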

If we had reason to stick with tab-separated-value, our best bet might be to write a new pyomo data plugin called tsv_table.py that was almost identical to csv, but with a different separator. Then submit a pull request.
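For a sense of how small the difference would be, here is a sketch of one reader function that handles both formats by changing only the delimiter passed to Python's `csv.reader`. It does not use Pyomo's plugin interface, which is not reproduced here; `read_delimited` and the sample columns are hypothetical.

```python
import csv
import io


def read_delimited(stream, delimiter):
    """Read a delimited table into a list of {column: value} dicts."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


tsv_text = "timepoint_id\ttimestamp\n1\t2040-12-03 01:00\n"
csv_text = "timepoint_id,timestamp\n1,2040-12-03 01:00\n"

tsv_rows = read_delimited(io.StringIO(tsv_text), delimiter="\t")
csv_rows = read_delimited(io.StringIO(csv_text), delimiter=",")

print(tsv_rows)
print(csv_rows)
```

Because only the tab character separates fields here, values containing spaces survive in the tab-separated case as well.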

FTR, Pyomo's DataPortal now supports way more data formats than when we first wrote data loading code:

- TAB File: A text file format that uses whitespace to separate columns of values in each row of a table.
  - This reader expects a .tab instead of .tsv; asking DataPortal to parse a TSV file yields IOError: Unknown file format 'tsv'
- CSV File: A text file format that uses comma or other delimiters to separate columns of values in each row of a table.
- JSON File: A popular lightweight data-interchange format that is easily parsed.
- YAML File: A human friendly data serialization standard.
- XML File: An extensible markup language for documents and data structures. XML files can represent tabular data.
- Excel File: A spreadsheet data format that is primarily used by the Microsoft Excel application.
- Database: A relational database.
- DAT File: A Pyomo data command file.

Also worth noting is documentation & official support for skipping DataPortal and directly using Python dictionaries.

josiahjohnston commented 5 years ago

Release 2.0.5 transitions all input & output files to .csv. Well, all outputs except the trivial total_cost.txt that stores a single number and the results.pickle file that stores the solution in binary format. This release should go out in the next 24-48 hours.

There's another option --input-aliases that will allow you to specify alternative names for expected input files. Matthias has that on a pre-release branch and plans to merge it in. That should allow you to use '.tab' input files instead of .csv, or .tsv if a future version of Pyomo supports them.

Any developer who wishes to use other input file formats for their modules may write new modules that use any allowed DataPortal format via standard calls to DataPortal.load on the switch_model DataPortal instance, rather than calling load_aug (our thin wrapper around load). If anyone has a use case for disabling our input method and using their own methods instead, please contact us for tips on how to go about doing that.

@pesap Does this address your issue? Please reply in the next month or two, or we may close this issue as part of housekeeping.

Cheers, -Josiah

switch-model / switch

Increase input files extensions #100