Parsing Strategies for Green Button XML

Hi - I'm interested in getting your input on parsing strategies for Green Button XML files. I'm considering starting a Ruby implementation of a Green Button parser gem, to give developers a faster on-ramp to building apps. I think your input would be really helpful, since you've already done it once. Would you be interested in chatting sometime in the next week or two?

Charles
Presidential Innovation Fellow @ DOE, working on Open Data / Green Button
charles.worthington@ee.doe.gov

Hi Charles-

I think the main trick with Green Button XML is that there are a lot of different and quirky implementations. The specification is pretty flexible: a single file can carry data from more than one meter, fuel type, set of units, location, or view of the same underlying data. In practice, 95% of implementations seem to be single-location, whole-building electricity, with some daily or monthly natural gas (NG) readings. Depending on your needs, you may also want to support CSV or other formats within the same framework.

For all of these reasons, I recommend collecting as many examples as you can get your hands on and setting up a good test framework to ensure compatibility with each. I also recommend putting your parser and parser output behind an interface that is not XML specific in method calls or returned data.
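To make the "interface that is not XML specific" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and not taken from this project: the `Reading` record, the `ReadingParser` interface, the `CsvReadingParser` class and its `start_epoch,duration_s,value` column layout are all assumptions chosen for illustration. The point is only that callers depend on plain records, so an XML parser and a CSV parser can sit behind the same method.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Protocol


@dataclass(frozen=True)
class Reading:
    """One interval reading, independent of the source file format."""
    start: datetime   # start of the interval (UTC)
    duration_s: int   # interval length in seconds
    value: float      # consumption in whatever unit the source reports


class ReadingParser(Protocol):
    """Hypothetical interface: callers never see XML or CSV details."""
    def parse(self, text: str) -> List[Reading]: ...


class CsvReadingParser:
    """Parses rows of start_epoch,duration_s,value (an assumed layout)."""
    def parse(self, text: str) -> List[Reading]:
        readings = []
        for row in csv.DictReader(io.StringIO(text)):
            start = datetime.fromtimestamp(int(row["start_epoch"]), tz=timezone.utc)
            readings.append(Reading(start, int(row["duration_s"]), float(row["value"])))
        return readings


def total_consumption(parser: ReadingParser, text: str) -> float:
    """Example caller that works with any parser behind the interface."""
    return sum(r.value for r in parser.parse(text))


# Made-up sample data, usable as a test fixture.
SAMPLE_CSV = "start_epoch,duration_s,value\n1388534400,3600,450\n1388538000,3600,512\n"
```

A test suite in this style would keep one fixture file per utility or vendor sample and run the same assertions against each parser.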

Have you had a chance to locate and look through the parsing code from this project? If not, you will find that I took a pretty pragmatic approach: parse everything into a DOM (assuming the data feeds won't be too huge), then run tree expression matches and loop through children to pull the data out into a lightweight data structure. Depending on your goals, this might be too memory intensive, a poor match for the volume of data you want to process, or more or less what you want to do.
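For readers who have not opened the code, the general shape of that approach can be sketched as below. This is not the project's parser. It is a minimal illustration using Python's standard `xml.etree.ElementTree`, and the sample feed, the `parse_interval_readings` name and the tuple layout are assumptions. The Atom and ESPI namespace URIs are the ones I would expect in a Green Button file, so check them against your own samples.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the path expressions below (assumed URIs).
NS = {"atom": "http://www.w3.org/2005/Atom", "espi": "http://naesb.org/espi"}

# Made-up, heavily trimmed sample feed.
SAMPLE_XML = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content>
      <IntervalBlock xmlns="http://naesb.org/espi">
        <IntervalReading>
          <timePeriod><duration>3600</duration><start>1388534400</start></timePeriod>
          <value>450</value>
        </IntervalReading>
        <IntervalReading>
          <timePeriod><duration>3600</duration><start>1388538000</start></timePeriod>
          <value>512</value>
        </IntervalReading>
      </IntervalBlock>
    </content>
  </entry>
</feed>"""


def parse_interval_readings(xml_text):
    """Load the whole feed into memory, then pull readings into plain tuples."""
    root = ET.fromstring(xml_text)
    readings = []
    # Path expression: every IntervalReading anywhere below the feed root.
    for node in root.iterfind(".//espi:IntervalReading", NS):
        # A missing child would make findtext return None and int() fail;
        # a real parser would need to handle that for quirky files.
        start = int(node.findtext("espi:timePeriod/espi:start", namespaces=NS))
        duration = int(node.findtext("espi:timePeriod/espi:duration", namespaces=NS))
        value = int(node.findtext("espi:value", namespaces=NS))
        readings.append((start, duration, value))
    return readings
```

Because `ET.fromstring` holds the whole tree in memory, this only suits feeds of modest size; a streaming parser such as `ET.iterparse` would be the usual alternative for very large files.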

I would be happy if your work begins as a port of some of this project's code to Ruby. It isn't all as clean as I would like, but the code is in pretty good shape.

Let me know how else I can be of help.

sam


Yes, the strategy you take in your parser is similar to what I would do based on what I can tell about the Green Button spec, assuming that most use cases deal with files small enough to be handled in memory. Looping through the whole document looking for related nodes, based on links of arbitrary number and length, seems somewhat inelegant, but I can't really think of a better way to do it. I wondered if you had considered any other approaches, or whether, once you were done with this implementation, you had any "I wish I had done it differently" insights.
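One possible variation on the whole-document loop, sketched in Python purely for illustration: walk the entries once, index them by their link to self, and then resolve related links with dictionary lookups. The helper names and sample feed are hypothetical, and the sketch assumes that a related href matches some entry's self href exactly, which real feeds may not guarantee.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample: the first entry points at the second via a related link.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>usage point</title>
    <link rel="self" href="UsagePoint/01"/>
    <link rel="related" href="UsagePoint/01/MeterReading/01"/>
  </entry>
  <entry>
    <title>meter reading</title>
    <link rel="self" href="UsagePoint/01/MeterReading/01"/>
    <link rel="related" href="ReadingType/07"/>
  </entry>
</feed>"""


def links(entry, rel):
    """Return the href of every <link> child with the given rel value."""
    return [l.get("href") for l in entry.findall(ATOM + "link") if l.get("rel") == rel]


def index_by_self(root):
    """Single pass over the feed: map each self href to its entry."""
    index = {}
    for entry in root.iter(ATOM + "entry"):
        for href in links(entry, "self"):
            index[href] = entry
    return index


def related_entries(entry, index):
    """Look up related entries; hrefs with no matching entry are skipped."""
    return [index[h] for h in links(entry, "related") if h in index]
```

This trades a little memory for the index against repeated scans of the document. It does not remove the need to understand how a given data provider uses its links.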

I definitely want to expose an API to the data that does not require the developer to understand the structure of the XML file; I think this is a major impediment to developing Green Button apps as it currently stands because parsing the data from the XML is so difficult.

I will use your code as an inspiration for mine. I'm also going to be helping organize a series of hackathons in the next two months where a number of developers will be exposed to Green Button for the first time. I plan to point them to your implementation as a good example, so you may see some additional traffic on this repo.

Have you considered breaking your parser out into a separate library?

On 1/13/2014 12:24 PM, Charles Worthington wrote:

Yes, the strategy you take in your parser is similar to what I would do based on what I can tell about the Green Button spec, assuming that most use cases deal with files small enough to be handled in memory. Looping through the whole document looking for related nodes, based on links of arbitrary number and length, seems somewhat inelegant, but I can't really think of a better way to do it. I wondered if you had considered any other approaches, or whether, once you were done with this implementation, you had any "I wish I had done it differently" insights.

Actually, using the links to self within entries in the Green Button XML samples to determine what type of content is enclosed is redundant with doing the identification from the name of the first child of the entry's "content" node. The latter is probably a safer and more general way to do it. I changed the code to the "entryType" method when I figured this out, and that is what the "for content" loop at the end does. For backwards compatibility, I decided that if a link to self is there I would continue using it; otherwise I look at the name of the first child of the content node. If I had to do it again, I think the content node alone would suffice. However, there may be exotic cases where there are multiple "instances" of the same content type, and the instance identifiers provided by the links to self could prove useful.
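As a rough Python illustration of the content-node approach (not the project's "entryType" method), the sketch below takes the type from the first child of the content node and returns the link to self, when present, only as an instance identifier. The `classify_entry` name, the sample feed and the href values are assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample: one entry with a link to self, one without.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <link rel="self" href="UsagePoint/01"/>
    <content><UsagePoint xmlns="http://naesb.org/espi"/></content>
  </entry>
  <entry>
    <content><IntervalBlock xmlns="http://naesb.org/espi"/></content>
  </entry>
</feed>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_entry(entry):
    """Return (content type, self href); either may be None if absent."""
    content = entry.find(ATOM + "content")
    kind = None
    if content is not None and len(content) > 0:
        # The name of the first child of <content> identifies the type.
        kind = local_name(content[0].tag)
    self_href = None
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "self":
            self_href = link.get("href")
            break
    return kind, self_href
```

Keeping the self href alongside the type leaves room for the "multiple instances" case without making the type detection depend on any particular URL layout.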

I definitely want to expose an API to the data that does not require the developer to understand the structure of the XML file; I think this is a major impediment to developing Green Button apps as it currently stands because parsing the data from the XML is so difficult.

I have found that the biggest impediment is getting people to realize that they have access to their data and getting them to figure out how to log in and get it!

I will use your code as an inspiration for mine. I'm also going to be helping organize a series of hackathons in the next two months where a number of developers will be exposed to Green Button for the first time. I plan to point them to your implementation as a good example, so you may see some additional traffic on this repo.

Have you considered breaking your parser out into a separate library?

Hackathon usage would be great. I have thought quite a bit about making the parser a separate library. I was more interested in getting people to do visualizations and analysis with the data, so I wanted everything together, but in retrospect it should probably be separate. Maybe a hackathon participant would be willing to help split it out.


sborgeson / building-data-analysis
