statedecoded / statedecoded

Legal codes, for humans.
https://www.statedecoded.com/

Support JSON imports #270

Open waldoj opened 11 years ago

waldoj commented 11 years ago

The XML import system is swell, but XML is awful. For folks who don't want to deal with XML, but also don't want to modify the parser just to import JSON, we should add native JSON support.

The trick here is going to be avoiding duplication of code. The XML extraction is pretty bespoke—there's nothing obvious about how JSON importing can live within the same system.

waldoj commented 11 years ago

(@krues8dr is working on this.)

krusynth commented 11 years ago

I only copied over the important bits from the DC Code branch, which were actually really simple. Then I juggled the error handling a bit and added a new requirement to the config file to give the path where the import files should live. I also renamed that directory, because 'xml' isn't a very good name for a place to keep JSON files. ;)

krusynth commented 11 years ago

This could probably do with some testing. Tomorrow I plan to run it against our generated JSON files to see if it works as expected.

waldoj commented 11 years ago

This is great! I hope it works. :)

waldoj commented 11 years ago

(I've been on the go, getting to and traveling around Buenos Aires, so I'm not in a position to test this just now.)

krusynth commented 11 years ago

(@waldoj no worries. As an aside, I'm generally only closing things that I know that you've already looked at, or that I generally feel don't really need any looking over. Everything else I'm just leaving a note that it's 'Pretty Much Done'.)

krusynth commented 11 years ago

Ugh. I just wasted the entire morning trying to get this to actually run on our exported JSON files, but their format is so different from what we're importing via XML that supporting it would mean a complete rewrite.

So, technically, this does handle JSON, but the parsing is going to vary depending on the source. Is it worth actually implementing those details?

(Did not mean to close this. Oops.)

waldoj commented 11 years ago

I'd say the best thing to do, then, is to change the format of the JSON files. Does that seem reasonable?

krusynth commented 11 years ago

For future reference: to generate test files, I just ran our XML files through https://github.com/hay/xml2json

ls ./import-data/ | xargs -I {} python ~/Downloads/xml2json-master/xml2json.py -t xml2json ./import-data/{} -o ./json/{}.json

and then removed all the @s with sed

ls ./json/ | xargs -I {} sed -i 's/@//g' ./json/{}

which has gotten me 90% of the way there. Now I'm just dealing with the fact that the XML parser and the JSON parser return slightly different data types...
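
A note on the sed step above: `s/@//g` removes every `@` in the file, including any that appear inside the text of a law (an email address, say). A minimal sketch of a narrower cleanup follows, assuming the `@` only needs to come off the front of JSON keys, where the converter puts it to mark XML attributes. The function name and the sample data are made up for illustration and are not part of State Decoded or xml2json.

```python
import json


def strip_attribute_prefix(node, prefix="@"):
    """Return a copy of parsed JSON data with `prefix` removed from dict keys only.

    Caveat: if a dict holds both "@id" and "id", the two keys collide and
    one value silently wins. This sketch does not try to resolve that.
    """
    if isinstance(node, dict):
        return {
            (key[len(prefix):] if key.startswith(prefix) else key):
                strip_attribute_prefix(value, prefix)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_attribute_prefix(item, prefix) for item in node]
    # Strings, numbers, booleans, and None are returned untouched,
    # so an "@" inside the text itself survives.
    return node


# Hypothetical sample, shaped like converter output for one law.
sample = json.loads('{"law": {"@id": "1-1", "text": "Contact clerk@example.gov"}}')
cleaned = strip_attribute_prefix(sample)
print(json.dumps(cleaned))
```

To use it on real files, load each one with `json.load`, pass the result through the function, and write it back with `json.dump`.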

krusynth commented 11 years ago

... which means that a lot of this relies on SimpleXMLElement objects, which don't actually behave like standard stdClass objects. The usual json_decode(json_encode($obj)) trick isn't working to swap types here, so I'm trying other methods and considering other options. Are we set on SimpleXML, or would XMLReader be an acceptable replacement?

krusynth commented 11 years ago

"SimpleXMLElement::children() returns a node object no matter if the current node has children or not."

http://php.net/manual/en/simplexmlelement.children.php :rage:

waldoj commented 11 years ago

I'm starting to feel like this might be a time-sink. If you don't feel like you're making adequate progress on this, just leave this where it's at. It's fine. There are plenty of other issues to be resolved. :)

krusynth commented 11 years ago

The thing that I'm working on now actually gets us closer to allowing any data source, so it's worth finishing up. We're doing a few things that are very SimpleXMLElement specific (mainly string-casting) and need attention anyway. I'll have another go tomorrow morning once I'm hyper-caffeinated.
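
For readers following along, the "any data source" idea can be sketched as: convert whatever the parser returns into plain dicts, lists, and strings first, so the rest of the importer never depends on a parser-specific object type. The following is a Python analogy using the standard library, not the State Decoded PHP code. The function name and the sample element names are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET


def element_to_plain(element):
    """Convert an XML element into plain dicts, lists, and strings."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        # Leaf node: return a real string, so callers never need to cast.
        return text
    node = dict(element.attrib)
    if text:
        node["text"] = text
    for child in children:
        value = element_to_plain(child)
        if child.tag in node:
            # Repeated tag (or a clash with an attribute name):
            # collect the values in a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


# Hypothetical sample data; the element names are illustrative only.
xml_source = (
    '<law id="1-1"><catch_line>Short title</catch_line>'
    "<section>a</section><section>b</section></law>"
)
json_source = '{"id": "1-1", "catch_line": "Short title", "section": ["a", "b"]}'

from_xml = element_to_plain(ET.fromstring(xml_source))
from_json = json.loads(json_source)
print(from_xml == from_json)
```

Once both inputs reduce to the same plain structure, the code that walks it can be shared between the XML and JSON paths.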

waldoj commented 11 years ago

I'll definitely defer to your judgment because a) the flu has broken my brain and b) this is one of those issues you've got to get deep inside of to have a proper perspective on.

krusynth commented 11 years ago

Ok, so I got this mostly working, but it still needs work. I'm really stuck on some prefix_hierarchy madness that is breaking all the time. I'm reverting for now, but this should be revisited in the future - SimpleXML does things normal objects don't, and we're making some really bad assumptions in the (string) type conversions from objects there. To wit, most of this will probably only work on SimpleXML objects, and other data types might have issues.

My patch has two pieces. This function needs to be added to functions.inc.php and is used to translate those SimpleXML classes: https://gist.github.com/krues8dr/6463853

The second is a replacement for the base State parser - as I said, it's not working 100% at the moment, and needs more attention: https://gist.github.com/krues8dr/00d49b52e76f3b096b6d

I'm dropping this now and moving on to other issues.

krusynth commented 11 years ago

@waldoj ^

krusynth commented 10 years ago

Propose moving Milestone to Future. We've got a 90% solution here - but I've yet to see anyone with JSON data aside from what we're getting for DC (which is downstream data anyway).

waldoj commented 10 years ago

> I've yet to see anyone with JSON data

As of 18 months ago, that looked like a distinct possibility. JSON imports was a case of skating to where the puck would be, or so I thought. But right now, I just don't see evidence of it happening. Future it is.