placemark / togeojson

convert KML, TCX, and GPX to GeoJSON, without the fuss
https://placemark.github.io/togeojson/
BSD 2-Clause "Simplified" License
410 stars 67 forks source link

Use htmlparser2 for tolerant & webworker-friendly parsing #55

Closed davetapley closed 2 years ago

davetapley commented 3 years ago

This took some 🔍 so reporting here in case it catches others out:

I've been using togeojson successfully until we happened upon a KML which would always return zero features, e.g:

const text = await response.text();
const xml = new window.DOMParser().parseFromString(text, "text/xml");
const json = kml(xml);
console.log(url, json.features.length)

//  file.kml 0

More confusingly running in node or the CLI would find them correctly, e.g:

$ npx @tmcw/togeojson-cli file.kml | jq -e '.features | length'
18

I tracked it down to the file.kml not having this in its kml tag:

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

I'm not sure if that is supposed to be there or why it differs between the browser and xmldom? 🤔

tmcw commented 3 years ago

Can you post the file itself so I can help debug this?

davetapley commented 3 years ago

@tmcw sure! It was a monster, so I've removed all but one <Placemark> so I could fit in a gist, but you can still see the issue:

huc6-no-xsi.kml

huc6-with-xsi.kml


Here's a Codepen where you can see 0 vs 1 feature in browser: https://codepen.io/davetapley/pen/JjJRdBv?editors=0011

But with CLI both return the feature:

$ curl -s https://gist.githubusercontent.com/davetapley/c426f7ad68027bc292f5a03837ab50e4/raw/f5974f2e47c3af8c39b02116a160759624b0c0cd/huc6-no-xsi.kml | npx -q @tmcw/togeojson-cli | jq -e '.features | length'
1

$ curl -s https://gist.githubusercontent.com/davetapley/fa6f8e2e96b03f2911c27f20f4123107/raw/7f0b4295e367be50309be47e70f1fda3845b91bd/huc6-with-xsi.kml | npx -q @tmcw/togeojson-cli | jq -e '.features | length'
1
tmcw commented 2 years ago

Okay, I've encountered this again in the wild, and have a plan. Basically the options are:

Hence: renaming. Going to rewrite togeojson to use htmlparser2 (which can parse XML).

tmcw commented 2 years ago

Okay, I implemented this in #56 but I think that's the wrong solution. htmlparser2 is surprisingly a larger dependency than @xmldom/xmldom, and requires a lot more shifted-around code.

I think the best path here is actually just to recommend using @xmldom/xmldom as a parser, even on the web.