steve8x8 / geotoad

Geocaching query tool written in Ruby

OCM is crashing importing GPX-Files from Geotoad #264

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
[Please include the relevant parts of your command line, if applicable.
Don't send your password though!]
1. Performing standard search for Geocaches
2.
3.

What is the expected output? What do you see instead?
[If possible, include the last ~10--20 lines of verbose output.]
GeoToad produces a normal GPX file, but I was told it is not compatible with 
the GPX format. It seems GeoToad has problems with German special characters 
(Ü, Ä, Ö):

The answer I got from the developer of OCM:

&#xFC is how you put "ü" in HTML. However, this isn't an HTML file; this is a 
GPX file, which is XML. In XML, that's illegal, and the XML tools I use to read 
XML complain when they see it; I can't get around it. Your GPSr might read 
this file OK, but it will probably cause problems; I wouldn't be surprised if 
it ignored all the caches in the file below this one.  You should log a bug 
with GeoToad: they should either turn it into ü, or it should appear as 
"&amp;#xFC" in the GPX file. 

What version of the product are you using? On what operating system?
[Did you check you're using the latest version?]
GeoToad 3.16.5 in 

Please provide any additional information below.
For testing purposes I have replaced all &#xFC with &amp;#xFC, and it works.
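
The OCM developer's reply names two ways out: turn the reference into the character itself, or escape the ampersand so the reference survives as literal text. A minimal Ruby sketch of both follows. The helper names are hypothetical and this is not GeoToad code; it only illustrates the two options.

```ruby
# Hypothetical helpers for illustration only (not taken from GeoToad).

# Option 1: replace numeric character references with the characters they name.
# The trailing semicolon is optional here so that a bare "&#xFC" is caught too.
# Caveat: without the semicolon, a hex reference directly followed by another
# hex digit (0-9, a-f) is ambiguous and will be read as one longer number.
def decode_numeric_refs(text)
  text.gsub(/&#(?:x([0-9A-Fa-f]+)|([0-9]+));?/) do
    hex = Regexp.last_match(1)
    dec = Regexp.last_match(2)
    code = hex ? hex.hex : dec.to_i
    [code].pack('U') # UTF-8 encoding of the code point
  end
end

# Option 2: escape every ampersand, so "&#xFC" becomes the literal text "&amp;#xFC".
# Caveat: text that is already escaped ("&amp;") gets escaped a second time.
def escape_ampersands(text)
  text.gsub('&', '&amp;')
end

puts decode_numeric_refs('Gr&#xFC;n') # prints "Grün"
puts escape_ampersands('Gr&#xFCn')    # prints "Gr&amp;#xFCn"
```

Option 2 is what the search-and-replace workaround does; it keeps the file parseable but leaves the reference visible as text.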

Original issue reported on code.google.com by GCNugget...@gmail.com on 24 Mar 2013 at 8:48

GoogleCodeExporter commented 9 years ago
I'm curious what a literal "&#xFC" would look like on a GPSr's display... (OCM 
seems to be the only application that has problems with GPX files from GeoToad, 
which worries me a bit.)
I suspect, though, that there's a difference between the description part and 
the log comments. Do umlaut characters in both parts result in OCM crashes?

Original comment by Steve8x8 on 24 Mar 2013 at 7:27

GoogleCodeExporter commented 9 years ago
(Another quick Q:) Are you using Ruby 1.8 or 1.9(+)?

Original comment by Steve8x8 on 24 Mar 2013 at 7:29

GoogleCodeExporter commented 9 years ago
I'm using Ruby 1.8 (Debian Lenny). I never had problems with GeoToad and 
OCM until the last few weeks. I've tried to replace all ü with &#xFC in my 
GPX files via a shell script, but that does not solve the problem for all of 
them. As far as I can see, there are no problems with umlaut characters in 
the text (äöü).

Original comment by GCNugget...@gmail.com on 25 Mar 2013 at 6:40

GoogleCodeExporter commented 9 years ago
Yet another remark.

cache.xsd from GroundSpeak (unchanged for quite some time) contains the lines
    <!--  html is a boolean. If html=true the enclosed text contains html -->
    <xs:attribute name="html" form="unqualified" type="xs:string" />
for both the short_description and the long_description element.
(Of course, this is not what I'd call "well documented"...)
In the past, they had been set to True in PQ results - this holds for a sample 
I received back in 2010. Nowadays, this seems to have changed. (I'm curious 
when this happened. Could this be related to the "Unicode" transition a few 
weeks ago?)

So to me this reads "if html=true do not rely on the contents to conform to 
XML".

Log entries are a different story. There's no "html" attribute but an "encoded" 
one, and while there's no explanation, the line looks familiar:
    <xs:attribute name="encoded" form="unqualified" type="xs:string" />
For "historical reasons" (note the lame excuse) this had been set to "False" in 
versions as old as 3.7.5 (September 2004, the oldest one I could find). This 
might indeed be wrong *if* the context is the same. No real-world device seems 
to care.

So in short, GPX files as defined by GroundSpeak allow contents to come in 
HTML, and they actually did (at least in the description part). If OCM can only 
handle html=False, that's a pity, to say the least.

I'm not convinced GeoToad would convert to html=False in a foreseeable future 
(but I'm short-sighted), and certainly not with Ruby 1.8 still supported. 
Windows' somewhat peculiar support for UTF-8 adds to the situation.

As I'm rather unhappy with the current situation, I'd like to leave this bug 
open to collect more information:
What have GPX files (and the corresponding xsd) looked like over the past 13 
years? When did the "groundspeak:*_description html" and "groundspeak:text 
encoded" values actually change? Has anyone kept old PQs, or individual cache 
GPXes?

Original comment by Steve8x8 on 25 Mar 2013 at 7:00

GoogleCodeExporter commented 9 years ago
Hm. Citing 
http://richesse-gps.googlecode.com/svn/branches/2.0/richesseGPS/files/gpx.cpp:

struct CLogEntry {
    SYSTEMTIME Date;
    CString Text;
    BOOL Encoded;           // TRUE if HTML
    CString Type;           // TODO: use enum
    CString Finder;

    CLogEntry() {
        memset(&Date, 0, sizeof(Date));
        Encoded = TRUE; 
    }
};

This seems to support my theory that HTML is allowed in log entries as well - 
although I still have to see some proper documentation if there's any beyond 
GS's xsd.
Next upload to my branch will
- use encoded="False" for empty or text-only logs (like the fake "info" one)
- set encoded="True" for "real" HTML logs
if ongoing tests are successful.
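
The plan above could be sketched roughly as follows. This is an assumption-laden illustration, not the code from the branch: the helper name is hypothetical, and "looks like HTML" is approximated by a simple tag-detecting regular expression.

```ruby
# Hypothetical helper; the tag-detection heuristic is an assumption,
# not GeoToad's actual logic.
def log_text_element(text)
  # Treat the log as HTML if it contains something that looks like a tag.
  looks_like_html = text.match?(/<\/?[A-Za-z][^>]*>/)
  # Escape XML metacharacters so the element content stays well-formed.
  escaped = text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
  flag = looks_like_html ? 'True' : 'False'
  %(<groundspeak:text encoded="#{flag}">#{escaped}</groundspeak:text>)
end

puts log_text_element('TFTC')
puts log_text_element('Nice cache<br />TFTC')
```

Empty and text-only logs come out with encoded="False"; anything containing a tag gets encoded="True".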

I'm afraid e.g. Garmin's parser implementation (in the Oregon x00 series) 
doesn't care too much (it has worse problems). Seems to be a tricky business to 
validate GS GPX (gpsbabel does a nice job but obviously doesn't catch 
everything)...

Original comment by Steve8x8 on 26 Mar 2013 at 10:08

GoogleCodeExporter commented 9 years ago
c:geo also seems to be affected by this issue. It cannot import the same 
GPX files; the error is something like "wrong format for GPX v1.1".

Original comment by GCNugget...@gmail.com on 28 Mar 2013 at 10:00

GoogleCodeExporter commented 9 years ago
I'm surprised - I'm a c:geo user myself, and never had issues. Actually I am 
currently doing a GPX import as part of a holidays preparation...
Can you reproduce the behaviour with the GPX output of a random single-WID 
query? If not, it's probably time to bisect your problematic GPX file, and 
isolate the problematic part.

Another observation: Log entries with line feeds instead of <br /> tags will 
not be displayed properly on a Garmin x00, independent of the encoded=... 
setting. More investigation required.
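
One conceivable workaround for the line-feed observation, untested on any device, is to turn line breaks in log text into <br /> tags before writing the GPX. The helper name below is hypothetical.

```ruby
# Hypothetical helper: convert CRLF, CR, or LF line breaks into "<br />" tags.
# The newline is kept after the tag so the GPX source stays readable.
def newlines_to_br(text)
  text.gsub(/\r\n|\r|\n/, "<br />\n")
end

puts newlines_to_br("First line\nSecond line")
```

Whether a given device then renders the breaks correctly still has to be checked on the hardware itself.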

Original comment by Steve8x8 on 28 Mar 2013 at 5:34

GoogleCodeExporter commented 9 years ago
If it helps, I have attached a file that causes trouble.

Original comment by GCNugget...@gmail.com on 1 Apr 2013 at 8:57

GoogleCodeExporter commented 9 years ago
buchholz.gpx: GPX: XML parse error at line 7212 of 'buchholz.gpx' : reference 
to invalid character number
      <groundspeak:text encoded="False">Heute wieder zur FB nach Hamburg. Mit Claudia die ich gestern mit dem Cachevirus infiziert habe��. Sie hat hier gleich zugegriffen. 2x gecacht und 2x die gleiche Dose.</groundspeak:text>

See Issue 262 - could this be the problem?
What happens if you open the file in a text editor, and remove the "��" 
Emoji stuff (which is supposed to be a "grinning face" according to my Unicode 
tables)?
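
gpsbabel stops at the first "reference to invalid character number". A small Ruby sketch can list all of them with their line numbers. It assumes the XML 1.0 rule for allowed characters (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF); the function names are hypothetical, and the low surrogate &#xDE00; in the sample is invented for illustration.

```ruby
# True if the code point may appear in an XML 1.0 document
# (the "Char" production of the XML 1.0 specification).
def xml_char?(code)
  code == 0x9 || code == 0xA || code == 0xD ||
    (0x20..0xD7FF).cover?(code) ||
    (0xE000..0xFFFD).cover?(code) ||
    (0x10000..0x10FFFF).cover?(code)
end

# Returns [line_number, reference] pairs for every numeric character
# reference that points at a code point XML 1.0 does not allow.
def invalid_char_refs(gpx_text)
  hits = []
  gpx_text.each_line.with_index(1) do |line, number|
    line.scan(/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/) do |hex, dec|
      code = hex ? hex.hex : dec.to_i
      hits << [number, Regexp.last_match(0)] unless xml_char?(code)
    end
  end
  hits
end

sample = "<a>fine: &#xFC;</a>\n<b>broken: &#xD83D;&#xDE00;</b>\n"
invalid_char_refs(sample).each do |number, ref|
  puts "line #{number}: #{ref}"
end
```

For a real file the text could be read with File.read first; surrogate references (hex or decimal) show up in the list because U+D800 to U+DFFF are excluded from the allowed range.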

Original comment by Steve8x8 on 18 Apr 2013 at 1:20

GoogleCodeExporter commented 9 years ago
Nothing changed. c:geo gives the same error, and OCM gives some errors about 
"Referenced character was not allowed in XML." But I can't find it in gedit. 
Maybe OCM counts the characters in a different way.

Original comment by GCNugget...@gmail.com on 21 Apr 2013 at 12:15

GoogleCodeExporter commented 9 years ago
Okay... gpsbabel only chokes on the first problem with the input file, but 
there were two occurrences of "Emoji" characters in the GPX file.
I'm uploading a copy without those. Does it still refuse to be loaded by c:geo, 
and by OCM?

Original comment by Steve8x8 on 23 Apr 2013 at 7:03

GoogleCodeExporter commented 9 years ago
I have browsed the OCM sources for a place where html="True"/"False" would be 
parsed, but didn't find any. The same goes for text encoded="..."

Although there's only a package available for Ubuntu, I managed to install ocm 
1.0.13 on my Debian Wheezy laptop (merely by brute force), and imported 
buchholz2.gpx - without any problem.

This leads me to the assumption that "Emoji" has been the culprit here as well, 
like in Issue 262 (and, subsequently, 266).
An automatic replacement of all occurrences of "&#" with "&amp;#" would have 
masked UTF-16 surrogates as well (both the &#xD83?;&#xD???; ones and their 
decimal equivalents, starting with &#55???;), sure.
Please try to boil down the problem to a few individual cache IDs, and point me 
to them (or send the corresponding GPXes, again).
gpsbabel has proven to properly detect remaining surrogates, so it's probably a 
good idea to check files with gpsbabel first, and iterate over the errors it 
flags.
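
Once the surrogate references are located, one conceivable repair is to merge each high/low pair into a single reference to the real code point, which XML does allow. The sketch below shows the arithmetic; it is not necessarily what GeoToad does, the function names are hypothetical, and the low surrogate values in the example are invented for illustration.

```ruby
# Numeric value of a reference such as "&#xD83D;" or "&#55357;".
def ref_to_code(ref)
  digits = ref[2...-1] # drop the leading "&#" and the trailing ";"
  digits.start_with?('x') ? digits[1..-1].hex : digits.to_i
end

# Merge adjacent high+low surrogate references into one reference to the
# code point they encode in UTF-16. Everything else is left untouched.
def merge_surrogate_refs(text)
  ref = /&#(?:x[0-9A-Fa-f]+|[0-9]+);/
  text.gsub(/(?:#{ref})+/) do |run|
    refs = run.scan(ref)
    out = []
    i = 0
    while i < refs.length
      hi = ref_to_code(refs[i])
      lo = i + 1 < refs.length ? ref_to_code(refs[i + 1]) : nil
      if (0xD800..0xDBFF).cover?(hi) && lo && (0xDC00..0xDFFF).cover?(lo)
        code = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        out << format('&#x%X;', code)
        i += 2
      else
        out << refs[i]
        i += 1
      end
    end
    out.join
  end
end

puts merge_surrogate_refs('infiziert habe&#xD83D;&#xDE00;.') # prints "infiziert habe&#x1F600;."
```

A lone surrogate reference has no valid meaning and is left as it is, so a check for invalid references is still needed afterwards.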

Original comment by Steve8x8 on 23 Apr 2013 at 11:55

GoogleCodeExporter commented 9 years ago
Updated geotoad to Version 3.16.6 and there are no problems any more.

Thank You.

Original comment by GCNugget...@gmail.com on 25 Apr 2013 at 9:11

GoogleCodeExporter commented 9 years ago
Assuming that the whole issue was triggered by that UTF-16 stuff that had been 
introduced in early March into GS pages, it's probably time to merge this into 
Issue 262.

Original comment by Steve8x8 on 26 Apr 2013 at 7:13

GoogleCodeExporter commented 9 years ago
I'm the developer of OCM.

"
cache.xsd from GroundSpeak (unchanged for quite some time) contains the lines
    <!--  html is a boolean. If html=true the enclosed text contains html -->
    <xs:attribute name="html" form="unqualified" type="xs:string" />
for both the short_description and the long_description element."

What the attribute means is that the text value of the attribute should be 
interpreted as HTML, not that you can put arbitrary HTML between 
<short_description>...</short_description>. You can simply use a <![CDATA[...]]> 
section to prevent XML parser issues with the actual content; you don't actually 
have to double-encode things. N.B. Groundspeak goes the double-encoding route 
in their GPX files, while OCM uses CDATA in its export; both are valid XML.

i.e.
<short_description html="false"><![CDATA[Hi <br> How are you]]></short_description> 
means a GPX renderer should display the text "Hi <br> How are you" exactly as 
is, verbatim, without turning the <br> into a line break.

<short_description html="true"><![CDATA[Hi <br> How are you]]></short_description> 
means you should display the text "Hi [new line] How are you".

<short_description html="true">Hi <br> How are you</short_description> is 
invalid, since while you can have a <br> without a closing tag in HTML, it's 
not valid in XML.
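
The two valid routes described here, CDATA and entity escaping, can be sketched in Ruby as follows. The helper names are hypothetical; the only subtlety is that "]]>" cannot appear inside a CDATA section and has to be split across two sections.

```ruby
# Wrap text in a CDATA section. A literal "]]>" inside the text would end the
# section early, so it is split: "]]" closes one section, ">" opens the next.
def cdata(text)
  '<![CDATA[' + text.gsub(']]>', ']]]]><![CDATA[>') + ']]>'
end

# The alternative route: escape the XML metacharacters instead.
def escape_xml(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

html = 'Hi <br> How are you'
puts %(<short_description html="true">#{cdata(html)}</short_description>)
puts %(<short_description html="true">#{escape_xml(html)}</short_description>)
```

Both lines are well-formed XML, and an XML parser hands the same element text, "Hi <br> How are you", to the application in either case.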

OCM doesn't bother looking at the flag, because html has always been true 
historically, so it simply takes the contents and renders them in an internal 
web browser. Groundspeak doesn't use HTML logs, but some of the opencaching 
sites do in their GPX files.  OCM doesn't need to count characters, as XML 
parsing is built into C# and Java.

Original comment by kmcamp...@gmail.com on 8 May 2013 at 1:34

GoogleCodeExporter commented 9 years ago
er...should say "text value of the element", not attribute

Original comment by kmcamp...@gmail.com on 8 May 2013 at 1:37

GoogleCodeExporter commented 9 years ago
Thanks for the lesson, although I'm not sure what I should have learned now?
Apparently, the issue was caused not by improperly reading xsd files, but by 
the introduction of Unicode (UTF-16 surrogates), and has vanished since.

Original comment by Steve8x8 on 8 May 2013 at 11:42

GoogleCodeExporter commented 9 years ago
I wasn't trying to criticize you. I was just explaining why I sent this 
user to you, and justifying myself, since you seemed to imply that I was 
interpreting the GPX incorrectly. 

The GPS wouldn't render "&#xD83D;"; it would have been turned into "�" by the 
device after parsing. This is what I meant by double encoding. 

You can skip encoding altogether if you use the CDATA marker instead, which I 
find is just easier. It wouldn't matter what gc.com does to their HTML, because 
it becomes transparent to the parser then.

i.e., instead of <short_description><br/>l&#xD83D;</short_description>

you can do <short_description><![CDATA[<br/>&#xD83D;]]></short_description>; the 
parser will treat everything between <![CDATA[ and ]]> as element text. 

Original comment by kmcamp...@gmail.com on 8 May 2013 at 2:03

GoogleCodeExporter commented 9 years ago
Anyway, issue solved, so no big deal

Original comment by kmcamp...@gmail.com on 8 May 2013 at 2:14