valentinedwv / ioostech

Automatically exported from code.google.com/p/ioostech

GetObservation SWE blockSeparator encoding #34

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
The Python 2.6.6 XML minidom library has issues when reading the newline 
character, '\n', as a block separator in SWE Common responses. The minidom 
library converts the '\n' character to a space. NDBC changed their 
responses to use the HTML entity, '&#10;'. Recommend that the template 
(http://code.google.com/p/ioostech/source/browse/trunk/templates/Milestone1.0/GO-Station-SingleFT-timeSeries-MultiSensor.xml) 
be changed to suggest the use of HTML entities.

Original issue reported on code.google.com by mike.gar...@gmail.com on 16 Oct 2012 at 5:32
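
The behaviour described above can be reproduced with a minimal sketch. It assumes Python 3's standard-library minidom rather than the 2.6.6 version from the report, and a bare TextEncoding element without namespaces; the helper name block_separator is made up for illustration. The space comes from XML attribute-value normalization, which applies to a raw line feed typed into an attribute but not to a character reference:

```python
from xml.dom import minidom


def block_separator(xml_text):
    """Return the parsed blockSeparator attribute of the root element."""
    doc = minidom.parseString(xml_text)
    return doc.documentElement.getAttribute("blockSeparator")


# An actual line-feed character typed inside the attribute value.
raw_line_feed = '<TextEncoding blockSeparator="\n"/>'
# The numeric character reference for a line feed.
char_reference = '<TextEncoding blockSeparator="&#10;"/>'
# The two characters backslash and n; XML does not treat this as an escape.
backslash_n = '<TextEncoding blockSeparator="\\n"/>'

print(repr(block_separator(raw_line_feed)))   # ' '
print(repr(block_separator(char_reference)))  # '\n'
print(repr(block_separator(backslash_n)))     # '\\n'
```

Only the character reference survives parsing as a real line feed, which is why switching the response to '&#10;' avoids the problem.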

GoogleCodeExporter commented 8 years ago
Thanks, Mike. That seems reasonable to me. It also seems consistent that a DOM 
library such as the minidom would want HTML entities.

I can update the GetObservation template myself, but are there any comments? 
Does the OGC SWE Common spec say anything about this? Alex, do you know?

Original comment by emilioma...@gmail.com on 16 Oct 2012 at 5:44

GoogleCodeExporter commented 8 years ago
The SWE 2.0 spec gives an example with &#10; as the blockSeparator, so I think it's safe to say that it's legal. I tried to chase down the definition of CharacterString in ISO 19103 to see if they restrict it at all, but it's one of those "open standards" that you have to pay for to view :P

Original comment by sh...@axiomalaska.com on 16 Oct 2012 at 5:53

GoogleCodeExporter commented 8 years ago
HTML Entities can also be used for the tokenSeparator, but a comma or space 
don't pose any problems for the minidom.  Perhaps we should just recommend 
using HTML entities in place of any special values (back-slashed values) in 
these separator attributes for consistency.

Original comment by mike.gar...@gmail.com on 16 Oct 2012 at 5:59

GoogleCodeExporter commented 8 years ago
What is the actual issue?  Reading the value of the blockSeparator attribute?

I'm against telling people they should use the HTML entities as a block or line 
separator.  They should be able to use what they want, like:

<swe2:TextEncoding decimalSeparator="." 
tokenSeparator="&&&!!!)))####\n\n\n\r\r\r\r" 
blockSeparator="ILOVEBLOCKSEPERATORS"/>

On the argument of "\n" not working with the 'minidom' module in Python 2.6.6, 
the HTML block separator will make it incompatible with the 'csv' module, since 
the default line terminator is not configurable and will only recognize '\r' or 
'\n'.  This bit me hard when writing the SWE parser in Python.

From http://docs.python.org/library/csv.html#module-csv:
Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, 
and ignores lineterminator. This behavior may change in the future.

I think we should just leave it to the specifications to define the 
TextEncoding block and leave it to the parsing libraries to handle the spec 
correctly.

Original comment by wilcox.k...@gmail.com on 16 Oct 2012 at 6:04
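
One way around the csv limitation mentioned above is to split on the block separator yourself and hand csv.reader one block at a time, since the reader accepts any iterable of strings. This is a sketch under assumptions: the function name parse_blocks and the '|' separator are invented for illustration, and the token separator must be a single character because csv requires that of its delimiter.

```python
import csv


def parse_blocks(values_text, token_separator=",", block_separator="\n"):
    """Split on the block separator first, then let csv split each block into tokens."""
    blocks = [block.strip() for block in values_text.split(block_separator)]
    blocks = [block for block in blocks if block]  # drop empty leading/trailing blocks
    return list(csv.reader(blocks, delimiter=token_separator))


# Hypothetical data using ';' between tokens and '|' between blocks.
rows = parse_blocks("25.41;10.23;320|25.43;10.23;300", ";", "|")
print(rows)
```

With the default arguments the same function handles the ordinary comma and line-feed case.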

GoogleCodeExporter commented 8 years ago
All we are doing is using an HTML entity to specify the value of the 
separators.  We are not replacing the actual separator value with '&#10;' 
or anything else.  As stated in the issue, the minidom converts '\n' to a 
space for some unknown reason.  This conversion makes proper decoding 
impossible.  I have no control over the minidom, but I do have control over my 
SWE responses.  No one is suggesting that the CSV encodings change.

BTW, let me know how that "&&&!!!)))####\n\n\n\r\r\r\r" separator works for 
you. :)

Original comment by mike.gar...@gmail.com on 16 Oct 2012 at 6:16

GoogleCodeExporter commented 8 years ago
Maybe I'm not understanding the problem...

If you change the blockSeparator value to '&#10;', you will have to separate 
the blocks with '&#10;' in the CSV or there will be a disconnect.

Original comment by wilcox.k...@gmail.com on 16 Oct 2012 at 6:28

GoogleCodeExporter commented 8 years ago
Any good parser will resolve all XML/HTML entities into their real values 
before using them.  The blocks are not separated by "\n" either.  It's just 
another way of stating a character that we cannot print in text format.

Original comment by mike.gar...@gmail.com on 16 Oct 2012 at 6:43

GoogleCodeExporter commented 8 years ago
So I see a couple of issues here.
1. Matching blockSeparator and "CSV" (<swe2:values>) value. Or as Kyle put it:
> If you change the blockSeparator value to '&#10;', you will have to 
> separate the blocks with '&#10;' in the CSV or there will be a disconnect.

Hmm, both this and Mike's response seem logical to me ...

2. This (among others) is the reason why we chose to *hard-wire* these settings 
rather than let implementations truly do whatever they want! We discussed this 
at the February workshop, though some may have easily missed it in the constant 
back and forth. We've had for months this prominent inline comment on the 
GetObs template:
  SWE encoding and data values
  swe:encoding *must* be always specified exactly as described below,
to avoid the need to have fully general parsers that interpret
swe:TextEncoding. That is, parsers may hard-code this particular
swe:TextEncoding specification.
<swe2:encoding>
  <swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="\n"/>
</swe2:encoding>

Meaning: at least for Milestone 1.0, all SOS servers must use that 
swe2:TextEncoding configuration! We want to make writing clients easier, even 
simple, hard-wired ones. If changing blockSeparator="\n" to blockSeparator="&#10;" 
makes this easier w/o introducing other problems, and w/o actually using the 
string "&#10;" to separate blocks on the swe2:value block, that's fine with me.

Original comment by emilioma...@gmail.com on 17 Oct 2012 at 12:12
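
A hard-wired client of the kind described above could still check that a response really uses the expected encoding before trusting it. The following is a sketch, assuming the entity form of the block separator and the SWE Common 2.0 namespace URI; the function name encoding_mismatches and both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

SWE2_NS = "http://www.opengis.net/swe/2.0"
# Values a hard-wired client expects after XML parsing (a real line feed).
EXPECTED = {"decimalSeparator": ".", "tokenSeparator": ",", "blockSeparator": "\n"}


def encoding_mismatches(xml_text):
    """Return {attribute: (expected, found)} for every attribute that differs."""
    root = ET.fromstring(xml_text)
    text_encoding = next(root.iter("{%s}TextEncoding" % SWE2_NS), None)
    if text_encoding is None:
        raise ValueError("no swe2:TextEncoding element found")
    return {
        name: (expected, text_encoding.get(name))
        for name, expected in EXPECTED.items()
        if text_encoding.get(name) != expected
    }


TEMPLATE = (
    '<swe2:encoding xmlns:swe2="http://www.opengis.net/swe/2.0">'
    '<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>'
    '</swe2:encoding>'
)
NONSTANDARD = TEMPLATE.replace('tokenSeparator=","', 'tokenSeparator=";"')

print(encoding_mismatches(TEMPLATE))
print(encoding_mismatches(NONSTANDARD))
```

An empty result means the client can safely fall back on its hard-coded parsing; anything else signals that a general parser is needed.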

GoogleCodeExporter commented 8 years ago

Original comment by emilioma...@gmail.com on 17 Oct 2012 at 7:24

GoogleCodeExporter commented 8 years ago
[ERIC] I think Kyle's point, and I agree, is that you cannot use 
blockSeparator='&#10;' without also using it in the swe2:value block.
AFAICT this is really an XML spec issue, not SOS.  

Is it an XML issue? It seems to me more like an SOS spec issue. In fact, I've 
found references to exactly the use of blockSeparator='&#10;' together with 
the expected swe2:value block using a newline (\n) character:
http://external.opengis.org/twiki_public/HydrologyDWG/GwIeSOSResultFilter
And more clearly, see the sample code listing in p. 89 of OGC SWE Common Data 
Model Encoding Standard, OGC 08-094 
(http://portal.opengeospatial.org/files/?artifact_id=38475&version=2&format=pdf)

So it would seem Mike isn't off base. Has anyone found a more definitive spec 
description?

> What XML lib did Kyle use for his SOS Parser?
[KYLE] I'm using the ElementTree API:
https://github.com/asascience-open/pyoos/blob/master/pyoos/utils/etree.py

Yeah, if you're parsing XML in Python the ElementTree API or something truly 
XML-oriented is the way to go. The minidom is probably not the best choice.

Regardless, the issue of parsing library behavior is different from the main 
issue of what's compliant with specs and a good practice.

Original comment by emilioma...@gmail.com on 18 Oct 2012 at 6:01

GoogleCodeExporter commented 8 years ago
Although most ISO standards are not free, it is possible to look for their 
drafts. So, I found the free draft of ISO 19103 somewhere in Australia 
(why am I not surprised?) at 
https://www.seegrid.csiro.au/wiki/pub/AppSchemas/SchemaFormalization/19103_DIS20010712.pdf. 
The draft may be different from the final version but I don't think 
that this is the case. So, that's what it says about CharacterString:

"A CharacterString is an arbitrary-length sequence of characters including 
accents and special characters from repertoire of one of the adopted character 
sets:
- ISO/IEC 10646-1: Universal Character Set (UCS) repertoire implementation 
level 2 also called the Base Multilingual plane of ISO 10646 (i.e. Including 
Latin, Greek, Cyrillic, Arabic, Chinese, Japanese etc.)
- ISO/IEC 10646-2: Universal Character Set repertoire UCS-4, the full 
repertoire of ISO 10646.

EXAMPLE "Ærlige Kåre så snø for første gang."

The maximum length of a CharacterString is dependent on encapsulation and 
usage. A language tag shall be provided for identification of the language of 
string values.
For an implementation mapping of a CharacterString, the handling of the 
following four aspects need to be decided:
1) Representation of value
2) Representation of character set
3) Representation of encoding
4) Representation of language
This can be handled for instance by choosing ISO/IEC 10646-1."

So, practically ANY character is allowed in CharacterString, especially 
characters that are part of "C0 Control and Basic Latin Set" such as "&" (ISO 
10646 code = 0026), "#" (ISO 10646 code = 0023), numbers, etc.

Don't know if it makes your life any easier...

Original comment by abir...@gmail.com on 18 Oct 2012 at 7:03

GoogleCodeExporter commented 8 years ago
from Emilio's comment of Oct 18.

[ERIC] I think Kyle's point, and I agree, is that you cannot use 
blockSeparator='&#10;' without also using it in the swe2:value block.

Has anyone tested this in code?  Is this a true statement? Does ElementTree 
work?  Minidom?

from Emilio's comment of Oct 16

So I see a couple of issues here.
1. Matching blockSeparator and "CSV" (<swe2:values>) value. Or as Kyle put it:
> If you change the blockSeparator value to '&#10;', you will have to 
> separate the blocks with '&#10;' in the CSV or there will be a disconnect.

Just to be clear, when you say "in the CSV" you mean in the 
<swe2:values></swe2:values> block in the same GO response file.  You do NOT 
mean, in the CSV file that would have resulted from issuing the same SOS 
request but changing the output format to CSV and not O&M1.0.  Correct?

It seems like our options are:
1. Leave the blockSeparator unconstrained and open to interpretation by the 
data provider.  The risk is that minidom based parsers (and presumably some 
others) may not be able to parse the output.
2. Tightly constrain the blockSeparator and tokenSeparator (eliminate \n as a 
possibility) and make certain that the chosen set of separators works with the 
XML parsing libraries we know of and use regularly.  There is still a risk of 
choosing a separator that is incompatible with a library.  There is also the 
risk (or certainty) that Kyle will not be able to use

 http://docs.python.org/library/csv.html#module-csv

in his parsing library and will have to find another solution.

If someone can verify the statement at the top of this message and provide a 
comment or clear statement of the guidance then maybe we can put this to bed. 

Please vote for option 1 or 2 or provide another recommendation.

Original comment by dpsnowde...@gmail.com on 23 Oct 2012 at 9:43
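
The question of whether this has been tested in code can be approached with a small sketch. It assumes Python 3's standard library and a stripped-down, namespace-free stand-in for the GetObservation result (the element names result, TextEncoding and values are simplifications, not the real template). Both libraries resolve the character reference to a line feed, so the separator matches newline-separated values and also values that spell the separator as a character reference:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Entity in the attribute, real line feeds between blocks in the values.
DOC_NEWLINES = (
    '<result>'
    '<TextEncoding tokenSeparator="," blockSeparator="&#10;"/>'
    '<values>25.41,10.23,320\n25.43,10.23,300\n25.39,11.51,310</values>'
    '</result>'
)
# Same data, but the values also spell the separator as a character reference.
DOC_REFERENCES = DOC_NEWLINES.replace("\n", "&#10;")


def blocks_with_elementtree(xml_text):
    root = ET.fromstring(xml_text)
    separator = root.find("TextEncoding").get("blockSeparator")
    return root.find("values").text.split(separator)


def blocks_with_minidom(xml_text):
    doc = minidom.parseString(xml_text)
    encoding = doc.getElementsByTagName("TextEncoding")[0]
    separator = encoding.getAttribute("blockSeparator")
    values = doc.getElementsByTagName("values")[0]
    # Join all text nodes in case the parser delivered the text in pieces.
    text = "".join(
        node.data for node in values.childNodes if node.nodeType == node.TEXT_NODE
    )
    return text.split(separator)


for document in (DOC_NEWLINES, DOC_REFERENCES):
    print(blocks_with_elementtree(document))
    print(blocks_with_minidom(document))
```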

GoogleCodeExporter commented 8 years ago
Page 107 of the SWE 2.0 spec (08-094r1) says:

  Special characters such as carriage returns (CR) or line feeds (LF) can be used as
  block or token separators by using XML entities. For example new line characters
  are often used as block separators to cleanly separate blocks of values on
  successive lines:

  <swe:TextEncoding tokenSeparator=";" blockSeparator="&#10;"/>

  This corresponds to the following data block format:

  25.41;10.23;320
  25.43;10.23;300
  25.39;11.51;310

  This is compatible with the CSV format and is often used for compatibility
  with other software.

So...the spec seems to be pretty clear about using XML entities to specify 
special characters. Kyle, can you run the response through an unencoder before 
feeding it to module-csv?

Original comment by sh...@axiomalaska.com on 23 Oct 2012 at 11:26

GoogleCodeExporter commented 8 years ago
I don't think we need to make a recommendation here.  Whatever is in the 
blockSeparator is what should actually be separating the blocks, and that is 
covered in the SWE spec.

If the blockSeparator is "&#10;" then the data should be separated by "&#10;" 
and not "\n".

As long as they are consistent the parsing is not an issue.

Original comment by wilcox.k...@gmail.com on 24 Oct 2012 at 3:14

GoogleCodeExporter commented 8 years ago
Hi Kyle, and congrats. Hope all is well.

This stuff gets confusing since we're talking about both encoded or escaped 
representations of separators and string literals. What I think you're saying 
is that parsers should consider whatever is specified as a blockSeparator (or 
any other separator in swe:TextEncoding) as a string literal and look for 
exactly that literal when parsing the swe:values. That's not what I'm seeing in 
the SWE 2.0 spec; the spec indicates that XML entities specified as separators 
in swe:TextEncoding should be resolved to the target ASCII character, and that 
this resolved string should be used when parsing swe:values. The spec says 
nothing about backslash-escaped strings like \n.

That's not an endorsement of the approach, but that's how I'm reading the SWE 
2.0 spec (section 8.5.2 pg 107 and all examples in the doc). Maybe I'm missing 
something.

Original comment by sh...@axiomalaska.com on 24 Oct 2012 at 5:11
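
Since the spec is silent on backslash-escaped strings, a lenient client might choose to accept them anyway. The sketch below is one possible approach, not something the spec or this thread prescribes: the function name effective_separator and the set of recognized escapes are assumptions. It maps the two-character spellings to control characters and leaves values already resolved by the XML parser untouched.

```python
# Two-character spellings a lenient client might accept (an assumption, not spec behaviour).
BACKSLASH_ESCAPES = {"\\n": "\n", "\\r": "\r", "\\t": "\t"}


def effective_separator(attribute_value):
    """Map backslash spellings to control characters; pass anything else through."""
    result = attribute_value
    for spelled, character in BACKSLASH_ESCAPES.items():
        result = result.replace(spelled, character)
    return result


print(repr(effective_separator("\\n")))  # attribute written as backslash + n
print(repr(effective_separator("\n")))   # attribute written as &#10;, already resolved
print(repr(effective_separator(",")))    # ordinary separator
```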

GoogleCodeExporter commented 8 years ago
Adding some discussion from the mailing list: 

Eric Bridger:
> I've changed my mind about this issue.
> I ran some tests using the Perl binding to C LibXML; lxml is the Python binding to
> the same library.
> Used: GO-Station-SingleFT-timeSeries-MultiSensor.xml as the starting point.

> BOTH blockSep="\n" and blockSep="&#10;" work on swe2:values where \n is the block
> separator.
> Interestingly it also works where swe2:values uses &#10; as the block separator.
> AND with blockSeparator="\n" and &#10; in swe2:values

> I can share this code and example files if wanted.

Derrick Snowden:
> Not sure I remember what your first impression was.  Can you state what your
> recommendation is based on your test?

Eric Bridger:
> Basically that blockSeparator="&#10;" should work fine, both with minidom and other
> XML parsers.

> The <swe2:values> string can look like what Shane found in the SWE 2.0 spec
> examples, note the rows have newlines at the end of each line.
> AFAICT blockSeparator="\n" will also work for most parsers, but not python minidom,
> as Mike discovered.

Original comment by sh...@axiomalaska.com on 26 Oct 2012 at 9:29

GoogleCodeExporter commented 8 years ago
After some thinking, I do agree with Kyle that we shouldn't make a rule 
specifying which separators to use; there's no real reason for us to go outside 
of the spec here, and if we don't have to we shouldn't. Clients should 
preprocess documents if they need to convert to something friendly to their 
parsing library.

BUT, also in keeping with the spec, we should use encoded XML entities (like 
&#10;) when specifying special characters like linefeeds in the separators, instead 
of using backslash-escaped characters.

Side note: since the spec says that encoded XML entities in the separators 
should be unencoded before parsing the swe:values, we can't use entities 
representing characters that are illegal in XML documents like &, <, >, etc, 
since this would require putting those illegal values directly in swe:values. 
At least that's how I interpret it.

We need to push to resolve this issue as it's one of the things holding up the 
completion of Milestone 1.0.

Original comment by sh...@axiomalaska.com on 26 Oct 2012 at 9:51
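
The side note about markup characters can be illustrated with a sketch, assuming Python 3's ElementTree and a simplified namespace-free document (the element names and sample data are made up). A raw '&' in the values is rejected by the parser, while the same character written as an entity parses and splits. Whether the escaped form is acceptable under the spec is not settled in this thread.

```python
import xml.etree.ElementTree as ET

# Separator declared as '&'; the values escape it as an entity as well.
ESCAPED = (
    '<result><TextEncoding blockSeparator="&amp;"/>'
    '<values>1,2&amp;3,4</values></result>'
)
# Same separator, but the values contain a raw '&', which is not well-formed XML.
RAW = (
    '<result><TextEncoding blockSeparator="&amp;"/>'
    '<values>1,2&3,4</values></result>'
)


def is_well_formed(xml_text):
    """Return True if the document parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def split_values(xml_text):
    root = ET.fromstring(xml_text)
    separator = root.find("TextEncoding").get("blockSeparator")
    return root.find("values").text.split(separator)


print(is_well_formed(RAW))
print(split_values(ESCAPED))
```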

GoogleCodeExporter commented 8 years ago
+1. I think the confusion arose because Kyle said if you use blockSeparator="&#10;" 
then the swe:values element must also contain literal '&#10;'s, unless I misunderstood him.
So I agree, no rule specifying what separator to use, but the template examples 
should use "&#10;".

Original comment by ebrid...@gmri.org on 27 Oct 2012 at 3:20

GoogleCodeExporter commented 8 years ago
Suggest we close this issue.  Unless there is a dissenting opinion I'll close 
it on 1/7/2013.

Summary:
no rule specifying what separator to use, but the template examples should use "&#10;"
BUT, also in keeping with the spec, we should use encoded XML entities (like 
&#10;) when specifying special characters like linefeeds in the separators, instead 
of using backslash-escaped characters.

Side note: since the spec says that encoded XML entities in the separators 
should be unencoded before parsing the swe:values, we can't use entities 
representing characters that are illegal in XML documents like &, <, >, etc, 
since this would require putting those illegal values directly in swe:values. 
At least that's how I interpret it 

Original comment by dpsnowde...@gmail.com on 3 Jan 2013 at 5:08

GoogleCodeExporter commented 8 years ago
Sorry to be a contrarian, but I'd like to tweak Derrick's summary (which is the 
consensus opinion so far):
no rule specifying what separator to use, but the template examples should use "&#10;"
I'd add this qualifier: "and this convention is strongly recommended"

I think my difference in perspective is that I'm not thinking exclusively of 
hard-core developers and programmers (or pre-existing client libraries) as the 
only audience. I'd like to see our responses be as friendly as possible to 
sloppy programmers who want to maximize hard-coding. I like describing the 
swe2:values block as basically "conventional CSV", except om:result must also 
be parsed in order to get access at the header info and block interpretation.

I see this as being little different than saying that we strongly recommend the 
use of decimalSeparator="." and tokenSeparator="," in swe2:TextEncoding:
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>
eg, do we seriously want to allow IOOS SOS servers to choose to use, say, 
decimalSeparator=","?? The hispanic in me says that'd be pretty cool (both , 
and . are in common usage in Spanish-speaking countries, depending on the 
degree of european vs. american influence), but the pragmatic programmer in me 
says that'd be f**d up, even if fully legal and standard-compliant.

Finally, just to make sure the recommendation being made is 100% clear, are we 
saying that:
- IF you use blockSeparator="&#10;" in swe2:TextEncoding:
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>
- THEN also use "&#10;" (not \n) as the block separator in swe2:values
??

Based on Shane's Comment 13 and other comments, is there anything wrong with 
saying we strongly recommend using blockSeparator="&#10;" in swe2:TextEncoding, 
and \n as the block separator in swe2:values?

Original comment by emilioma...@gmail.com on 7 Jan 2013 at 1:59

GoogleCodeExporter commented 8 years ago
Generally I've wanted to avoid making rules beyond the existing specs when we 
don't have to...but in essence this whole IOOS SOS standards effort is an 
exercise in limiting (and augmenting) the existing SOS spec to make things 
easier and less ambiguous for servers and clients. So, although I lean toward 
not mandating the separators, I don't really care that much and am fine with 
strong encouragement.

The encoded and unencoded string literal separators have been a source of great 
confusion during this issue. The recommended rules are:

Use XML entities in TextEncoding to represent special characters. Use of 
standard CSV separators is strongly recommended:
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>

Use unencoded separators (e.g. actual line feeds) in the values block, example:
  25.41,10.23,320
  25.43,10.23,300
  25.39,11.51,310

I want to stay away from saying "use \n in swe:values" because this could be 
interpreted as using the string literal "\n", which is not correct.

Closing tomorrow afternoon pending protests.

Original comment by sh...@axiomalaska.com on 10 Jan 2013 at 10:25
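
With the recommended separators, a simple hard-wired client can treat the values block as conventional CSV. This is a minimal sketch, assuming the text of the values element has already been extracted and that every token is numeric as in the example above (real responses would typically also carry timestamps); the function name read_values is made up for illustration.

```python
import csv
import io


def read_values(values_text):
    """Parse a values block that uses ',' between tokens and line feeds between blocks."""
    reader = csv.reader(io.StringIO(values_text.strip(), newline=""))
    # float() tolerates the leading indentation left on each line.
    return [[float(token) for token in row] for row in reader if row]


VALUES = """
  25.41,10.23,320
  25.43,10.23,300
  25.39,11.51,310
"""

print(read_values(VALUES))
```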

GoogleCodeExporter commented 8 years ago
Added to rules wiki, closing.

http://code.google.com/p/ioostech/wiki/SOSGuidelines_final#Observation_values_TextEncoding_separators

Original comment by sh...@axiomalaska.com on 18 Jan 2013 at 9:30