Closed GoogleCodeExporter closed 8 years ago
Thanks, Mike. That seems reasonable to me. It also seems consistent that a DOM
library such as the minidom would want HTML entities.
I can update the GetObservation template myself, but are there any comments?
Does the OGC SWE Common spec say anything about this? Alex, do you know?
Original comment by emilioma...@gmail.com
on 16 Oct 2012 at 5:44
The SWE 2.0 spec gives an example with &#10; as the blockSeparator, so I think it's safe to say that it's legal. I tried to chase down the definition of CharacterString in ISO 19103 to see if they restrict it at all, but it's one of those "open standards" that you have to pay for to view :P
Original comment by sh...@axiomalaska.com
on 16 Oct 2012 at 5:53
HTML Entities can also be used for the tokenSeparator, but a comma or space
don't pose any problems for the minidom. Perhaps we should just recommend
using HTML entities in place of any special values (back-slashed values) in
these separator attributes for consistency.
Original comment by mike.gar...@gmail.com
on 16 Oct 2012 at 5:59
What is the actual issue? Reading the value of the blockSeparator attribute?
I'm against telling people they should use the HTML entities as a block or line
separator. They should be able to use what they want, like:
<swe2:TextEncoding decimalSeparator="."
tokenSeparator="&&&!!!)))####\n\n\n\r\r\r\r"
blockSeparator="ILOVEBLOCKSEPERATORS"/>
On the argument of "\n" not working with the 'minidom' module in Python 2.6.6,
the HTML block separator will make it incompatible with the 'csv' module, since
the default line terminator is not configurable and will only recognize '\r' or
'\n'. This bit me hard when writing the SWE parser in Python.
From http://docs.python.org/library/csv.html#module-csv:
> Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line,
> and ignores lineterminator. This behavior may change in the future.
I think we should just leave it to the specifications to define the
TextEncoding block and leave it to the parsing libraries to handle the spec
correctly.
Original comment by wilcox.k...@gmail.com
on 16 Oct 2012 at 6:04
All we are doing is using an HTML entity to specify the value of the
separators. We are not replacing the actual separator value with '&#10;' or
anything else. As stated in the issue, the minidom converts '\n' to a
space for some unknown reason. This conversion makes proper decoding
impossible. I have no control over the minidom, but I do have control over my
SWE responses. No one is suggesting that the CSV encodings change.
BTW, let me know how that "&&&!!!)))####\n\n\n\r\r\r\r" separator works for
you. :)
Original comment by mike.gar...@gmail.com
on 16 Oct 2012 at 6:16
Maybe I'm not understanding the problem...
If you change the blockSeparator value to '&#10;', you will have to separate
the blocks with '&#10;' in the CSV or there will be a disconnect.
Original comment by wilcox.k...@gmail.com
on 16 Oct 2012 at 6:28
Any good parser will resolve all XML/HTML entities into their real values
before using them. The blocks are not separated by "\n" either. It's just
another way of stating a character that we cannot print in text format.
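For what it's worth, the three spellings can be compared directly. The following is a minimal sketch (Python 3, standard library only; the un-namespaced TextEncoding element and the block_separator helper are simplifications for illustration, not the real swe2 markup or anyone's parser) showing what ElementTree and the minidom each report for a character reference, a literal line break typed inside the attribute, and a backslash-escaped string:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def block_separator(xml_text):
    """Return the blockSeparator attribute as seen by ElementTree and by minidom."""
    et_value = ET.fromstring(xml_text).get("blockSeparator")
    dom_value = minidom.parseString(xml_text).documentElement.getAttribute("blockSeparator")
    return et_value, dom_value


# Character reference: the parser resolves it to a real line feed.
entity = block_separator('<TextEncoding blockSeparator="&#10;"/>')

# Literal line break inside the attribute: XML attribute-value
# normalization turns it into a single space.
literal = block_separator('<TextEncoding blockSeparator="\n"/>')

# Backslash followed by "n": XML has no backslash escapes, so the two
# characters come back unchanged.
escaped = block_separator(r'<TextEncoding blockSeparator="\n"/>')
```

If the response in question carried a literal line break in the attribute, normalization would account for the space; that is only a guess, since the raw bytes are not shown in this thread.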
Original comment by mike.gar...@gmail.com
on 16 Oct 2012 at 6:43
So I see a couple of issues here.
1. Matching blockSeparator and "CSV" (<swe2:values>) value. Or as Kyle put it:
> If you change the blockSeparator value to '&#10;', you will have to
> separate the blocks with '&#10;' in the CSV or there will be a disconnect.
Hmm, both this and Mike's response seem logical to me ...
2. This (among others) is the reason why we chose to *hard-wire* these settings
rather than let implementations truly do whatever they want! We discussed this
at the February workshop, though some may have easily missed it in the constant
back and forth. We've had for months this prominent inline comment on the
GetObs template:
SWE encoding and data values
swe:encoding *must* be always specified exactly as described below,
to avoid the need to have fully general parsers that interpret
swe:TextEncoding. That is, parsers may hard-code this particular
swe:TextEncoding specification.
<swe2:encoding>
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="\n"/>
</swe2:encoding>
Meaning: at least for Milestone 1.0, all SOS servers must use that
swe2:TextEncoding configuration! We want to make writing clients easier, even
simple, hard-wired ones. If changing blockSeparator="\n" to
blockSeparator="&#10;" makes this easier w/o introducing other problems, and
w/o actually using the string "&#10;" to separate blocks in the swe2:values
block, that's fine with me.
Original comment by emilioma...@gmail.com
on 17 Oct 2012 at 12:12
Original comment by emilioma...@gmail.com
on 17 Oct 2012 at 7:24
[ERIC] I think Kyle's point, and I agree, is that you cannot use
blockSeparator='&#10;' without also using it in the swe2:value block.
AFAICT this is really an XML spec issue, not SOS.
Is it an XML issue? It seems to me more like an SOS spec issue. In fact, I've
found references to exactly the use of blockSeparator='&#10;' together with
the expected swe2:value block using a newline (\n) character:
http://external.opengis.org/twiki_public/HydrologyDWG/GwIeSOSResultFilter
And more clearly, see the sample code listing in p. 89 of OGC SWE Common Data
Model Encoding Standard, OGC 08-094
(http://portal.opengeospatial.org/files/?artifact_id=38475&version=2&format=pdf)
So it would seem Mike isn't off base. Has anyone found a more definitive spec
description?
> What XML lib did Kyle use for his SOS Parser?
[KYLE] I'm using the ElementTree API:
https://github.com/asascience-open/pyoos/blob/master/pyoos/utils/etree.py
Yeah, if you're parsing XML in Python the ElementTree API or something truly
XML-oriented is the way to go. The minidom is probably not the best choice.
Regardless, the issue of parsing library behavior is different from the main
issue of what's compliant with specs and a good practice.
Original comment by emilioma...@gmail.com
on 18 Oct 2012 at 6:01
Although most ISO standards are not free, it is possible to look for their
drafts. So, I found the free draft of the ISO 19103 somewhere in Australia
(why am I not surprised?) at
https://www.seegrid.csiro.au/wiki/pub/AppSchemas/SchemaFormalization/19103_DIS20010712.pdf.
The draft may be different from the final version but I don't think
that this is the case. So, that's what it says about CharacterString:
"A CharacterString is an arbitrary-length sequence of characters including
accents and special characters from repertoire of one of the adopted character
sets:
- ISO/IEC 10646-1: Universal Character Set (UCS) repertoire implementation
level 2 also called the Base Multilingual plane of ISO 10646 (i.e. Including
Latin, Greek, Cyrillic, Arabic, Chinese, Japanese etc.)
- ISO/IEC 10646-2: Universal Character Set repertoire UCS-4, the full
repertoire of ISO 10646.
EXAMPLE "Ærlige Kåre så snø for første gang."
The maximum length of a CharacterString is dependent on encapsulation and
usage. A language tag shall be provided for identification of the language of
string values.
For an implementation mapping of a CharacterString, the handling of the
following four aspects need to be decided:
1) Representation of value
2) Representation of character set
3) Representation of encoding
4) Representation of language
This can be handled for instance by choosing ISO/IEC 10646-1."
So, practically ANY character is allowed in CharacterString, especially
characters that are part of "C0 Control and Basic Latin Set" such as "&" (ISO
10646 code = 0026), "#" (ISO 10646 code = 0023), numbers, etc.
Don't know if it makes your life any easier...
Original comment by abir...@gmail.com
on 18 Oct 2012 at 7:03
from Emilio's comment of Oct 18.
[ERIC] I think Kyle's point, and I agree, is that you cannot use
blockSeparator='&#10;' without also using it in the swe2:value block.
Has anyone tested this in code? Is this a true statement? Does ElementTree
work? Minidom?
from Emilio's comment of Oct 16
So I see a couple of issues here.
1. Matching blockSeparator and "CSV" (<swe2:values>) value. Or as Kyle put it:
> If you change the blockSeparator value to '&#10;', you will have to
> separate the blocks with '&#10;' in the CSV or there will be a disconnect.
Just to be clear, when you say "in the CSV" you mean in the
<swe2:values></swe2:values> block in the same GO response file. You do NOT
mean, in the CSV file that would have resulted from issuing the same SOS
request but changing the output format to CSV and not O&M1.0. Correct?
It seems like our options are:
1. Leave the blockSeparator unconstrained and open to interpretation by the
data provider. The risk is that minidom based parsers (and presumably some
others) may not be able to parse the output.
2. Tightly constrain the blockSeparator and tokenSeparator (eliminate \n as a
possibility) and make certain that the chosen set of separators works with the
XML parsing libraries we know of and use regularly. There is still a risk of
choosing a separator that is incompatible with a library. There is also the
risk (or certainty) that Kyle will not be able to use
http://docs.python.org/library/csv.html#module-csv
in his parsing library and will have to find another solution.
If someone can verify the statement at the top of this message and provide a
comment or clear statement of the guidance then maybe we can put this to bed.
Please vote for option 1 or 2 or provide another recommendation.
Original comment by dpsnowde...@gmail.com
on 23 Oct 2012 at 9:43
Page 107 of the SWE 2.0 spec (08-094r1) says:
Special characters such as carriage returns (CR) or line feeds (LF) can be used as
block or token separators by using XML entities. For example new line characters
are often used as block separators to cleanly separate blocks of values on
successive lines:
<swe:TextEncoding tokenSeparator=";" blockSeparator="&#10;"/>
This corresponds to the following data block format:
25.41;10.23;320
25.43;10.23;300
25.39;11.51;310
This is compatible with the CSV format and is often used for compatibility
with other software.
So...the spec seems to be pretty clear about using XML entities to specify
special characters. Kyle, can you run the response through an unencoder before
feeding it to module-csv?
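A rough sketch of that idea in Python 3 (the data block is the one from the spec excerpt above; the variable names are illustrative only): once the XML parser has resolved the entity, the block separator inside the values text is an ordinary line feed, so the csv module may be able to read it without a separate unencoding step.

```python
import csv
import io

# Text of the values element after XML parsing; the blocks are separated
# by real line feeds, not by the characters of the entity.
values_text = "25.41;10.23;320\n25.43;10.23;300\n25.39;11.51;310"

# tokenSeparator=";" maps onto the csv delimiter; the reader recognises
# the line feed as end-of-line on its own.
rows = list(csv.reader(io.StringIO(values_text), delimiter=";"))
```

Each entry of rows is one block, split into its tokens as strings.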
Original comment by sh...@axiomalaska.com
on 23 Oct 2012 at 11:26
I don't think we need to make a recommendation here. Whatever is in the
blockSeparator is what should actually be separating the blocks, and that is
covered in the SWE spec.
If the blockSeparator is "&#10;" then the data should be separated by "&#10;"
and not "\n".
As long as they are consistent the parsing is not an issue.
Original comment by wilcox.k...@gmail.com
on 24 Oct 2012 at 3:14
Hi Kyle, and congrats. Hope all is well.
This stuff gets confusing since we're talking about both encoded or escaped
representations of separators and string literals. What I think you're saying
is that parsers should consider whatever is specified as a blockSeparator (or
any other separator in swe:TextEncoding) as a string literal and look for
exactly that literal when parsing the swe:values. That's not what I'm seeing in
the SWE 2.0 spec; the spec indicates that XML entities specified as separators
in swe:TextEncoding should be resolved to the target ASCII character, and that
this resolved string should be used when parsing swe:values. The spec says
nothing about backslash-escaped strings like \n.
That's not an endorsement of the approach, but that's how I'm reading the SWE
2.0 spec (section 8.5.2 pg 107 and all examples in the doc). Maybe I'm missing
something.
Original comment by sh...@axiomalaska.com
on 24 Oct 2012 at 5:11
Adding some discussion from the mailing list:
Eric Bridger:
> I've changed my mind about this issue.
> I ran some tests using the Perl binding to C LibXML; lxml is the Python
> binding to the same library.
> Used: GO-Station-SingleFT-timeSeries-MultiSensor.xml as the starting point.
> BOTH blockSep="\n" and blockSep="&#10;" work on swe2:values where \n is the
> block separator.
> Interestingly it also works where swe2:values uses &#10; as the block
> separator.
> AND with blockSeparator="\n" and &#10; in swe2:values
> I can share this code and example files if wanted.
Derrick Snowden:
> Not sure I remember what your first impression was. Can you state what your
> recommendation is based on your test?
Eric Bridger:
> Basically that blockSeparator="&#10;" should work fine, both with minidom and
> other XML parsers.
> The <swe2:values> string can look like what Shane found in the SWE 2.0 spec
> examples, note the rows have newlines at the end of each line.
> AFAICT blockSeparator="\n" will also work for most parsers, but not python
> minidom, as Mike discovered.
Original comment by sh...@axiomalaska.com
on 26 Oct 2012 at 9:29
After some thinking, I do agree with Kyle that we shouldn't make a rule
specifying which separators to use; there's no real reason for us to go outside
of the spec here, and if we don't have to we shouldn't. Clients should
preprocess documents if they need to convert to something friendly to their
parsing library.
BUT, also in keeping with the spec, we should use encoded XML entities (like
&#10;) when specifying special characters like linefeeds in the separators, instead
of using backslash-escaped characters.
Side note: since the spec says that encoded XML entities in the separators
should be unencoded before parsing the swe:values, we can't use entities
representing characters that are illegal in XML documents like &, <, >, etc,
since this would require putting those illegal values directly in swe:values.
At least that's how I interpret it.
We need to push to resolve this issue as it's one of the things holding up the
completion of Milestone 1.0.
Original comment by sh...@axiomalaska.com
on 26 Oct 2012 at 9:51
+1. I think the confusion arose because Kyle said if you use
blockSeparator="&#10;" then the swe:values element must also contain literal
'&#10;'s, unless I misunderstood him.
So I agree, no rule specifying what separator to use, but the template examples
should use "&#10;"
Original comment by ebrid...@gmri.org
on 27 Oct 2012 at 3:20
Suggest we close this issue. Unless there is a dissenting opinion I'll close
it on 1/7/2013.
Summary:
no rule specifying what separator to use, but the template examples should use "&#10;"
BUT, also in keeping with the spec, we should use encoded XML entities (like
&#10;) when specifying special characters like linefeeds in the separators, instead
of using backslash-escaped characters.
Side note: since the spec says that encoded XML entities in the separators
should be unencoded before parsing the swe:values, we can't use entities
representing characters that are illegal in XML documents like &, <, >, etc,
since this would require putting those illegal values directly in swe:values.
At least that's how I interpret it
Original comment by dpsnowde...@gmail.com
on 3 Jan 2013 at 5:08
Sorry to be a contrarian, but I'd like to tweak Derrick's summary (which is the
consensus opinion so far):
no rule specifying what separator to use, but the template examples should use "&#10;"
I'd add this qualifier: "and this convention is strongly recommended"
I think my difference in perspective is that I'm not thinking exclusively of
hard-core developers and programmers (or pre-existing client libraries) as the
only audience. I'd like to see our responses be as friendly as possible to
sloppy programmers who want to maximize hard-coding. I like describing the
swe2:values block as basically "conventional CSV", except om:result must also
be parsed in order to get access at the header info and block interpretation.
I see this as being little different than saying that we strongly recommend the
use of decimalSeparator="." and tokenSeparator="," in swe2:TextEncoding:
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>
E.g., do we seriously want to allow IOOS SOS servers to choose to use, say,
decimalSeparator=","?? The Hispanic in me says that'd be pretty cool (both ,
and . are in common usage in Spanish-speaking countries, depending on the
degree of European vs. American influence), but the pragmatic programmer in me
says that'd be f**d up, even if fully legal and standard-compliant.
Finally, just to make sure the recommendation being made is 100% clear, are we
saying that:
- IF you use blockSeparator="&#10;" in swe2:TextEncoding:
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>
- THEN also use "&#10;" (not \n) as the block separator in swe2:values
??
Based on Shane's Comment 13 and other comments, is there anything wrong with
saying we strongly recommend using blockSeparator="&#10;" in swe2:TextEncoding,
and \n as the block separator in swe2:values?
Original comment by emilioma...@gmail.com
on 7 Jan 2013 at 1:59
Generally I've wanted to avoid making rules beyond the existing specs when we
don't have to...but in essence this whole IOOS SOS standards effort is an
exercise in limiting (and augmenting) the existing SOS spec to make things
easier and less ambiguous for servers and clients. So, although I lean toward
not mandating the separators, I don't really care that much and am fine with
strong encouragement.
The encoded and unencoded string literal separators have been a source of great
confusion during this issue. The recommended rules are:
Use XML entities in TextEncoding to represent special characters. Use of
standard CSV separators is strongly recommended:
<swe2:TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>
Use unencoded separators (e.g. actual line feeds) in the values block, example:
25.41,10.23,320
25.43,10.23,300
25.39,11.51,310
I want to stay away from saying "use \n in swe:values" because this could be
interpreted as using the string literal "\n", which is not correct.
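To make the two rules concrete, here is a minimal sketch of how a simple client might apply them, in Python 3 with ElementTree. The fragment is a hypothetical, un-namespaced stand-in for an om:result (real responses use the swe2 namespace and carry more structure), and the variable names are illustrative only. The separators are read from TextEncoding after the parser has resolved the entity, then used to split the values text:

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified fragment: entity in TextEncoding, real line
# feeds between the blocks in values.
fragment = """<result>
  <encoding>
    <TextEncoding decimalSeparator="." tokenSeparator="," blockSeparator="&#10;"/>
  </encoding>
  <values>25.41,10.23,320
25.43,10.23,300
25.39,11.51,310</values>
</result>"""

root = ET.fromstring(fragment)
encoding = root.find("encoding/TextEncoding")

# After parsing, blockSeparator is a one-character line feed string.
block_sep = encoding.get("blockSeparator")
token_sep = encoding.get("tokenSeparator")

# Split the values text into blocks, then each block into numeric tokens.
# float() is sufficient here only because decimalSeparator is ".".
values_text = root.findtext("values").strip()
rows = [
    [float(token) for token in block.split(token_sep)]
    for block in values_text.split(block_sep)
]
```

A hard-wired client could skip reading TextEncoding altogether and split on the line feed and comma directly, which is the case the strong recommendation is meant to keep working.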
Closing tomorrow afternoon pending protests.
Original comment by sh...@axiomalaska.com
on 10 Jan 2013 at 10:25
Added to rules wiki, closing.
http://code.google.com/p/ioostech/wiki/SOSGuidelines_final#Observation_values_TextEncoding_separators
Original comment by sh...@axiomalaska.com
on 18 Jan 2013 at 9:30
Original issue reported on code.google.com by
mike.gar...@gmail.com
on 16 Oct 2012 at 5:32