sdmx-twg / sdmx-csv

This repository is used for maintaining the SDMX-CSV message specifications.

Escaping the character used as separator between dimension values in a key #28

Open sosna opened 1 year ago

sosna commented 1 year ago

The field guide states the following about keys (the highlight is mine):

The column will contain the combination of IDs/values for all the dimensions, order by their order in the data structure definition and separated by a dot character (.), e.g. M.USD.EUR.SP00.2020-01

However, the guide does not specify what to do when a dimension value contains a dot, i.e. the character used as the separator. This cannot happen for coded dimensions (as the . is not an allowed character in an SDMX code), but could happen if:

So, if a dimension value contains a dot, what should the service provider do when building the series and/or observation keys of SDMX-CSV data messages?

Thank you.

dosse commented 1 year ago

Indeed, if a dimension is non-coded and has no restrictions, then any separator character could clash with arbitrary characters used in dimension values. A suitable solution seems to be to escape the affected dimension values using the classical CSV escaping mechanism, i.e. enclosing problematic strings in surrounding double-quotes ("), e.g. M.ABC."A string with . (dot) and "" (double-quotes)".2020-01. Note that double-quotes themselves are escaped in CSV by doubling them.
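To make the proposal concrete, here is a minimal Python sketch of CSV-style quoting applied to key components. It is only an illustration of the idea discussed here, not something defined by the SDMX-CSV guide; the function names `quote_component`, `build_key` and `split_key` are hypothetical.

```python
def quote_component(value: str, sep: str = ".") -> str:
    """Wrap a component in double-quotes if it contains the separator
    or a double-quote; embedded double-quotes are doubled (CSV style)."""
    if sep in value or '"' in value:
        return '"' + value.replace('"', '""') + '"'
    return value


def build_key(components, sep: str = ".") -> str:
    """Join dimension values into a key, quoting where needed."""
    return sep.join(quote_component(c, sep) for c in components)


def split_key(key: str, sep: str = ".") -> list:
    """Split a key back into its components, honouring quoted sections."""
    parts, buf, in_quotes, i = [], [], False, 0
    while i < len(key):
        ch = key[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(key) and key[i + 1] == '"':
                    buf.append('"')  # doubled quote -> literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True  # opening quote
        elif ch == sep:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts


values = ["M", "ABC", 'A string with . (dot) and " (double-quotes)', "2020-01"]
key = build_key(values)
print(key)
print(split_key(key) == values)
```

One caveat with this approach: a key containing double-quotes is itself a CSV field, so a CSV writer would have to quote and escape the whole key column a second time when writing the file.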

Note also that, for consistency, the key specification of SDMX-CSV is aligned with the key specification of the SDMX REST syntax in https://github.com/sdmx-twg/sdmx-rest/blob/master/doc/data.md. What would be the solution for this issue in the SDMX REST syntax?

sosna commented 1 year ago

Thanks a lot, @dosse! Maybe this could be added as an example to the SDMX-CSV guide? For REST, I guess we would need to investigate this separately, as we have the additional restriction of the characters that may be used as path or query parameters.

dosse commented 9 months ago

@sosna Thanks! Maybe the CSV solution should then wait to see if things can be aligned between REST and CSV?

egreising commented 9 months ago

Interesting discussion, @sosna and @dosse. What about using the encoding as in URLs?

It's not very user friendly, but if it is only used as a way of escaping problematic characters, it could probably be reduced to the dot (.) as %2E, the double quotes (") as %22, the comma (,) as %2C and very few more, so they will soon become well-known codes.
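A rough Python sketch of this percent-encoding idea, restricted to a small set of characters, could look as follows. The `ESCAPES` table and the function names are hypothetical. Encoding the percent sign itself as %25 is an assumption added here so that decoding stays unambiguous; it is not part of the suggestion above.

```python
from urllib.parse import unquote

# Hypothetical minimal escape table; "%" is included (an assumption)
# so that a literal "%2E" in a value survives a round trip.
ESCAPES = {"%": "%25", ".": "%2E", '"': "%22", ",": "%2C"}


def encode_component(value: str) -> str:
    """Percent-encode only the characters listed in ESCAPES."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def build_key(components) -> str:
    """Join encoded dimension values with the dot separator."""
    return ".".join(encode_component(c) for c in components)


def split_key(key: str) -> list:
    """Split on the separator first, then decode each component."""
    return [unquote(part) for part in key.split(".")]


values = ["M", "U.S. dollar, spot", "2020-01"]
key = build_key(values)
print(key)
print(split_key(key) == values)
```

Note that `urllib.parse.quote` cannot be used directly for the encoding step, because it treats the dot as a safe character and leaves it unescaped.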

sosna commented 9 months ago

@dosse: Thanks. Yes, sure, we can park it for the time being, if this is your preference.

@egreising: Thanks. I think this would solve the problem for SDMX-CSV indeed, but maybe not for SDMX-REST, as Jens pointed out? I think browsers will typically send query strings and path parameters as percent-encoded values? If this is so, then, I guess it would not help, i.e. how could we distinguish between a %2E that is used as key separator and a %2E that is used as normal character in an uncoded dimension value?
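The ambiguity can be illustrated with a short Python sketch. It assumes, hypothetically, that some component between client and service (a browser, proxy or web framework) percent-decodes the path before the key is split; the function names are made up for this example. Since the dot is an unreserved character in RFC 3986, %2E and . are considered equivalent in a URI, so such normalization is permitted.

```python
from urllib.parse import unquote


def decode_then_split(raw_key: str) -> list:
    """Simulates a stack that percent-decodes the path before the
    service sees it: escaped dots become indistinguishable from separators."""
    return unquote(raw_key).split(".")


def split_then_decode(raw_key: str) -> list:
    """Only works if the service receives the raw, still-encoded key."""
    return [unquote(part) for part in raw_key.split(".")]


raw = "M.U%2ES%2E dollar.2020-01"
print(decode_then_split(raw))  # five components: the value was torn apart
print(split_then_decode(raw))  # three components, as intended
```

Whether a given REST stack exposes the raw path or the decoded one is implementation-specific, which is why the percent-encoding option would need separate investigation for SDMX-REST.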