Broken parsing of CLDR files

neos / flow-development-collection

The unified repository containing the Flow core packages, used for Flow development.

https://flow.neos.io/

MIT License

135 stars 188 forks source link

Broken parsing of CLDR files #2364

Open rkrahnen opened 3 years ago

rkrahnen commented 3 years ago

I think the CldrParseris broken for at least supplementalData.xml. And because of that the CurrencyReader is broken too. The function generateFractions ist writing an array with just one entry:

array(1)
 integer 0 => array(2)
   'digits' (6) => integer 0
   'rounding' (8) => integer 0

I am able to make it a bit better by fixing the foreach inside that function. Current:

foreach ($currencyData['fractions'] as $fractionString) {
    $currencyCode = $model->getAttributeValue($fractionString, 'iso4217');
    $this->fractions[$currencyCode] = [
        'digits' => (int)$model->getAttributeValue($fractionString, 'digits'),
        'rounding' => (int)$model->getAttributeValue($fractionString, 'rounding')
    ];
}

Fixed:

foreach ($currencyData['fractions'] as $fractionString => $_) {
    $currencyCode = $model->getAttributeValue($fractionString, 'iso4217');
    $this->fractions[$currencyCode] = [
        'digits' => (int)$model->getAttributeValue($fractionString, 'digits'),
        'rounding' => (int)$model->getAttributeValue($fractionString, 'rounding')
    ];
}

The value $_ is just an empty string, because the xml tag has no value. And the attributes are provided inside the key $fractionString. The created array has 62 entries now, but all entries look the same:

array(62)
 'ADP' (3) => array(2)
   'digits' (6) => integer 0
   'rounding' (8) => integer 0
 'AFN' (3) => array(2)
   'digits' (6) => integer 0
   'rounding' (8) => integer 0
 'ALL' (3) => array(2)
   'digits' (6) => integer 0
   'rounding' (8) => integer 0
...

It is always two zeros. This is because the $fractionString does not contain all attributes, only the distinguished ones. Which is only the iso4217 in that case. The other attributes seem to be lost. But they are needed here.

I found that in Flow 5.3.

Currently I am not able to make a solution because I am not knowing what is your intention of distinguishing attributes. For me the parser looks broken as it only saves these and looses all other attributes.

kdambekalns commented 3 years ago

I aam pretty sure it worked "back then". One thing to investigate is the updates done to the XML over time, maybe there have been changes that broke the parsing…

kdambekalns commented 3 years ago

@kdambekalns bookmark

rkrahnen commented 3 years ago

Could you maybe explain what the need for distinguishing attributes is? For me all attributes a tag has should be relevant. As far as i can see this list of distinguishing attributes filters the attributes and all other attributes are not saved to the result.

rkrahnen commented 3 years ago

@kdambekalns To be able to provide a solution I need an answer to my question. :-)

albe commented 3 years ago

One thing to investigate is the updates done to the XML over time, maybe there have been changes that broke the parsing…

I looked back to the file from 2008 and the format of the fractions haven't changed until now, so that is not the issue. The errorenous behaviour seems to indeed stem from the isDistinguishingAttribute() check, which will lead to not all attributes being parsed. That change comes from 327f3fdcf3b41d0637d5f61d38eb7108721620ed, which is from 2010. We then consecutively added more and more attributes to the list, extending the CLDR definition of "distinguishing attribute" to make some use cases work again (like locales, see c5cf2fa7d7528b517f6aa9f25f83fa9397cb568a and 3772af7849f13b73ebd186c6524df4f3fad342d2). I think this is wrong.

Unfortunately, our CurrencyReaderTest just assumes that the information is parsed into 'info[@iso4217="CHF"][@digits="2"][@rounding="5"]' and tests based on that. In fact, the fractions infos will be parsed into ['info[@iso4217="CHF"]' => ''] - so the digits and rounding attributes are completely lost. The reason for the "distinguishing attributes" seems to have a minimal unique identifier for the element as the array key. I think the check should rather make sure that the element contains at least one distinguishing attribute, rather than filter by them.

So either we just parse all attributes into the array key (and maybe skip elements that do not have distinguishing attributes), or we change the parsed structure such that non-distinguishing attributes are somehow stored in the array value (though I'm unsure how and this would be pretty breaky, hence I'm in favor of the former).

rkrahnen commented 3 years ago

I think the check should rather make sure that the element contains at least one distinguishing attribute, rather than filter by them.

I agree with you that we should not filter by them, but I even would not check for at least one distinguishing attribute. I would expect the parser to give me as much information as possible. If we decide to check for existing attributes, we maybe filter entries that someone else would like to get.

The only problem I see is the case, where all attributes in two entries are equal and lead to the same identifier in the array. Currently you are skipping entries with already existing distinguishing attributes:

if (!isset($parsedNode[$nameOfChild])) {
    // We accept only first child when they are non distinguishable (i.e. they differs only by non-distinguishing attributes)
    $parsedNode[$nameOfChild] = $parsedChild;
}

This could stay as it is now or we could add a counter to $nameOfChild, something like this:

foreach ($node->children() as $counter => $child) {
    $nameOfChild = $child->getName() . '[counter=' . $counter . ']';

    $parsedChild = $this->parseNode($child);

    if (count($child->attributes()) > 0) {
        $parsedAttributes = '';
        foreach ($child->attributes() as $attributeName => $attributeValue) {
            if ($this->isDistinguishingAttribute($attributeName)) {
                $parsedAttributes .= '[@' . $attributeName . '="' . $attributeValue . '"]';
            }
        }

        $nameOfChild .= $parsedAttributes;
    }
}

Because all attributes have an @ as prefix there would be no conflict with an attribute named counter. Of cause this would be an additional information that is not existing in the parsed XML itself, but it would solve the duplication problem.

albe commented 3 years ago

I agree with you that we should not filter by them, but I even would not check for at least one distinguishing attribute. I would expect the parser to give me as much information as possible. If we decide to check for existing attributes, we maybe filter entries that someone else would like to get.

It's more about detecting cases where we actually loose entries that someone would like to get. But yes, we could also do that in the else clause of if (!isset($parsedNode[$nameOfChild])) { and even then we still need to decide what to do other than silently skip... So the foremost question is, are there actually any cases of duplicate nodes in the CLDR?

The idea with the counter might work, though I'm unsure how to correctly query this. I mean, no one will probably know the order in which the elements in the CLDR occur. Also note that the getNodeName() method of the CLDR Model checks for the first occurence of an [@ attribute, so this would break:

public static function getNodeName(string $nodeString): string
    {
        $positionOfFirstAttribute = strpos($nodeString, '[@');

Sidenote: did we ever check if our custom implementation of parsing the CLDR and rebuilding a kind-of XPath equivalent was worth it in regards to performance vs. maintenance? I assume our implementation is faster than SimpleXmlElement + XPath query, at least once the parsed structure is cached. But that's just an assumption.

rkrahnen commented 3 years ago

So the foremost question is, are there actually any cases of duplicate nodes in the CLDR?

Maybe not in the data we currently use, but in general there are cases like this (characters.xml):

<character value = "―">
    <substitute>—</substitute>
    <substitute>-</substitute>
</character>

Also note that the getNodeName() method of the CLDR Model checks for the first occurence of an [@ attribute, so this would break...

Good catch, we would have to change that method accordingly. But as this function is not marked with @api it should not be a problem in a bugfix release? If we would remove the whole distinguishing attribute logic and add a counter instead, 'info[@iso4217="CHF"][@digits="2"][@rounding="5"]' would become something like 'info[counter="0"][@iso4217="CHF"][@digits="2"][@rounding="5"]' and the getNodeName() could check for the first occurrence of [. As far as i know that is a character that is not in element names.

albe commented 3 years ago

Maybe not in the data we currently use, but in general there are cases like this (characters.xml):
<character value = "―">
<substitute>—</substitute>
<substitute>-</substitute>
</character>

So the question is: How should this be used? How should the path query look to retrieve the one or the other. If we follow XPath, it should be /character/substitute[1] to select the first and /characters/subtitute[2] to select the second. So given that, maybe we should instead choose $nameOfChild = $child->getName() . '[' . ($counter + 1) . ']';, but only for elements that have otherwise exactly matching siblings (so needs to do a look-ahead, or post-process keys once a first duplicate is found).

After taking a look at the "querying" in the CLDRModel - it is pretty naiive - I think blindly adding ALL attributes to the node name will make querying harder (and possibly break things):

    public function getRawData(string $path)
    {
        if ($path === '/') {
            return $this->parsedData;
        }

        $pathElements = explode('/', trim($path, '/'));
        $data = $this->parsedData;

        foreach ($pathElements as $key) {
            if (isset($data[$key])) {
                $data = $data[$key];
            } else {
                return false;
            }
        }

You'd then need to specify every single existing attribute in the CLDR for all nodes if you want to query, e.g. /localeDisplayNames/languages/language[@type="zh_Hans"][@alt="long"][@draft="contributed"] for the locale <language type="zh_Hans" alt="long" draft="contributed">Simplified Mandarin Chinese</language>, while for <language type="zh_Hans">Simplified Chinese</language> it is /localeDisplayNames/languages/language[@type="zh_Hans"] - the problem I see is the newly required draft attribute. So I conclude that this is the actual reason for the distinguishing attributes.

Soooo... we've come full circle then, because we do need the attributes in the name to actually retrieve them, but we should only add the minimal required to keep the path expressions sensible (the use case in this issue is a worst-case example - the attribute contains the actual information, the node content is empty). Hence the right solution would be to extend the parsed data format, or write a much more complex path following logic. OR check if we can't simply use XPath directly (check performance penalty).

rkrahnen commented 3 years ago

Then maybe we should go with a fix for the broken currency reader first? Like for other attributes we add the required ones for the currency reader in isDistinguishingAttribute():

$distinguishingAttributes[] = 'digits';
$distinguishingAttributes[] = 'rounding';

And then we do the fix for the CurrencyReader from my first comment.

Everything else seems to be harder than I expected first and may break, so I would love to have the bugfix first and maybe come back again to this topic for Flow 8.0?