microsoft / XmlNotepad

XML Notepad provides a simple intuitive User Interface for browsing and editing XML documents.
https://microsoft.github.io/XmlNotepad/
MIT License

Incomplete schema validation of large XML files ( > ~20 MB) #329

Open AikenBM opened 12 months ago

AikenBM commented 12 months ago

I'm using XML Notepad 2.9.0.5 on Windows 10 Enterprise 22H2 19045.3208.

I've discovered that XML Notepad's schema validation gives up around the 20 MB mark. The last entry it adds to the error list shows a line number and a column number of 0, and after that the program stops reporting any further schema errors.

I am attaching a zipped 42 MB XML file that contains 100 schema validation errors. This example uses sample data from the state of Michigan's Department of Education state reporting system because that's what I was doing when I found the problem.

SchemaErrors.zip

XML Notepad validates and identifies the first 43 or so errors, but the last one listed doesn't appear to populate the table the same way, and it stops after that error. Even exporting the list doesn't show the remaining errors. Here's a screenshot of the error list:

[Screenshot of the XML Notepad error list]

Based on my incidental testing, any schema validation error after roughly the 20 millionth character or the 565,000th line in the file will fail in this way. Only the first error in that range shows up in the error list, and its entry does not display accurate information. I don't see any setting in the application's options to raise this apparent limit.

I've also included the PowerShell function below, which uses System.Xml.XmlReader to validate the same file against its schema. It correctly identifies all 100 schema errors.

function Validate-XmlFile {
    [CmdletBinding()]
    param (
        # The path to the XML file
        [Parameter(Mandatory = $true, Position = 0)]
        [String]
        $Path,

        # Instead of outputting the warnings, just return true if valid and false if invalid
        [Switch]
        $IsValid,

        # Force the process to download the schema again instead of potentially using a cached version
        [Switch]
        $ForceSchemaRefresh
    )

    process {
        $XmlFileName = Get-Item $Path -ErrorAction Stop

        $XmlReaderSettings = New-Object -TypeName System.Xml.XmlReaderSettings
        $XmlReaderSettings.ValidationType = [System.Xml.ValidationType]::Schema
        $XmlReaderSettings.ValidationFlags = ([System.Xml.Schema.XmlSchemaValidationFlags]::ProcessInlineSchema -bor
            [System.Xml.Schema.XmlSchemaValidationFlags]::ProcessSchemaLocation -bor
            [System.Xml.Schema.XmlSchemaValidationFlags]::ReportValidationWarnings -bor
            [System.Xml.Schema.XmlSchemaValidationFlags]::ProcessIdentityConstraints)

        $XmlUrlResolver = [System.Xml.XmlUrlResolver]::new()
        # Some versions of PowerShell require credentials. Use anonymous credentials to satisfy the class.
        $XmlUrlResolver.Credentials = [System.Net.NetworkCredential]::new('anonymous','anonymous@example.com')
        # Revalidate cached schemas by default; -ForceSchemaRefresh bypasses the cache instead
        $XmlUrlResolver.CachePolicy = [System.Net.Cache.RequestCacheLevel]::Revalidate
        if ($ForceSchemaRefresh) {
            $XmlUrlResolver.CachePolicy = [System.Net.Cache.RequestCacheLevel]::Reload
        }
        $XmlReaderSettings.XmlResolver = $XmlUrlResolver

        # Create the validation handler to capture the validation errors and warnings
        $script:ValidationOutput = [System.Collections.Generic.List[String]]::new()
        $ValidationEventHandler = [System.Xml.Schema.ValidationEventHandler] {
            # $_ is the second argument of type System.Xml.Schema.ValidationEventArgs
            $script:ValidationOutput.Add(("{0} on line {2}: {1}" -f $_.Severity, $_.Message, $_.Exception.LineNumber))
        }
        $XmlReaderSettings.add_ValidationEventHandler($ValidationEventHandler)

        $XmlReader = $null
        try {
            $XmlReader = [System.Xml.XmlReader]::Create($XmlFileName.FullName, $XmlReaderSettings)
            # Loading the document drives the validating reader, which raises the validation events
            [System.Xml.XmlDocument]::new().Load($XmlReader)
            if (!$IsValid) {
                if ($script:ValidationOutput.Count -eq 0) {
                    Write-Host "No validation errors in file '$($XmlFileName.FullName)'."
                }
                else {
                    # Write-Host "Validation errors written to '$ValidationErrorFile'"
                    # $script:ValidationOutput | Set-Content -Path $ValidationErrorFile -Encoding ascii
                    $script:ValidationOutput
                    Write-Warning ("{0:n0} errors detected in '{1}'" -f $script:ValidationOutput.Count, $XmlFileName.FullName)
                }
            }
            else {
                return ($script:ValidationOutput.Count -eq 0)
            }
        }
        finally {
            # Guard against a null reader in case XmlReader.Create threw
            if ($null -ne $XmlReader) {
                $XmlReader.Dispose()
            }
        }
    }
}

Note that I typically use PowerShell v7.3 with the above function. I'm not sure whether it still works with Windows PowerShell v5.1.
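For a file this size, building a full XmlDocument is not required just to collect validation events, because the validating reader raises them as it reads. The following is a minimal sketch of that streaming variant, not a tested replacement for the function above. `Get-XmlSchemaError` is a hypothetical name, and passing the XSD path explicitly (instead of resolving it from the document's schema location) is an assumption made to keep the sketch self-contained.

```powershell
# Hypothetical helper: streams the file through a validating XmlReader and
# returns one string per validation event. The explicit -XsdPath parameter is
# an assumption; the function above resolves the schema from the document.
function Get-XmlSchemaError {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true)] [string] $XmlPath,
        [Parameter(Mandatory = $true)] [string] $XsdPath
    )

    # Resolve to full paths, because .NET does not follow PowerShell's current location
    $xmlFullPath = (Resolve-Path -LiteralPath $XmlPath).ProviderPath
    $xsdFullPath = (Resolve-Path -LiteralPath $XsdPath).ProviderPath

    $settings = [System.Xml.XmlReaderSettings]::new()
    $settings.ValidationType = [System.Xml.ValidationType]::Schema
    # Keep the default flags and additionally report warnings
    $settings.ValidationFlags = $settings.ValidationFlags -bor
        [System.Xml.Schema.XmlSchemaValidationFlags]::ReportValidationWarnings
    # [NullString]::Value passes a real null, so the XSD's own targetNamespace is used
    [void]$settings.Schemas.Add([NullString]::Value, $xsdFullPath)

    $messages = [System.Collections.Generic.List[string]]::new()
    $handler = [System.Xml.Schema.ValidationEventHandler] {
        param($source, $e)
        $messages.Add(('{0} on line {1}: {2}' -f $e.Severity, $e.Exception.LineNumber, $e.Message))
    }
    $settings.add_ValidationEventHandler($handler)

    $reader = $null
    try {
        $reader = [System.Xml.XmlReader]::Create($xmlFullPath, $settings)
        # Reading node by node raises the validation events without building a DOM
        while ($reader.Read()) { }
    }
    finally {
        if ($null -ne $reader) { $reader.Dispose() }
    }

    return $messages
}
```

It could be called as `Get-XmlSchemaError -XmlPath .\data.xml -XsdPath .\schema.xsd`, where both file names are placeholders.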

lovettchris commented 12 months ago

Excellent bug report. I'll check it out, thanks.

gitnol commented 3 months ago

I want to add that XML Notepad's validation of characters below 0x20 is not correct, or at least does not fully follow the Char production defined here: https://www.w3.org/TR/2006/REC-xml-20060816/#NT-Char

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

I had a file containing the character reference &#xE; (bytes 26 23 78 45 3B, Shift Out), and it was reported as invalid: https://en.wikipedia.org/wiki/Shift_Out_and_Shift_In_characters

[Screenshot]

But the character reference &#x1F; (bytes 26 23 78 31 46 3B, US = Unit Separator) was not reported as invalid in the same file: https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_separators

[Screenshot]
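Under the Char production quoted above, both U+000E and U+001F fall outside every allowed range, so the two references would be expected to get the same result. A minimal sketch of that range check follows. `Test-XmlChar` is a hypothetical helper written for illustration and is not part of XML Notepad.

```powershell
# Hypothetical helper: returns $true when a code point matches the XML 1.0
# Char production, and $false otherwise.
function Test-XmlChar {
    param (
        [Parameter(Mandatory = $true)]
        [int]
        $CodePoint
    )

    # Each clause mirrors one alternative of the Char production
    $isChar = ($CodePoint -eq 0x9) -or
        ($CodePoint -eq 0xA) -or
        ($CodePoint -eq 0xD) -or
        ($CodePoint -ge 0x20 -and $CodePoint -le 0xD7FF) -or
        ($CodePoint -ge 0xE000 -and $CodePoint -le 0xFFFD) -or
        ($CodePoint -ge 0x10000 -and $CodePoint -le 0x10FFFF)
    return $isChar
}

Test-XmlChar -CodePoint 0xE     # False (Shift Out)
Test-XmlChar -CodePoint 0x1F    # False (Unit Separator)
Test-XmlChar -CodePoint 0x9     # True  (Tab)
```

Only the tab, line feed, and carriage return pass among the code points below 0x20.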

I hope this helps with fixing the schema validation.