ikorin24 / U8XmlParser

Extremely fast UTF-8 xml parser library
MIT License
93 stars 13 forks source link

FormatException thrown when parsing DOCTYPE #26

Closed dreasgrech closed 2 years ago

dreasgrech commented 2 years ago

I am currently trying to parse this xml file:

<?xml version="1.0"?>
<!DOCTYPE datafile PUBLIC "-//FB Alpha//DTD ROM Management Datafile//EN" "http://www.logiqx.com/Dats/datafile.dtd">

<datafile>
</datafile>

A FormatException is being thrown when parsing the DOCTYPE element.

System.FormatException
  HResult=0x80131537
  Message=Exception of type 'System.FormatException' was thrown.
  Source=U8XmlParser
  StackTrace:
   at U8Xml.XmlParser.<TryParseDocType>g__SkipUntil|17_0(Byte ascii, RawString data, Int32& i)
   at U8Xml.XmlParser.TryParseDocType(RawString data, Int32& i, Boolean hasNode, OptionalNodeList optional, RawStringTable& entities)
   at U8Xml.XmlParser.StartStateMachine(RawString data, CustomList`1 nodes, CustomList`1 attrs, OptionalNodeList optional, RawStringTable& entities)
   at U8Xml.XmlParser.ParseCore(UnmanagedBuffer& utf8Buf, Int32 length)
   at U8Xml.XmlParser.ParseFileCore(String filePath, Encoding encoding)
   at U8Xml.XmlParser.ParseFile(String filePath, Encoding encoding)
   at U8Xml.XmlParser.ParseFile(String filePath)

Tested on U8XmlParser v1.5.0

ikorin24 commented 2 years ago

Thanks for the report.

I have been aware of this issue for some time. It is currently not implemented correctly. I am reluctant to solve this problem correctly.

In order to resolve DTD placed by URI on external servers, etc., it needs to communicate with them using HttpClient or similar. I am providing this library as a simple xml parser and do not want to add extra library dependencies if at all possible.

It is possible to fix the error by skipping the DTD definition, but it does not solve the underlying problem. (Especially if the DTD contains definitions of ENTITY, i.e., character aliases, the xml will not be interpreted correctly.)

dreasgrech commented 2 years ago

So is there currently a way to programmatically skip the DTD definition when parsing and specifying it manually instead, or at least determine if it's there and skip it and not throw a FormatException?
Maybe some form of "header pre-parsing" before actually digging down to the real parsing to determine that there is an external DTD?

Because it would be a real shame to have it just fail and throw an exception whilst parsing a valid XML file.

ikorin24 commented 2 years ago

Because it would be a real shame to have it just fail and throw an exception whilst parsing a valid XML file.

Yes, I think so, too. The library should provide a way to skip it so that the exception is not thrown in some way. Currently, FormatException is always thrown when an external DTD is present.

This would require modifying the library.

dreasgrech commented 2 years ago

Can you suggest a temporary workaround modification I can make to the XmlParser.cs file to somehow skip parsing the DOCTYPE altogether please? Maybe a change to the method TryParseDocType()?

Something like SkipUntil((byte)'>', data, ref i); if the xml file is using an external DTD.

It would be fantastic if this can somehow be done because it would still make this library usable for me.

dreasgrech commented 2 years ago

I wrote this temporary workaround which should be enough until an official fix is released.

This is my new code that reads the xml file:

bool tryExternalDocType = false;
XmlObject xml = null;
try
{
    XmlParser.containsExternalDoctype = false;
    xml = XmlParser.ParseFile(filePath);
}
catch (FormatException)
{
    tryExternalDocType = true;
}

if (tryExternalDocType)
{
    XmlParser.containsExternalDoctype = true;
    xml = XmlParser.ParseFile(filePath);
}

Added a public field in XmlParser.cs and using it in the TryParseDocType method.

public static bool containsExternalDocType;

private static bool TryParseDocType(RawString data, ref int i, bool hasNode, OptionalNodeList optional, ref RawStringTable entities) 
{
    .
    .
    .

    if(SkipEmpty(data, ref i) == false) { throw NewFormatException(); }

    /** New code: *********/
    if (containsExternalDocType)
    {
        SkipUntil((byte)'>', data, ref i);
        return true;
    }
    /***********************/

    var nameStart = i;
    .
    .
    .
}
ikorin24 commented 2 years ago

That code is not good and cannot be added to the library as is, even temporarily.

However, I will provide some alternative implementation. I'm working hard on it now.

dreasgrech commented 2 years ago

It was not intended to be added to the library but as a temporary workaround for my needs as I can now handle both files with inline doctypes but also skip the external doctypes, without having to change the entire library. It currently seems to be producing the results I was looking for.

Looking forward to the alternative implementation.