htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

Tidy Hangs in PPrintText when outputting a specific html file to xml #1034

Open qwyt opened 2 years ago

qwyt commented 2 years ago

Tidy seems to get stuck when parsing this website: https://www.vanticatrading.com/post/germany-crypto-held-for-more-than-a-year-will-not-get-taxed and 'tidySaveBuffer' never finishes.

I tried pausing the process a bunch of times, and it always ends up somewhere in:

    ...
    PPrintText pprint.c:1371
    prvTidyPPrintXMLTree pprint.c:2585
    tidyDocSaveStream tidylib.c:2328
    tidyDocSaveBuffer tidylib.c:1391
    tidySaveBuffer tidylib.c:1257
    main main.cpp:40
    start 0x00000001008dd088

Made a repro project here: https://github.com/qwyt/tidy-html-custom-tags-bug.

To Reproduce:

  1. Set up the project using the included CMakeLists file.
  2. Run it (main.cpp)
  3. Observe that tidy never exits (I didn't try waiting for more than a few minutes).

Not sure what's causing this. Will try to investigate it at some later point unless someone has any ideas.

qwyt commented 2 years ago

Does not happen when 'TidyXmlOut' is False.

qwyt commented 2 years ago

Something seems to be wrong in this loop (pprint.c:2696):

        do {
            TidyParserMemory memory = TY_(popMemory)(doc);
            Node* close_node = memory.original_node;
            Bool mixed = memory.register_1;
            indent = memory.register_2;

            if ( !mixed && close_node->content )
                PCondFlushLineSmart( doc, indent );

            PPrintEndTag( doc, mode, indent, close_node );
            /* PCondFlushLine( doc, indent ); */

            node = memory.reentry_node;
        } while ( node == NULL && TY_(isEmptyParserStack)(doc) == no );

doc->stack.top never drops to 0. It just goes to 35, 33, 32, 31, 30 and then jumps back to 35, forever.
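
For anyone trying to reason about that loop, here is a minimal, self-contained sketch of the same general pattern: an iterative tree walk that pushes a record per open element and pops records to emit end tags. It is not Tidy's code. `Node`, `Memory`, `Stack`, `print_tree`, `demo_closed_count` and the `max_steps` budget are all hypothetical names made up for illustration, and the sketch makes no claim about what actually causes the hang in Tidy. It only shows that on a well-formed tree the stack depth returns to 0, and that a step budget can turn a walk that never terminates into a detectable error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; these are NOT Tidy's types. */
typedef struct Node {
    const char  *name;
    struct Node *content;   /* first child, or NULL */
    struct Node *next;      /* next sibling, or NULL */
} Node;

typedef struct {
    Node *original_node;    /* element whose end tag is still pending */
    Node *reentry_node;     /* where to resume once it has been closed */
} Memory;

enum { MAX_DEPTH = 64 };

typedef struct {
    Memory items[MAX_DEPTH];
    int    top;             /* number of pending records */
} Stack;

static void push(Stack *s, Memory m)
{
    assert(s->top < MAX_DEPTH);
    s->items[s->top++] = m;
}

static Memory pop(Stack *s)
{
    assert(s->top > 0);
    return s->items[--s->top];
}

/* Walks the tree without recursion. Returns the number of end tags
 * emitted, or -1 if the step budget is exhausted or the stack would
 * overflow, which is how a walk that never terminates shows up here. */
static long print_tree(Node *root, long max_steps, int verbose)
{
    Stack stack = { .top = 0 };
    Node *node = root;
    long steps = 0, closed = 0;

    while (node != NULL) {
        if (++steps > max_steps || stack.top == MAX_DEPTH)
            return -1;
        if (verbose)
            printf("%*s<%s>\n", stack.top * 2, "", node->name);
        push(&stack, (Memory){ node, node->next });
        node = node->content;

        /* Unwind: close elements until there is somewhere to resume. */
        while (node == NULL && stack.top > 0) {
            if (++steps > max_steps)
                return -1;
            Memory m = pop(&stack);
            if (verbose)
                printf("%*s</%s>\n", stack.top * 2, "", m.original_node->name);
            closed++;
            node = m.reentry_node;
        }
    }
    return closed;
}

/* Builds <html><head/><body><p/></body></html> and walks it quietly. */
static long demo_closed_count(void)
{
    Node p    = { "p",    NULL,  NULL  };
    Node body = { "body", &p,    NULL  };
    Node head = { "head", NULL,  &body };
    Node html = { "html", &head, NULL  };
    return print_tree(&html, 1000, 0);
}
```

In this sketch a node whose `next` pointer leads back to itself keeps the depth bouncing between 0 and 1 until the budget runs out, so `print_tree` returns -1 instead of spinning. A similar temporary guard around the real loop might help narrow down which reentry node sends the walk back up to depth 35.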