improved getHTML (Trac #253)

sleemanj commented 19 years ago

I have been developing a plugin for htmlarea, and I'm considering porting it to Xinha, since the development of htmlarea has stalled. While working on this plugin (http://www.kyberfabrikken.dk/opensource/indite/) I also had to make a few patches/hacks to htmlarea, so I know its insides pretty well. I don't currently have the time to join Xinha development, but I have a few suggestions for improvements. For one thing, I had to pull out well-formed XHTML content in a more efficient way. Using JavaScript to traverse the DOM and build the markup is quite CPU-hungry. The solution I came up with was to use an arsenal of regexes to correct the HTML output into XHTML. The speed improvement is considerable (more than a factor of 1000). I also made a function for correcting the indentation of source code. These functions could quite easily be integrated into Xinha by substituting HTMLArea.getHTML() / HTMLArea.getHTMLWrapper().

The source in question can be found by downloading the Indite plugin. Use the file xml/XML_Utility.js and the functions XML_Utility.cleanHTML() and XML_Utility.indent(), which take the "raw" markup from editor.getInnerHTML(). Works with Mozilla and IE.

_Reported by troelskn, migrated from http://trac.xinha.org/ticket/253_

sleemanj commented 19 years ago

Attachment: XML_Utility.js

sleemanj commented 19 years ago

wymsy commented:

A nice piece of work, however I have found a couple of problems. Testing in Firefox (haven't got to IE yet), regexp 00 not only lowercases tag and attribute names, it also lowercases all text content. Similarly, regexp 02 also puts quotes around text following an = sign in the text content. The program needs to be modified to process only tags and not the rest of the page contents.

Also, I had to modify the indent function to replace \n's with a space instead of removing them, otherwise words were running together.

sleemanj commented 19 years ago

wymsy commented:

I've got it working quite well now - changed the way I was calling the RE's so only tags are processed. I need to test it some more, especially look at how paths are handled by the different browsers, but if all goes well I'll post my code soon.
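The change described here can be sketched roughly as follows. This is only an illustration of the idea, not wymsy's actual code: the helper names `lowerAll` and `lowerTagsOnly` are made up, and only regexp 00 from the original cache is used.

```javascript
// Regexp 00 as posted in XML_Utility.RegExpCache.
var lowercaseNames = /[< ]+([^= ]+)/gi;

// Hypothetical helper: run the regexp over the whole string (the problematic way).
function lowerAll(html) {
  return html.replace(lowercaseNames, function (m) { return m.toLowerCase(); });
}

// Hypothetical helper: first isolate each tag, then run the regexp inside it only.
function lowerTagsOnly(html) {
  return html.replace(/<[^>]*>/g, function (tag) {
    return tag.replace(lowercaseNames, function (m) { return m.toLowerCase(); });
  });
}

var sample = '<P CLASS=intro>Hello World</P>';
var wholeString = lowerAll(sample);      // text content "World" gets lowercased too
var tagsOnly = lowerTagsOnly(sample);    // text content is left alone
```

Note that attribute values containing spaces are still inside the tag, so this restriction alone does not protect them.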

Indent, which is a completely separate function, works beautifully in both browsers. I made a couple of tweaks, and I'll post that shortly also.

sleemanj commented 19 years ago

wymsy commented:

I've been testing this for over a week now, and it was looking good until I tried it with a Flash Movie plugin I have been working on. The plugin inserts an object tag, in the usual way for Flash movies. It works well in HTMLArea 3 and in unmodified Xinha. The problem is that with the modified code below, the object and param tags tend to disappear, leaving only the embed tag! It happens after I load the editor with a page already containing the object tags. The editor loads correctly, which I verified with an alert at the end of initIframe, but then after anything that invokes innerHTML the object tag is gone.

I haven't been able to prove whether this is a problem in Xinha or in IE, or possibly just my copy of IE, so I am posting the code in case someone else would like to try it.


HTMLArea.getHTMLWrapper = function(root, outputRoot, editor, indent) {
  var html = "";
//  if(!indent) indent = '';
  switch (root.nodeType) {
    case 10:// Node.DOCUMENT_TYPE_NODE
    case 6: // Node.ENTITY_NODE
    case 12:// Node.NOTATION_NODE
      // this all are for the document type, probably not necessary
      break;

    case 2: // Node.ATTRIBUTE_NODE
      // Never get here, this has to be handled in the ELEMENT case because
      // of IE crapness requring that some attributes are grabbed directly from
      // the attribute (nodeValue doesn't return correct values), see
      //http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&selm=3porgu4mc4ofcoa1uqkf7u8kvv064kjjb4%404ax.com
      // for information
      break;

    case 4: // Node.CDATA_SECTION_NODE
      // Mozilla seems to convert CDATA into a comment when going into wysiwyg mode,
      //  don't know about IE
      html += (HTMLArea.is_ie ? ('\n' + indent) : '') + '<![CDATA[' + root.data + ']]>' ;
      break;

    case 5: // Node.ENTITY_REFERENCE_NODE
      html += '&' + root.nodeValue + ';';
      break;

    case 7: // Node.PROCESSING_INSTRUCTION_NODE
      // PI's don't seem to survive going into the wysiwyg mode, (at least in moz)
      // so this is purely academic
      html += (HTMLArea.is_ie ? ('\n' + indent) : '') + '<?' + root.target + ' ' + root.data + ' ?>';
      break;

      case 1: // Node.ELEMENT_NODE
      case 11: // Node.DOCUMENT_FRAGMENT_NODE
      case 9: // Node.DOCUMENT_NODE
      {
    var closed;
    var i;
    var root_tag = (root.nodeType == 1) ? root.tagName.toLowerCase() : '';
 //   if (root_tag == 'br' && !root.nextSibling)
 //     break;
    if (outputRoot)
      outputRoot = !(editor.config.htmlRemoveTags && editor.config.htmlRemoveTags.test(root_tag));

    if (outputRoot) {
      closed = (!(root.hasChildNodes() || HTMLArea.needsClosingTag(root)));
      html += "<" + root.tagName.toLowerCase();
      var attrs = root.attributes;
      for (i = 0; i < attrs.length; ++i) {
        var a = attrs.item(i);
        if (!a.specified && !(root.tagName.toLowerCase().match(/input|option/) && a.nodeName == 'value')) {
          continue;
        }
        var name = a.nodeName.toLowerCase();
        var value;
        if (name != "style") {
          if (typeof root[a.nodeName] != "undefined" && name != "href" && name != "src" && !/^on/.test(name)) {
            value = root[a.nodeName];
          } else {
            value = a.nodeValue;
            // IE seems not willing to return the original values - it converts to absolute
            // links using a.nodeValue, a.value, a.stringValue, root.getAttribute("href")
            // So we have to strip the baseurl manually :-/
            if (HTMLArea.is_ie && (name == "href" || name == "src")) {
              value = editor.stripBaseURL(value);
            }
          }
        } else { // IE fails to put style in attributes list
          // FIXME: cssText reported by IE is UPPERCASE
          value = root.style.cssText;
        }
        html += " " + name + '="' + HTMLArea.htmlEncode(value) + '"';
      }
      if (html != "") {
        html += closed ? " />" : ">";
      }
  }
      html += editor.getInnerHTML().replace(/<[^>]*>/gi, function($1){return XML_Utility.cleanHTML($1,false)});
      if (outputRoot && !closed) {
          html += "</" + root.tagName.toLowerCase() + ">";
      }
      html = XML_Utility.indent(html);
  }
    break;
      case 3: // Node.TEXT_NODE
    html = /^script|style$/i.test(root.parentNode.tagName) ? root.data : HTMLArea.htmlEncode(root.data);
    break;

      case 8: // Node.COMMENT_NODE
    html = "<!--" + root.data + "-->";
    break;
  }
  return html;
};
/*****************************************/
XML_Utility = {};

XML_Utility.RegExpCache = [
/*00*/ // new RegExp().compile(/[< ]+([^= ]+)/gi),//lowercase tags/attribute names DOESN'T WORK!!! lowercases content also!!
/*00*/  new RegExp().compile(/[< ]+([^= ]+)/gi),//lowercase tags/attribute names DOESN'T WORK!!! lowercases content also!!
/*01*/  new RegExp().compile(/(\S*\s*=\s*)?_moz[^=>]*(=\s*[^>]*)?/gi),//strip _moz attributes
/*02*/  new RegExp().compile(/\s*=\s*(['"])?(([^>" ]| (?=[^"=]+['"]))+)\1?/gi),//add attribute quotes
/*03*/  new RegExp().compile(/\/>/g),//strip singlet terminators
/*04*/  new RegExp().compile(/<(br|hr|img|input|link|meta|param|embed)([^>]*)>/g),//terminate singlet tags
/*05*/  new RegExp().compile(/(checked|compact|declare|defer|disabled|ismap|multiple|no(href|resize|shade|wrap)|readonly|selected)/gi),//expand singlet attributes
/*06*/  new RegExp().compile(/(="[^']*)'([^'"]*")/),//check quote nesting
/*07*/  new RegExp().compile(/&(?=[^<]*>)/g),//expand query ampersands
/*08*/  new RegExp().compile(/<\s+/g),//strip tagstart whitespace
/*09*/  new RegExp().compile(/\s+(\/)?>/g),//trim whitespace
/*10*/  new RegExp().compile(/\s{2,}/g),//trim extra whitespace
/*11*/  new RegExp().compile(/&\w*;/g),
/*12*/  new RegExp().compile(/^<body>\s*/gi),
/*13*/  new RegExp().compile(/\s*<\/body>/gi),
/*14*/  new RegExp().compile(/<\/?(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object|br|hr|img|embed|param)[^>]*>/g),
/*15*/  new RegExp().compile(/<\/(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object)( [^>]*)?>/g),//blocklevel closing tag
/*16*/  new RegExp().compile(/<(div|p|h[1-6]|table|tr|td|th|ul|ol|li|blockquote|object)( [^>]*)?>/g),//blocklevel opening tag
/*17*/  new RegExp().compile(/<(br|hr|img|embed|param)[^>]*>/g)//singlet tag
];

/** 
  * Cleans HTML into wellformed xhtml
  *
  * A much faster way of retrieving the html-source of the document than the default supplied by HtmlArea
  * mishoo should feel free to copy this to the main distribution
  * credits goes to adios, who helped me out with this one :
  * http://www.sitepoint.com/forums/showthread.php?t=201052
  */
XML_Utility.cleanHTML = function(sHtml, bReplaceEntities) {
        var c = XML_Utility.RegExpCache;

        sHtml = sHtml.
                replace(c[0], function($1) { return $1.toLowerCase(); } ).//lowercase tags/attribute names
                replace(c[1], ' ').//strip _moz attributes
                replace(c[2], '="$2"').//add attribute quotes
                replace(c[3], '>').//strip singlet terminators
                replace(c[9], '$1>').//trim whitespace
                replace(c[4], '<$1$2 />').//terminate singlet tags
                replace(c[5], '$1="$1"').//expand singlet attributes
                replace(c[6], '$1$2').//check quote nesting
                replace(c[7], '&amp;').//expand query ampersands
                replace(c[8], '<').//strip tagstart whitespace
                replace(c[10], ' ');//trim extra whitespace
        if ((typeof(bReplaceEntities) == "boolean") ? bReplaceEntities : true) { // fix entities ? default = yes
                return XML_Utility.replaceEntities(sHtml);
        }
        return sHtml;
};

/**
  * Prettifies html by inserting linebreaks before tags, and indenting blocklevel tags
  *
  * @todo    linebreaks are not preserved in preformatted tags, which will likely cause trouble.
  *          some unmotivated extra whitespace ends up at the end of lines. not really a problem, but
  *          annoying nonetheless.
  */
XML_Utility.indent = function(s, sindentChar) {
        XML_Utility.__nindent = 0;
        XML_Utility.__sindent = "";
        XML_Utility.__sindentChar = (typeof sindentChar == "undefined") ? "  " : sindentChar;
        var c = XML_Utility.RegExpCache;
        s = s.replace(/[\n\r]/gi, " ").replace(/\s+/gi," ").replace(c[14], function($1) {
                        if ($1.match(c[16])) {
                                var s = "\n" + XML_Utility.__sindent + $1;
                                // blocklevel openingtag - increase indent
                                XML_Utility.__sindent += XML_Utility.__sindentChar;
                                ++XML_Utility.__nindent;
                                return s;
                        } else if ($1.match(c[15])) {
                                // blocklevel closingtag - decrease indent
                                --XML_Utility.__nindent;
                                XML_Utility.__sindent = "";
                                for (var i=XML_Utility.__nindent;i>0;--i) {
                                        XML_Utility.__sindent += XML_Utility.__sindentChar;
                                }
                                return "\n" + XML_Utility.__sindent + $1;
                        } else if ($1.match(c[17])) {
                                // singlet tag
                                return "\n" + XML_Utility.__sindent + $1;
                        }
                        return $1; // this won't actually happen
                        });
        if (s.charAt(0) == "\n") {
                return s.substring(1, s.length);
        }
        return s;
};

It looks to me like a problem in IE in innerHTML, but I have not found any references to any known bugs like this. So it might just be something messed up in my PC. I did reinstall IE, with no effect. I can't find anything in Xinha to explain it, either. If anyone can reproduce the problem, or fail to, I'd be interested to hear.

sleemanj commented 19 years ago

anonymous commented:

This is one of the problems that ticket 287 documents and fixes, at least for older versions of XINHA

sleemanj commented 19 years ago

wymsy commented:

It's actually not quite the same. Ticket 287 fixes the problem of embed tags being lost when constructing the html from the DOM. What I'm seeing is the object and param tags being lost when using innerHTML.

Upon further testing, I am finding that I have the same problem in Xinha (version 193) with the regular getHTMLWrapper function, modified slightly with the essence of the 287 fix. So it has nothing to do with the code in this ticket, and it still may turn out to be some weirdness in my PC.

sleemanj commented 19 years ago

mharrisonline commented:

Wow, it works great! When I tested the example above in IE6 it improved the code for Flash and made it easier to read, and preserved the code for scripting and noscript.

The only problem I saw was that when you are in full HTML mode the body tag becomes:

sleemanj commented 19 years ago

mharrisonline commented:

One other possible problem: I had previously noticed that, with the current download, if I replaced HTMLArea.getHTMLWrapper with the one I had submitted in ticket 287, the fix in ticket 127 no longer made HTMLArea.htmlEncode work. I was able to add the fixes in 287 to the current download's HTMLArea.getHTMLWrapper, and then the fix in ticket 127 was again able to convert symbols to HTML entities.

The same thing happens with this code: it makes the souped-up HTMLArea.htmlEncode in ticket 127 unable to capture symbols from the CharacterMap plugin and convert them to entities. Some part of this modification probably needs to be updated to allow HTMLArea.htmlEncode to work properly again (at least when 127 is applied).

Except for that, I definitely like this better than what I submitted in 287.

sleemanj commented 19 years ago

wymsy commented:

The code submitted by troels_kn at the beginning of this ticket includes an encoding utility similar to ticket 127's which, for simplicity's sake, I did not include in my tests (yet). But the hooks are there, just change the second parameter passed to cleanHTML to true and copy the replaceEntities utility into xinha.js.

sleemanj commented 19 years ago

mharrisonline commented:

Hmmm, I tried what you described, but can't get it to keep the symbols as entities. I noticed another problem, you get a javascript error (line 4241, character 9) when you try to undo.

sleemanj commented 19 years ago

mharrisonline commented:

To be more exact, undo works with the code posted above on Jun 9, but if you go to Full Page mode, undo no longer works, and the body node becomes .

sleemanj commented 19 years ago

wymsy commented:

Ah, well, I just took a closer look at the two encoding functions, and they are not at all the same. The one in this ticket just translates named entities to the numeric representation, where the one in ticket 127 translates the actual character into the named entity. So to preserve symbols we need ticket 127. The encoding function in this ticket doesn't add anything particularly useful.
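To make the difference between the two directions concrete, here is a minimal sketch. It assumes a deliberately tiny, hypothetical lookup table and made-up function names; neither function is the real replaceEntities from this ticket or the htmlEncode from ticket 127.

```javascript
// Hypothetical three-entry table; the real tables are far larger.
var NAMED_TO_CODE = { nbsp: 160, copy: 169, eacute: 233 };

// Direction described for this ticket's utility: named entity -> numeric entity.
function namedToNumeric(s) {
  return s.replace(/&(\w+);/g, function (m, name) {
    return Object.prototype.hasOwnProperty.call(NAMED_TO_CODE, name)
      ? '&#' + NAMED_TO_CODE[name] + ';'
      : m; // unknown names are left untouched
  });
}

// Direction described for ticket 127: literal character -> named entity.
function charToNamed(s) {
  var codeToName = {};
  for (var name in NAMED_TO_CODE) {
    codeToName[NAMED_TO_CODE[name]] = name;
  }
  return s.replace(/[\u00A0-\uFFFF]/g, function (ch) {
    var found = codeToName[ch.charCodeAt(0)];
    return found ? '&' + found + ';' : ch;
  });
}

var numeric = namedToNumeric('caf&eacute; &copy; &amp;');
var named = charToNamed('caf\u00e9 \u00a9 x');
```

Only the second direction touches a symbol that was typed or inserted as a literal character, which is why it is the one that preserves symbols from a character map.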

sleemanj commented 19 years ago

mharrisonline commented:

Do you think this could be made to work in Full Page Mode too?

sleemanj commented 19 years ago

wymsy commented:

One more regexp to strip out the contenteditable=true attribute would be a good place to start. I don't know if that would make undo work, but it's possible.
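One possible shape for such a regexp is sketched below. The function name `stripContentEditable` is hypothetical and this is not the expression that later shipped in the plugin; it simply removes the attribute whether its value is double-quoted, single-quoted, or unquoted.

```javascript
// Hypothetical helper: remove a contenteditable attribute from a single tag string.
function stripContentEditable(tag) {
  return tag.replace(/\s+contenteditable\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/gi, '');
}

var quoted = stripContentEditable('<body contenteditable="true" bgcolor="#fff">');
var unquoted = stripContentEditable('<BODY contentEditable=true>');
```

Because the leading whitespace is consumed together with the attribute, the other attributes on the tag are left as they were.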

sleemanj commented 19 years ago

mharrisonline commented:

I'm feeling pretty regex challenged. I've been trying this for days, and no matter what I do, either contenteditable="true" reappears, or I get the message that I messed up the DOM.

The current HTMLArea.getHTMLWrapper catches contenteditable with:
  if (/(_moz)|(contenteditable)|(_msh)/.test(name)) {
          // avoid certain attributes
          continue;
        }
but even if I restore those lines it still keeps happening.

sleemanj commented 19 years ago

anonymous commented:

Great work on this guys, it's looking very promising!

sleemanj commented 19 years ago

mharrisonline commented:

Does anybody have a clue how to make this work in full page mode?

sleemanj commented 19 years ago

mharrisonline commented:

Whoops! It looks like this works fine (except for contenteditable in the body). Undo in Full Page mode is completely broken in Xinha, period; it has nothing to do with this at all.

sleemanj commented 19 years ago

mharrisonline commented:

...and it does bypass HTMLArea.htmlEncode, preventing the fix in ticket 127 from converting characters to HTML entities.

sleemanj commented 19 years ago

@sleemanj commented:

I'm going to bump this to version 2.0, I'm not keen on making such a large modification to core functionality just now.

@sleemanj changed milestone from Version 1.0 to 2.0

sleemanj commented 19 years ago

mharrisonline commented:

I did figure out how to make this work with ticket 127, after all. This new code already takes care of the simple replacements for < and >, etc., so I altered the HTMLArea.htmlEncode in ticket 127 by removing all the original regex expressions and leaving just the latin, greek, math, etc. I then used the HTMLArea.htmlEncode function on the final output, which normally would have turned the < and > symbols in the HTML into entities.

sleemanj commented 19 years ago

mharrisonline commented:

Wymsy's modification above to preserve Flash code works great for that purpose, and I had noticed that it also doesn't empty the Script node like the normal HTMLArea.getHTMLWrapper function does.

However, after testing this to see how it handles JavaScript in the code, I've found that because it isn't preserving formatting in script nodes, it causes unterminated string errors, etc. So, as it is right now, it isn't something that can be used with content that contains scripting. Also, in a case-sensitive LINUX environment it can be problematic when more than just tags are being converted to lowercase.

sleemanj commented 19 years ago

wymsy commented:

The formatting is done in the indent() function, separate from cleaning the tags. You might try commenting that line (html = XML_Utility.indent(html);) out to see if scripts work better that way. I'm looking at changing the indent function to not strip line breaks inside script and pre tags. I'll report back when I get that working.
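One way to keep line breaks inside script and pre blocks is to set those blocks aside before collapsing whitespace and put them back afterwards. The sketch below illustrates that idea only; the function name and the placeholder scheme are assumptions, not the code that ended up in the plugin.

```javascript
// Hypothetical helper: collapse whitespace everywhere except inside <pre> and <script>.
function collapseWhitespaceOutsidePre(html) {
  var saved = [];
  // Swap each protected block for a NUL-delimited index (NUL is not matched by \s).
  var masked = html.replace(/<(pre|script)\b[^>]*>[\s\S]*?<\/\1>/gi, function (block) {
    saved.push(block);
    return '\u0000' + (saved.length - 1) + '\u0000';
  });
  masked = masked.replace(/\s+/g, ' ');
  // Restore the protected blocks unchanged.
  return masked.replace(/\u0000(\d+)\u0000/g, function (m, i) {
    return saved[Number(i)];
  });
}

var collapsed = collapseWhitespaceOutsidePre('a\n\nb <pre>x\n  y</pre>\n c');
```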

sleemanj commented 19 years ago

wymsy commented:

I've done some more work on this, and I now have a version that I think takes care of all the issues noted above. To make it easy for others to try, I have packaged it as a plugin and attached it to this ticket.

Plugin features:

• Much faster than HTMLArea.getHTML

• Eliminates the hacks to accommodate browser quirks

• Returns correct code for Flash objects and scripts

• Formats html in an indented, readable format in html mode

• Preserves script and pre formatting

• Removes contenteditable from body tag in full-page mode

• Does not require stripBaseURL()

• Includes the expanded htmlEncode() function from ticket 127

It works well in my application, which does not use full-page and does not require stripBaseURL(). However, the limited testing I have done in those areas leads me to believe that stripBaseURL() is not needed ever, and the special requirements of full-page are handled.

I encourage others to try the plugin. If no other unsurmountable issues are uncovered, eventually this could be integrated into the core htmlarea.js

wymsy changed component from Xinha Core to Plugin_Other

sleemanj commented 19 years ago

GetHtml plugin Attachment: GetHtml.zip

sleemanj commented 19 years ago

niko commented:

looks nice! It creates valid XHTML-code!! amazing :D and much simpler than the old getHTML!

And it is really nice as a plugin - so we can include it into xinha and people can test it - without dumping the tested and working old getHTML-functions.

a few things i noticed: (using Firefox)

the expand singlet attributes is buggy, try the following html-code:
<select>
  <option value="1">asdf"</option>
  <option value="1" selected="selected">
</select> 
cleanHTML will make that out of it:
<select><option value="1">asdf"</option><option value="1"="selected="selected"=" selected="selected"="selected="selected"">asdf</option></select> 
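The doubling can be reproduced with regexp 05 alone, since it matches the attribute name even when it already has a value and even inside that value. The sketch below shows one pass of the original expression next to one possible guarded variant; `expandGuarded` is a hypothetical name and is not the fix wymsy later made, which is not shown in this thread.

```javascript
var NAMES = 'checked|compact|declare|defer|disabled|ismap|multiple|' +
            'no(href|resize|shade|wrap)|readonly|selected';

// Regexp 05 as posted: matches the bare name anywhere in the tag.
function expandOriginal(tag) {
  return tag.replace(new RegExp('(' + NAMES + ')', 'gi'), '$1="$1"');
}

// Hypothetical guarded variant: the name must follow whitespace, must not be
// followed by "=", and must end at whitespace, "/" or ">".
function expandGuarded(tag) {
  var re = new RegExp('\\s(' + NAMES + ')(?!\\s*=)(?=[\\s/>])', 'gi');
  return tag.replace(re, ' $1="$1"');
}

var broken = expandOriginal('<option value="1" selected="selected">');
var untouched = expandGuarded('<option value="1" selected="selected">');
var expanded = expandGuarded('<input checked disabled />');
```

The guarded variant can still misfire on an attribute value that contains one of these words surrounded by spaces, so it is a narrowing of the problem rather than a complete solution.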
the stripBaseURL-function is missing (as you already pointed out) - the function IS necessary! (at least for me :D) probably you could call the stripBaseURL-function using reg-exprs such as:
<a[^>]*href="([^"]+)"[^>]*>
<img[^>]*src="([^"]+)"[^>]*>
(there are probably better ones, i'm not that good in regexpr-writing :D)
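A rough sketch of that approach follows. `stripBase` is a hypothetical stand-in for editor.stripBaseURL(), whose real behaviour is more involved, and `stripBaseInTag` is likewise a made-up name; the point is only to show a single expression capturing href and src values and handing them to a stripping function.

```javascript
// Hypothetical stand-in for editor.stripBaseURL(): drop the base if the URL starts with it.
function stripBase(url, base) {
  return url.indexOf(base) === 0 ? url.substring(base.length) : url;
}

// Hypothetical helper: apply stripBase to every double-quoted href/src value in one tag.
function stripBaseInTag(tag, base) {
  return tag.replace(/(\s(?:href|src)=")([^"]*)(")/gi, function (m, pre, url, post) {
    return pre + stripBase(url, base) + post;
  });
}

var semiAbsolute = stripBaseInTag('<a href="http://thedomain/test.html">', 'http://thedomain');
var relative = stripBaseInTag('<img alt="x" src="http://thedomain/img/a.png" />', 'http://thedomain/');
```

Whether the base ends in a slash decides if the result keeps its leading slash, so the same helper yields either a root-relative or a plain relative path.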

the expand query ampersands is buggy:
<a href="blah?param&otherparam">
gets converted into
<a href="blah?param&amp;otherparam">
this html-code
<a href="asdf" onclick="window.open('asdfadf')"">asdf</a>
gets converted into
<a href="asdf" onclick="try{if(document.designMode" && document.designmode="='on') return false;}catch(e){} window.open('asdfadf')"">asdf</a>
                                                 ^^^                           ^^^
and the last thing: imho the htmlEncode-function isn't necessary - with the right encoding all these characters should be saved correctly.

sleemanj commented 19 years ago

wymsy commented:

Niko, thanks for the feedback. Here is another version to try. Changes made:

• Fixed the regexp for expand singlet attributes.

• Added the stripBaseURL function. Now behaves the same as unmodified Xinha.

• Removed expand query ampersands. Now behaves the same as unmodified Xinha - the &amp; appears in html view, but reverts to & on output.

• Fixed the problem with onclick. This was coming from the inwardHTML and outwardHTML functions. The regexps were modifying the string and preventing outwardHTML from matching it. Fixed with a patch to outwardHTML.

• Took out the htmlEncode function. For those who feel they need it, probably best to implement it as a separate plugin.

sleemanj commented 19 years ago

GetHtml plugin (updated) Attachment: GetHtml.2.zip

sleemanj commented 19 years ago

GetHtml plugin (v3) Attachment: .2

sleemanj commented 19 years ago

GetHtml plugin (v3) Attachment: GetHtml.3.zip

sleemanj commented 19 years ago

niko commented:

thanks! almost everything is fixed :D

the onclick="window.open isn't working perfectly yet, give the Linker-Plugin a try, it will insert html-code like this:
<a onclick="window.open(this.href, 'popupwindow',  'toolbar=yes,scrollbars=yes,resizeable=yes');return false;" title="" target="popup" href="http://www.example.com/">consectetuer</a>
it gets a bit messed up into this:
<a onclick="window.open(this.href, 'popupwindow'," 'toolbar="yes,scrollbars=yes,resizeable=yes);return false;"" title="" target="popup" href="http://www.example.com/">consectetuer</a>
you use stripBaseURL only in ie (as in the original getHTML) - as Mozilla doesn't make absolute URLs out of relative. BUT, try this:
<a href="/test.html">foo</a>
it will be converted into
<a href="http://thedomain/test.html">foo</a>
when baseURL = "http://thedomain"; it will be stripped again (just like IE)

so imho you should use stripBaseURL for Mozilla too, what do you think?

sleemanj commented 19 years ago

derekcopelin~~@hotmail.com~~ commented:

Hi

Is there a way of altering this plug in so that it ignores particular tags? I previously used php to insert a style sheet external link to replicate the format used on the site in the editor and then stripped it with php on save. At the moment it is being stripped out by the plug in and I can't see exactly where to change it.

Thanks

Derek

sleemanj commented 19 years ago

wymsy commented:

Hmmm, this onclick thing could get messy. It's really more general than that, what we really need to do is isolate all onxxxx="(javascript)" event handlers and pass them through unmodified.
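One way to pass event handlers through untouched is to swap each on...="..." attribute for a placeholder, run the cleaning step, and then swap the originals back in. This is only a sketch of that idea under stated assumptions: `protectEventHandlers` and the placeholder format are hypothetical, and `clean` stands for whatever tag-cleaning function is in use. It is not the approach wymsy eventually took in v4.

```javascript
// Hypothetical helper: run clean(tag) while shielding quoted on* attributes.
function protectEventHandlers(tag, clean) {
  var saved = [];
  var masked = tag.replace(/\son\w+\s*=\s*("[^"]*"|'[^']*')/gi, function (m) {
    saved.push(m);
    return ' __evt' + (saved.length - 1) + '__';
  });
  return clean(masked).replace(/ __evt(\d+)__/g, function (m, i) {
    return saved[Number(i)];
  });
}

// Stand-in cleaner for the demo: lowercases the entire tag.
function lowerCleaner(tag) { return tag.toLowerCase(); }

var input = '<A ONCLICK="window.open(this.href, \'popupWindow\');return false;" HREF="X.html">';
var protectedTag = protectEventHandlers(input, lowerCleaner);
```

The handler, including its attribute name, comes back exactly as it went in, while the rest of the tag is cleaned.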

As for the semi-absolute links, that's getting complicated, too. I'm seeing some of the same behavior in standard xinha, but I haven't quite figured out what's going on yet.

Comments and suggestions welcome....

sleemanj commented 19 years ago

wymsy commented:

Derek, if you use the xinha_config.pageStyle or xinha_config.pageStyleSheets configuration option, the style sheet won't be visible to the plugin and won't be in the saved content, so stripping isn't a problem. (This assumes you are not using full-page mode.)
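As a rough illustration of that configuration, the snippet below sets both options. The plain object is a stand-in so the snippet runs on its own; in a real page the config object would come from the editor (typically `new HTMLArea.Config()` at that time, which is an assumption here), and the CSS and stylesheet path are made-up examples.

```javascript
// Stand-in for the real editor config object (assumption: normally new HTMLArea.Config()).
var xinha_config = {};

// Inline CSS applied to the editing area only; the rule is an example value.
xinha_config.pageStyle = 'body { font-family: Verdana, sans-serif; }';

// External stylesheets loaded into the editing area; the path is a made-up example.
xinha_config.pageStyleSheets = ['/css/site.css'];
```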

sleemanj commented 19 years ago

niko commented:

the stripBaseURL stuff is perfect now! It even works with Semi-Absolute-Links and Relative Links in both IE and FF (at least what my limited testing showed)

Mozilla fixes the links in fixRelativeLinks - so it is not needed in getHTML again.

....and it would be a killer-feature to leave php-tags alone! i hope it is possible :D

sleemanj commented 19 years ago

niko commented:

wow, these regexp's are difficult! i didn't know that it is possible to use \1 within an expression! (phps preg_match doesn't support that :( )

my suggestion for the onxxx-problem is to divide the add attribute quotes into two reg-exprs. one that fixes and (which doesn't effect the onxxx-properties)

and one reg-expr that looks for - where you don't have to use the space in [^>" ] to get the end value (which is currently the problem i think)

....and for the php-code-problem: is it enough to check for ^<\?.*\?>$ in cleanHTML? if it matches just return the tag as it is! Or does the browser then mess up the code?

...is the same possible for javascript-code within the html-code?
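The PHP check suggested above could look roughly like this. `cleanUnlessPhp` is a hypothetical wrapper, and `clean` stands for the existing per-tag cleaning function; `[\s\S]` is used instead of `.` so that a PHP block spanning several lines still matches.

```javascript
// Hypothetical wrapper: return PHP tags as they are, clean everything else.
function cleanUnlessPhp(tag, clean) {
  if (/^<\?[\s\S]*\?>$/.test(tag)) {
    return tag;
  }
  return clean(tag);
}

// Stand-in cleaner for the demo: lowercases the entire tag.
function lowerCleaner(tag) { return tag.toLowerCase(); }

var phpKept = cleanUnlessPhp('<?php echo $Title; ?>', lowerCleaner);
var htmlCleaned = cleanUnlessPhp('<P CLASS="x">', lowerCleaner);
```

This only helps if the whole PHP block reaches the function as one piece. A caller that splits on /<[^>]*>/ would cut a block such as `<?php echo $a->b; ?>` at the first ">", and the browser may already have altered the block before this code runs.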

sleemanj commented 19 years ago

wymsy commented:

Niko, did you change something to get stripBaseUrl working? I didn't...?

sleemanj commented 19 years ago

mharrisonline commented:

I noticed that when you use this in IE with the Full Page plugin, body attributes are removed. Also, this JavaScript error popped up when the editor initialized:

Line: 155 Error: Could not set the innerHTML property. Invalid target element for this operation.

this._iframe.contentWindow.document.documentElement.innerHTML = this.inwardHtml(this._textArea.value);

sleemanj commented 19 years ago

niko commented:

to get the semi-absolute-links working i added a config.baseHref = 'http://domain.com'; (note: no slash at the end! if you add a slash at the end, you will get relative links)

that's all i have changed (and this was needed for the original xinha without the getHTML-plugin too)

sleemanj commented 19 years ago

wymsy commented:

I have uploaded another revision, v4 above. Improvements since v3:

• fixed the onxxxx= problem. (modified regexps c0 and c2, and added c11)

• added code to ignore php tags.... but they still get stripped out somewhere else :(

• eliminated the javascript error in full-page mode.

• cleaned up some code for readability.

I don't know why body attributes are being stripped - it's not in this code. The other problem in full-page mode is that list items have no closing tags (</li>) in IE except for the last item in the list. The code that generated the error message was intended to fix that, but it doesn't work (obviously). It's not fatal, because browsers generally render the lists ok, but it's not right either.

Other than these issues, can anyone find any other problems?

sleemanj commented 19 years ago

GetHtml plugin (v4) Attachment: GetHtml.4.zip

sleemanj commented 19 years ago

wymsy commented:

Well, in my previous post I was half right on a couple of things. The attributes in body tags are being stripped out in FireFox before getting to this function. But in IE it was due to a subtle bug in the contenteditable regular expression, now fixed (v5 above).

Likewise, php tags are removed in FF, but not in IE. However, even in IE line breaks are lost.

sleemanj commented 19 years ago

GetHtml plugin (v5) Attachment: GetHtml.5.zip

sleemanj commented 19 years ago

wymsy commented:

Hmmm, did a little more testing and now body attributes are not getting stripped in FireFox either! I must have set something up wrong....

mharrisonline, can you confirm that it now works in full-page mode for you (in both browsers)? If so, then the only outstanding issue I know of is the problem in IE. (I don't use php, so I won't be putting a lot of effort into making that work. And it goes in the "enhancement" category anyway.)

(btw, I may have left a couple of debugging alerts in the latest version, in lines 58 and 77.)

sleemanj commented 19 years ago

wymsy commented:

I fixed the problem in IE, the right way this time, so it works in full-page mode also and doesn't rely on a mysterious hack to initIframe(). Attachment v5 above....

sleemanj commented 19 years ago

GetHtml plugin (v5) Attachment: GetHtml.6.zip

sleemanj commented 19 years ago

wymsy commented:

Oops, it's actually GetHtml.6.zip.

sleemanj commented 19 years ago

mharrisonline commented:

With tonight's XINHA download and GetHtml.6.zip used with the fullpage plugin, there is a closing li tag appearing in the head, right after the opening head tag. This only happens in IE. If there is a title, the closing li tag appears after the title.

Verbose script tags work fine, attributes are kept as well as formatting.

Noscript works fine.

Flash works fine as before, although the object parameters
<param name="_cx" value="12250" />
<param name="_cy" value="6641" />
are kind of odd. They don't seem to hurt anything though.

I didn't implement the full page plugin because I've seen too much background color abuse in online courses where the system provided easy access to changing the color. Absolutely unreadable pages with hideous dark dark brown backgrounds with black text seemed to be the most popular. So I only use the this.fullPage parameter set to true. I'm happy to report that everything I tried with the full page plugin worked the same without it and full page set to true.

This is a huge improvement over the current Xinha getHTML.

sleemanj commented 19 years ago

mharrisonline commented:

I think that a challenge for the future is to make body tags and body events not be corrupted in either browser.

Also there is the issue of allowing javascript to exist in the code. If a document.write statement exists, it will immediately write into the source code.

I handle that in my implementation by changing the word javascript to freezescript, onload to onplaceholder, and convert