veliovgroup / jazeee-meteor-spiderable

Fork of Meteor Spiderable with longer timeout, caching, better server handling
https://atmospherejs.com/jazeee/spiderable-longer-timeout
33 stars 9 forks source link

Spiderable and json-ld #42

Closed MichalW closed 7 years ago

MichalW commented 7 years ago

My script ld+json does not appear in ?_escapedfragment\= mode

<script type="application/ld+json"></script> 

Is spiderable removed all script tags?

dr-dimitru commented 7 years ago

@MichalW right, all script tags is removed.

MichalW commented 7 years ago

I think is this line: https://github.com/jazeee/jazeee-meteor-spiderable/blob/master/lib/phantom_script.js#L49

Could you change: output = output.replace(/<script[^>]+>(.|\n|\r)*?<\/script\s*>/ig, ''); to eg. output = output.replace(/<script(?![^>]+ld\+json)[^>]+>(.|\n|\r)*?<\/script\s*>/ig , '');

MichalW commented 7 years ago

Maybe it will be helpful: https://github.com/meteor/meteor/pull/5015/commits/2314e90c51403e46a8dbf1405414d1a6cf99e5cf

dr-dimitru commented 7 years ago

My thoughts - regex should match any of those examples:

<script type="text/javascript">
  function () {
    return true;
  }
</script>
<script type='text/javascript'>
  function () {
    return true;
  }
</script>
<script type="application/javascript">
  function () {
    return true;
  }
</script>
<script type="application/javascript" async defer>
  function () {
    return true;
  }
</script>
<script async defer charset="utf8">
  function () {
    return true;
  }
</script>
<script async src="http://example.com/main.js" charset="UTF-8"></script>
<script async>
  function () {
    return true;
  }
</script>
<script>
  function () {
    return true;
  }
</script>

And do not match next:

<script type="application/ld+json">
  {bar: "foo"}
</script>
<script type="application/ld+json" async defer>
  {bar: "foo"}
</script>
<script async type="application/ld+json" defer>
  {bar: "foo"}
</script>
<script type="any-other/type">
  {bar: "foo"}
</script>

I'm stuck with next regex:

/<script[^>]*(?=type\=("|')(application|text)\/javascript("|'))[^>]*>(.|\n|\r)*?<\/script\s*>/gi

Works correctly unless type attribute exists. Hope someone can finish it. Playground: https://regex101.com/r/PtlDsd/1

MichalW commented 7 years ago

Please try this:

/<script([^>](?!type))*[^\s>]?(\stype\=("|')(application|text)\/javascript("|'))?([^>](?!type))*>(.|\n|\r)*?<\/script\s*>/gi
dr-dimitru commented 7 years ago

@MichalW great solution. @jazeee will you change it or would you have a PR?

dr-dimitru commented 7 years ago

Hi @MichalW ,

Thank you for contribution, merged and published as v1.3.0

Feel free to reopen it in case if issue still persists on your end.