jackjamieson2 / yarns-microsub-server

Yarns is a Microsub server that runs as a plugin on your WordPress site.
GNU General Public License v2.0
20 stars 4 forks source link

Santize HTML content #74

Closed jackjamieson2 closed 5 years ago

jackjamieson2 commented 5 years ago

See https://github.com/aaronpk/Monocle/issues/39

Yarns should sanitize HTML content when saving it.

I will look into Aperture and copy its sanitization

dshanske commented 5 years ago

Is this better on the parsing side?

jackjamieson2 commented 5 years ago

@dshanske, I was literally in the middle of typing a comment to ask you that 😁

I think it would make sense on the parsing side assuming the sanitization needs of Post Kinds etc. don't conflict with Yarns

dshanske commented 5 years ago

I do limitedly sanitize

jackjamieson2 commented 5 years ago

In this case (aaronpk/Monocle#39), a style attribute got through that makes a div overflow Monocle's content container. Apparently, Aperture removes style attributes, which seems sensible to me.

dshanske commented 5 years ago

Aperture uses xray, I thought, which uses HTML purifier, which is probably better than kses in WordPress

dshanske commented 5 years ago

I looked into this last night. Option 1 is to use HTML purifier and aaronpk's filter. Second option is to use kses, which is built into wordpress. I just have to be very specific on the parameters as you can't by attribute value in the built in version.

dshanske commented 5 years ago

At the least, I will sanitize the individual pieces, as opposed to the entire input going forward

dshanske commented 5 years ago

Sanitizing inline styles would remove the icon work I did prior

dshanske commented 5 years ago

https://github.com/dshanske/parse-this/commit/0da51c1fe536bee85a602b28ac44d57292c7ef64#diff-F81F3DFF36BF04B7A070D642C858E883

dshanske commented 5 years ago

Starting on a new filter for content and tweaking it

jackjamieson2 commented 5 years ago

Looking good for the most part! A couple remaining issues:

Feeds from arxiv.org

e.g. http://arxiv.org/rss/cs.HC

The dc:creator field contains a list of authors, with links to each of their individual pages.

<dc:creator>
<a href="http://arxiv.org/find/cs/1/au:+Anton_S/0/1/0/all/0/1">Simon Duque Anton</a>, <a href="http://arxiv.org/find/cs/1/au:+Fraunholz_D/0/1/0/all/0/1">Daniel Fraunholz</a>, <a href="http://arxiv.org/find/cs/1/au:+Lipps_C/0/1/0/all/0/1">Christoph Lipps</a>, <a href="http://arxiv.org/find/cs/1/au:+Alam_K/0/1/0/all/0/1">Khurshid Alam</a>, <a href="http://arxiv.org/find/cs/1/au:+Schotten_H/0/1/0/all/0/1">Hans Dieter Schotten</a>
</dc:creator>

When translated to jf2, this leads to a single h-card, with a name property covering each author and their links>

"author": {
        "type": "card",
        "name": "<a href=\"http://arxiv.org/find/cs/1/au:+Kucherenko_T/0/1/0/all/0/1\">Taras Kucherenko</a>, <a href=\"http://arxiv.org/find/cs/1/au:+Hasegawa_D/0/1/0/all/0/1\">Dai Hasegawa</a>, <a href=\"http://arxiv.org/find/cs/1/au:+Henter_G/0/1/0/all/0/1\">Gustav Eje Henter</a>, <a href=\"http://arxiv.org/find/cs/1/au:+Kaneko_N/0/1/0/all/0/1\">Naoshi Kaneko</a>, <a href=\"http://arxiv.org/find/cs/1/au:+Kjellstrom_H/0/1/0/all/0/1\">Hedvig Kjellstr&#xf6;m</a>"
    },

As I understand, the name property should be plain text. Tested with Monocle and Indigenous iOS and they both display the HTML:

Screen Shot 2019-05-30 at 2 21 43 PM

The ideal solution would be to translate this author list into multiple h-cards. Failing that, the HTML tags should be stripped from author['name']

\" HTML entity in post name

Screen Shot 2019-05-30 at 2 45 59 PM
{
    "type": "entry",
    "name": "&quot;I met you at a house party in Montreal&quot;",
    "author": {
        "type": "card",
        "name": "Peter Rukavina",
        "email": "peter@rukavina.net (Peter Rukavina)"
    },
    "publication": "Peter Rukavina&#039;s Weblog",
    "summary": "I\u2019ve had Montreal, by Port Cities (recorded | live | remix) on rotation for weeks now.\n\nIt is a lovely song, with strong lyrics. But, more than that, I think it\u2019s a sort of universal song, inasmuch as the possibility of meeting someone at a house party in Montreal is within the atlas of possibilities for many Canadians: it just seems like the kind of thing we might all have in our past.\n\nI recall hearing a story, perhaps apocryphal, about the late Marc Gallant: he was living in an apartment in Montreal, and about to be evicted. So he invited everyone he knew to a house party and asked everyone to take something from the apartment home with them, to prevent the apartment\u2019s contents from being seized by the landlord.\n\nJust over a decade ago we saw Bruce Guthro in concert at Harmony House in Hunter River; Bruce brought his then-17-year-old son Dylan up on stage with him during that concert, and it was obvious that he\u2019d inherited the family musicality. Ten years on he\u2019s one of the trio that is Port Cities.",
    "published": "2019-05-29T13:27:06+00:00",
    "url": "https://ruk.ca/content/i-met-you-house-party-montreal",
    "uid": "https://ruk.ca/content/i-met-you-house-party-montreal",
    "post_type": "article",
    "date": "2019-05-29T13:27:06+00:00",
    "_id": 11686,
    "_is_read": true,
    "_yarns_microsub_post_read": ""
}

There are " entities in the 'name' field, which are rendered in Monocle and Indigenous iOS

jackjamieson2 commented 5 years ago

Upgraded to the newest Parse-This as of today: https://github.com/dshanske/parse-this/commit/5e7d6af2eaf07f10e494a64a9cfc95df754d1621

http://export.arxiv.org/rss/cs.HC

jackjamieson2 commented 5 years ago

In the 'name' property, quotation marks are parsed as &quot;

e.g.

{
    "type": "entry",
    "name": "&quot;Our weird puritanical crap about pain and pain medication&quot;",
    "author": {
        "type": "card",
        "name": "Peter Rukavina",
        "email": "peter@rukavina.net (Peter Rukavina)"
    },
    "publication": "Peter Rukavina&#039;s Weblog",
    "summary": "Tim Carmody, on Kottke.org, writes about pain medication; in part:\n\n\nThis is all to say: no, I\u2019m not on pain medication. Yes, I\u2019m terribly uncomfortable. No, I\u2019m not uncomfortable enough to jump through hoops and beg for more drugs. (Maybe if I were, things would be different.) And at the times I was most uncomfortable, those were the times when medicine was the least available to me, by design.\n\nWe\u2019ve got to get over our weird Puritanical crap about pain and pain medication, and accept the fact that in certain contexts, we need the drugs. And by \u201cwe,\u201d I mean myself, the medical system; everybody. We can\u2019t be responsible for the entire opioid epidemic every second of every day. Sometimes we just need to be able to go to sleep.\n\n\nI\u2019ve learned a lot about this domain in the last 5 years. The backlash against opioids, while understandable given the toll that they\u2019ve taken, has also served to cast an unhelpful shadow over pain meds of any type. For those in chronic pain, carrying the extra weight of that shadow is the last thing they need.",
    "_source": {
        "url": "https://ruk.ca/rss/feedburner.xml",
        "name": "Peter Rukavina's Weblog"
    },
    "published": "2019-06-10T17:18:06+00:00",
    "url": "https://ruk.ca/content/our-weird-puritanical-crap-about-pain-and-pain-medication",
    "uid": "https://ruk.ca/content/our-weird-puritanical-crap-about-pain-and-pain-medication",
    "post_type": "article",
    "date": "2019-06-10T17:18:06+00:00",
    "_id": 12515,
    "_is_read": false
}
dshanske commented 5 years ago

In my version of the archiv.org feed, there are multiple authors found, and therefore author is a multi-array. However, this is not currently covered by the Microsub spec. I added multi-author support in Parse This based on you frequently citing feeds that had multiple authors.

Looking at the feed you provide for Peter Rukavina, I probably should validate the email address actually being an email address. As for quot within the name, that is actually what the RSS feed is doing.

jackjamieson2 commented 5 years ago

Closing this, since the main validations issues are resolved. The question of handling multi-author posts remains, but I think it makes sense to pursue this in the microsub spec, with Parse-This's implementation as a good model