umputun / remark42

comment engine
https://remark42.com
MIT License
4.9k stars 381 forks source link

Add wordpress comment import. #156

Closed cooperaj closed 6 years ago

cooperaj commented 6 years ago

I'm looking to move my existing comment from wordpress to discuss. I could export from wordpress to discuss and then from discuss to remark but it'd be nice to cut out the middle man :)

umputun commented 6 years ago

Do you have WP spec for their export format? Also, some real example of exported comments will help as well

akosourov commented 6 years ago

It looks like Wordpress can export comments in different formats as XML or CSV. Suppose the db structure is here. If we find real examples of exported comments I can try to take this one, looks like similar to Disqus.Import

umputun commented 6 years ago

yeah, should be doable and not too hard. In addition to the new Importer implementation, you will need some minor change in rest.Migrator.importCtrl

akosourov commented 6 years ago

@cooperaj Could you give an example of WP exported comments?

cooperaj commented 6 years ago

Oh sure. Do you have somewhere I can put them? Or, at least somewhere I can send a link to them? A comment export is essentially a full export of your wordpress instance and I don't really want that up on github :)

On Fri, 20 Jul 2018 at 01:39 Anton Kosourov notifications@github.com wrote:

@cooperaj https://github.com/cooperaj Could you give an example of WP exported comments?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/umputun/remark/issues/156#issuecomment-406454906, or mute the thread https://github.com/notifications/unsubscribe-auth/AAYbUu5dAhw9NhAh_lsmXSu8MVPsYQLdks5uISa5gaJpZM4VLBFX .

akosourov commented 6 years ago

@cooperaj I think you can send email with comments to @umputun and to me (because I'm going to solve it). You can find email addresses on github user's page

cooperaj commented 6 years ago

@akosourov I've emailed an export of my blog to both your addresses :)

akosourov commented 6 years ago

@cooperaj yeah, see thanks @umputun May I implement this ticket?

umputun commented 6 years ago

@akosourov sure

akosourov commented 6 years ago

@cooperaj due to #182 merged remark has WP importer. Try to import all comments to remark and leave feed back, please. Also check the instruction about importing from WP, and if it is not so obvious please open PR to update it or simple say what is wrong

cooperaj commented 6 years ago

👍 LGTM

With the caveat that I used the same export that you developed against so may not get to see any hidden issues.

Some imported comments visible here

cooperaj commented 6 years ago

Though as you can see on the last comment on that page the markdown processing hasn't quite worked.

Oh, and having scrolled through a lot more I can see that it's also stripped paragraphs, though I don't know if thats down to the contents of the export or the import process.

EDIT The export file contains newlines just fine so it looks like the import is what removed them.

umputun commented 6 years ago

The export file contains newlines just fine so it looks like the import is what removed them.

@akosourov you do it here https://github.com/umputun/remark/blob/master/backend/app/migrator/wordpress.go#L151

I think you just duplicated logic I have for Disqus but it can be not applicable for WP files. The reason why I had this cleanText as a part of Disqus (you promoted it to top-level func in order to hit it from WP importer) is crazy output Disqus generating. Sometimes it has a bunch of \n and bunch of \t with no recognizable pattern.

If WP doesn't suffer from the same issue such cleanup not needed.

akosourov commented 6 years ago

@umputun My goal was to make common logic for converting WP and Disqus so cleanText were shared between them. Yeah, I see for WP it is not suitable (I suppose) and I can return it back. Or clean only multiple \n like \n\n\n

What about markdown converting? Maybe make more common logic for "text to html" converting and put it in DataService.Create I see preview and create use the same code. And Importer could use it

umputun commented 6 years ago

The part about markdown converting is not clear to me. Is WP even has comments in markdown? If all we need is dealing with \n properly the cleanText removal (or replacing multiple \n\n\n... to a single \n) should do it.

Regarding changing DataService.Create to apply md parsing - I don't think it will be the right place. DataService.Create wraps engine.Create and deals with additional validation and sanitizing, but generally doesn't alter the content. If we change text by running md convertor it may have some unexpected side-effects, like each backup/restore producing changed text. In addition, I consider md parsing as something above DataService level.

akosourov commented 6 years ago

@umputun Yeah, I've found out WP supports markdown in comments if it is enabled.

I consider md parsing as something above DataService level

Maybe add DataService.AlterText function (method) and this procedure could alter content like:

  1. Convert markdown to html
  2. Convert img links to proxy
  3. Short links
  4. Clean multiple \n\n\n to \n\n (or one only see below)

So WP importer could use same logic as remark use when saves comments.

I've sent original comment from WP exported file on remark demo page to see how remark renders same input. This is here. It looks like remark also trims empty rows (I mean \n\n to \n). In my opinion it could be better to have an opportunity to put at least one empty row. It may improve user experience to have such indent

umputun commented 6 years ago

@akosourov the problem I see with this approach is way too many unusual responsibilities added to DataService. Currently, markdown to html is a part of rest create & preview controllers, img links to proxy lives in proxy and so on. Moving all of this to DataService doesn't feel right. Yeah, technically all of this some kind of "text formatting" but I'd like to keep such formatting as close to consumer (caller) as possible.

Theoretically, a nicer approach may be to import WP indirectly, i.e. pass to WP constructor enough info to allow it submitting comments in the usual way, via rest calls. This won't require much work and refactoring, however, it won't work as of now, because rest calls authentication middleware extracting user info from the token. Even if we make token per user we will run into other issues, like create controller assigning timestamps to comments and so on.

I don't have a good idea how to deal with this yet and not even sure if we really want/need a generic solution here. If we really have to have the same processing path for any text input (rest or import), we may need a new abstraction, something like CommentFormatter and pass it to rest, wp imported and any other place we like.

Anyway, let me think about it and maybe I'll play around this idea on code level to see how it goes.

umputun commented 6 years ago

@akosourov I've added CommentFormater (see PR #186). It does md processing, link shortening and accepts user-defined CommentConverter. This way some external functionality can be passed as needed. See in main.go how it used for image proxy.

I think all you need is to add this thing to your wordpress importer and it should do. Let me know if any confusions.

akosourov commented 6 years ago

@umputun I've changed wp importer, so it uses CommentFormater. Also, when debugging, I've noticed that CommentFormater or maybe Sanitize adds new line character to the end of content. I mean content like text converts to <p>text</p>\n. Don't see that it is a problem. And also logic like multiple \n\n\n to single \n could be placed in predefined CommentFormatter.

Also I want ask should WP importer (or any like Disqus importer) give an opportunity to update comments if they are already exists. Because now importers removes all comments for site and starts to add new. So if we add update functionality users can run importers multiple time and don't worry about removing possible new comments that were added after

umputun commented 6 years ago

content like text converts to

text

\n

This \n added by md (blackfriday). Not even sure why it does this, maybe a side effect of some of the flags we pass. Never bothered enough by this to investigate ;)

logic like multiple \n\n\n to single \n

Confused, where do you see this logic now? If you pass \n\n to create (rest controller) it won't replace those \n\n to \n but will add and extra <p>

update comments if they are already exists

I don't think this is needed. Import is an initial step anyway, like restore from the backup. Adding such upsert support will complicate the procedure internally (matching can be tricky for auto-gen IDs) and will make importing significantly slower for some engines.

akosourov commented 6 years ago

@cooperaj You can retry to export comments from wp to remark. Due to #192 merged we have MD support for import procedure. Also any content from import procedure should be rendered in common way. Be aware that import procedure removes all your existing comments

umputun commented 6 years ago

@akosourov what is the reason for html.UnescapeString(text)? This one is a little bit confusing to me as it runs after MD formatting but before Sanitize. Did you have some use case/issue this unescape supposed to resolve?

akosourov commented 6 years ago

what is the reason for html.UnescapeString(text)?

I've added it because wp export file contains encoded charaters. It looks like wp encodes such characters like « ’ " – and doesn't unescape them in export file. Here list of encoded characters that I found in example export file:

&#039;
&quot;
&amp;
&laquo;
&#8217;
&#8211;
&raquo;
&nbsp;

Real example exitsts in test here

umputun commented 6 years ago

ok, I guess it won't hurt because Sanitize will escape all dangerous elements anyway.

cooperaj commented 6 years ago

Mmm, the markdown appears to have worked but I'm still seeing the truncation of newlines. My export file might contain

<wp:comment_content><![CDATA[Looks like http://releases.rancher.com/os/latest is no longer hosted - installs using this `base-url` are failing.

I switched to Github with success:

\```
set base-url https://github.com/rancher/os/releases/download/v1.1.1-rc1
\```

Thanks for the article!]]></wp:comment_content>

But the imported comment looks like

Looks like http://releases.rancher.com/os/latest is no longer hosted - installs using this base-url are failing.I switched to Github with success:set base-url https://github.com/rancher/os/releases/download/v1.1.1-rc1Thanks for the article!

See how the new paragraphs are removed and there are now spaces at all in the sentences?

Note: I've escaped the backticks in the export snippet. Just for display purposes here.

umputun commented 6 years ago

@cooperaj this is odd. I see the test for very similar case https://github.com/umputun/remark/blob/master/backend/app/migrator/wordpress_test.go#L83 using this data

@akosourov any idea? You can try to add this particular comment, but I bet it will pass. If this is the case you need some changes inTestMigrator_ImportFromWP to test the content (with MD formatting), not just a count.

umputun commented 6 years ago

@akosourov - I have added the test

@cooperaj - can't reproduce the issue. Have a new end-to-end test importing from the comment you provided and got properly formatted record

 <p>Looks like <a href="http://releases.rancher.com/os/latest" rel="nofollow">http://releases.rancher.com/os/latest</a> is no longer hosted - installs using this <code>base-url</code> are failing.</p>\n\n<p>I switched to Github with success:</p>\n\n<pre><code>set base-url https://github.com/rancher/os/releases/download/v1.1.1-rc1\n</code></pre>\n\n<p>Thanks for the article!</p>\n
akosourov commented 6 years ago

@umputun Yes, this is odd for me too. Perhaps my test passed because md contains backticks that conflicts with go string literals " ` " so I used to combine string using +. I mean this https://github.com/umputun/remark/blob/master/backend/app/migrator/wordpress_test.go#L306 Maybe it adds additional \n

I will try to run all and see from ui (a little bit later)

umputun commented 6 years ago

@akosourov In my test, I have enforced backticks by strings.Replace and this is the comment

cooperaj commented 6 years ago

I'm not sure whats going on then. I just removed my DB file and ran the import on a fresh database and I'm still seeing the paragraphless version. But I am seeing the markdown.

I'll rebuild my docker container and see if theres anything I missed but it was built yesterday from a fresh pull to my knowledge.

Update

Things I did:

The import brought the comments across with markdown but without paragraphs as can be seen here

umputun commented 6 years ago

@cooperaj - thx for the detailed report. The good news - I was able to reproduce the issue by doing a complete import. Still have no idea what the root cause, but this is an easy part :) hopefully, fill figure it soon.

umputun commented 6 years ago

yay!

mkp0j-201808-17160927-ouhsb

This was a tricky one. The reason why none of our tests were able to catch it is very ... refreshing :) There was no error in the code, and it worked just fine. The issue is the way import-wordpress.sh script submitted the data - instead of -d it just needs --data-binary.

@akosourov - I know you copied this script from import-disqus.sh and modified a little bit. However, for Disqus import EOL (or lack of them) inside of a comment changes nothing as they export rendered html snippet.

akosourov commented 6 years ago

@umputun oh, I didn't know the difference between curl --data ... and curl --data-binary ... Just thought that -d puts data in body and doesn't do any processing with input data. Thanks a lot, it's really helpful experience for me

umputun commented 6 years ago

@cooperaj - if you fine with the latest fixes feel free to close this ticket.

cooperaj commented 6 years ago

All looks great to me. Sorry for taking so long to investigate the fix, been super busy :)