Closed cooperaj closed 6 years ago
Do you have WP spec for their export format? Also, some real example of exported comments will help as well
It looks like Wordpress can export comments in different formats as XML or CSV. Suppose the db structure is here.
If we find real examples of exported comments I can try to take this one, looks like similar to Disqus.Import
yeah, should be doable and not too hard. In addition to the new Importer implementation, you will need some minor change in rest.Migrator.importCtrl
@cooperaj Could you give an example of WP exported comments?
Oh sure. Do you have somewhere I can put them? Or, at least somewhere I can send a link to them? A comment export is essentially a full export of your wordpress instance and I don't really want that up on github :)
On Fri, 20 Jul 2018 at 01:39 Anton Kosourov notifications@github.com wrote:
@cooperaj https://github.com/cooperaj Could you give an example of WP exported comments?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/umputun/remark/issues/156#issuecomment-406454906, or mute the thread https://github.com/notifications/unsubscribe-auth/AAYbUu5dAhw9NhAh_lsmXSu8MVPsYQLdks5uISa5gaJpZM4VLBFX .
@cooperaj I think you can send email with comments to @umputun and to me (because I'm going to solve it). You can find email addresses on github user's page
@akosourov I've emailed an export of my blog to both your addresses :)
@cooperaj yeah, see thanks @umputun May I implement this ticket?
@akosourov sure
@cooperaj due to #182 merged remark has WP importer. Try to import all comments to remark and leave feed back, please. Also check the instruction about importing from WP, and if it is not so obvious please open PR to update it or simple say what is wrong
👍 LGTM
With the caveat that I used the same export that you developed against so may not get to see any hidden issues.
Though as you can see on the last comment on that page the markdown processing hasn't quite worked.
Oh, and having scrolled through a lot more I can see that it's also stripped paragraphs, though I don't know if thats down to the contents of the export or the import process.
EDIT The export file contains newlines just fine so it looks like the import is what removed them.
The export file contains newlines just fine so it looks like the import is what removed them.
@akosourov you do it here https://github.com/umputun/remark/blob/master/backend/app/migrator/wordpress.go#L151
I think you just duplicated logic I have for Disqus but it can be not applicable for WP files. The reason why I had this cleanText
as a part of Disqus (you promoted it to top-level func in order to hit it from WP importer) is crazy output Disqus generating. Sometimes it has a bunch of \n and bunch of \t with no recognizable pattern.
If WP doesn't suffer from the same issue such cleanup not needed.
@umputun
My goal was to make common logic for converting WP and Disqus so cleanText
were shared between them. Yeah, I see for WP it is not suitable (I suppose) and I can return it back. Or clean only multiple \n
like \n\n\n
What about markdown converting? Maybe make more common logic for "text to html" converting and put it in DataService.Create I see preview and create use the same code. And Importer could use it
The part about markdown converting is not clear to me. Is WP even has comments in markdown? If all we need is dealing with \n properly the cleanText removal (or replacing multiple \n\n\n... to a single \n) should do it.
Regarding changing DataService.Create
to apply md parsing - I don't think it will be the right place. DataService.Create wraps engine.Create and deals with additional validation and sanitizing, but generally doesn't alter the content. If we change text by running md convertor it may have some unexpected side-effects, like each backup/restore producing changed text. In addition, I consider md parsing as something above DataService level.
@umputun Yeah, I've found out WP supports markdown in comments if it is enabled.
I consider md parsing as something above DataService level
Maybe add DataService.AlterText
function (method) and this procedure could alter content like:
So WP importer could use same logic as remark use when saves comments.
I've sent original comment from WP exported file on remark demo page to see how remark renders same input. This is here. It looks like remark also trims empty rows (I mean \n\n
to \n
). In my opinion it could be better to have an opportunity to put at least one empty row. It may improve user experience to have such indent
@akosourov the problem I see with this approach is way too many unusual responsibilities added to DataService
. Currently, markdown to html is a part of rest create & preview controllers, img links to proxy lives in proxy and so on. Moving all of this to DataService
doesn't feel right. Yeah, technically all of this some kind of "text formatting" but I'd like to keep such formatting as close to consumer (caller) as possible.
Theoretically, a nicer approach may be to import WP indirectly, i.e. pass to WP constructor enough info to allow it submitting comments in the usual way, via rest calls. This won't require much work and refactoring, however, it won't work as of now, because rest calls authentication middleware extracting user info from the token. Even if we make token per user we will run into other issues, like create controller assigning timestamps to comments and so on.
I don't have a good idea how to deal with this yet and not even sure if we really want/need a generic solution here. If we really have to have the same processing path for any text input (rest or import), we may need a new abstraction, something like CommentFormatter
and pass it to rest, wp imported and any other place we like.
Anyway, let me think about it and maybe I'll play around this idea on code level to see how it goes.
@akosourov I've added CommentFormater
(see PR #186). It does md processing, link shortening and accepts user-defined CommentConverter
. This way some external functionality can be passed as needed. See in main.go how it used for image proxy.
I think all you need is to add this thing to your wordpress importer and it should do. Let me know if any confusions.
@umputun I've changed wp importer, so it uses CommentFormater
.
Also, when debugging, I've noticed that CommentFormater
or maybe Sanitize
adds new line character to the end of content. I mean content like text
converts to <p>text</p>\n
. Don't see that it is a problem. And also logic like multiple \n\n\n
to single \n
could be placed in predefined CommentFormatter
.
Also I want ask should WP importer (or any like Disqus importer) give an opportunity to update comments if they are already exists. Because now importers removes all comments for site and starts to add new. So if we add update functionality users can run importers multiple time and don't worry about removing possible new comments that were added after
content like text converts to
text
\n
This \n
added by md (blackfriday). Not even sure why it does this, maybe a side effect of some of the flags we pass. Never bothered enough by this to investigate ;)
logic like multiple \n\n\n to single \n
Confused, where do you see this logic now? If you pass \n\n
to create (rest controller) it won't replace those \n\n
to \n
but will add and extra <p>
update comments if they are already exists
I don't think this is needed. Import is an initial step anyway, like restore from the backup. Adding such upsert support will complicate the procedure internally (matching can be tricky for auto-gen IDs) and will make importing significantly slower for some engines.
@cooperaj You can retry to export comments from wp to remark. Due to #192 merged we have MD support for import procedure. Also any content from import procedure should be rendered in common way. Be aware that import procedure removes all your existing comments
@akosourov what is the reason for html.UnescapeString(text)
? This one is a little bit confusing to me as it runs after MD formatting but before Sanitize
. Did you have some use case/issue this unescape supposed to resolve?
what is the reason for html.UnescapeString(text)?
I've added it because wp export file contains encoded charaters. It looks like wp encodes such characters like « ’ " –
and doesn't unescape them in export file.
Here list of encoded characters that I found in example export file:
'
"
&
«
’
–
»
Real example exitsts in test here
ok, I guess it won't hurt because Sanitize
will escape all dangerous elements anyway.
Mmm, the markdown appears to have worked but I'm still seeing the truncation of newlines. My export file might contain
<wp:comment_content><![CDATA[Looks like http://releases.rancher.com/os/latest is no longer hosted - installs using this `base-url` are failing.
I switched to Github with success:
\```
set base-url https://github.com/rancher/os/releases/download/v1.1.1-rc1
\```
Thanks for the article!]]></wp:comment_content>
But the imported comment looks like
Looks like http://releases.rancher.com/os/latest is no longer hosted - installs using this base-url are failing.I switched to Github with success:set base-url https://github.com/rancher/os/releases/download/v1.1.1-rc1Thanks for the article!
See how the new paragraphs are removed and there are now spaces at all in the sentences?
Note: I've escaped the backticks in the export snippet. Just for display purposes here.
@cooperaj this is odd. I see the test for very similar case https://github.com/umputun/remark/blob/master/backend/app/migrator/wordpress_test.go#L83 using this data
@akosourov any idea? You can try to add this particular comment, but I bet it will pass. If this is the case you need some changes inTestMigrator_ImportFromWP
to test the content (with MD formatting), not just a count.
@akosourov - I have added the test
@cooperaj - can't reproduce the issue. Have a new end-to-end test importing from the comment you provided and got properly formatted record
<p>Looks like <a href="http://releases.rancher.com/os/latest" rel="nofollow">http://releases.rancher.com/os/latest</a> is no longer hosted - installs using this <code>base-url</code> are failing.</p>\n\n<p>I switched to Github with success:</p>\n\n<pre><code>set base-url https://github.com/rancher/os/releases/download/v1.1.1-rc1\n</code></pre>\n\n<p>Thanks for the article!</p>\n
@umputun Yes, this is odd for me too. Perhaps my test passed because md contains backticks that conflicts with go string literals " ` " so I used to combine string using +. I mean this https://github.com/umputun/remark/blob/master/backend/app/migrator/wordpress_test.go#L306 Maybe it adds additional \n
I will try to run all and see from ui (a little bit later)
@akosourov In my test, I have enforced backticks by strings.Replace and this is the comment
I'm not sure whats going on then. I just removed my DB file and ran the import on a fresh database and I'm still seeing the paragraphless version. But I am seeing the markdown.
I'll rebuild my docker container and see if theres anything I missed but it was built yesterday from a fresh pull to my knowledge.
Things I did:
The import brought the comments across with markdown but without paragraphs as can be seen here
@cooperaj - thx for the detailed report. The good news - I was able to reproduce the issue by doing a complete import. Still have no idea what the root cause, but this is an easy part :) hopefully, fill figure it soon.
yay!
This was a tricky one. The reason why none of our tests were able to catch it is very ... refreshing :)
There was no error in the code, and it worked just fine. The issue is the way import-wordpress.sh
script submitted the data - instead of -d
it just needs --data-binary
.
@akosourov - I know you copied this script from import-disqus.sh
and modified a little bit. However, for Disqus import EOL (or lack of them) inside of a comment changes nothing as they export rendered html snippet.
@umputun oh, I didn't know the difference between curl --data ...
and curl --data-binary ...
Just thought that -d
puts data in body and doesn't do any processing with input data.
Thanks a lot, it's really helpful experience for me
@cooperaj - if you fine with the latest fixes feel free to close this ticket.
All looks great to me. Sorry for taking so long to investigate the fix, been super busy :)
I'm looking to move my existing comment from wordpress to discuss. I could export from wordpress to discuss and then from discuss to remark but it'd be nice to cut out the middle man :)