hallowelt / migrate-confluence

Tool to migrate content from Confluence export files into a MediaWiki compatible import source
GNU General Public License v3.0

mraw file has user content but converted .wiki file says "No context page id found" #113

Closed: revansx closed this issue 4 months ago

revansx commented 5 months ago

Hello,

I'm converting various Confluence XML exports to MediaWiki sites using this tool. Everything runs really well from a command-line procedure point of view, including the MW import scripts. However, the clients are telling me that none of the text they wrote in the Confluence pages exists in the migrated MediaWiki sites.

Admittedly, I have not had any direct insight into the Confluence sites. I'm just working with the XML.ZIP files they are giving me from the export process.

From the converted MediaWiki page I can obtain the mraw file reference (i.e. 123456789; very helpful, thank you!), and from that I can find the related files generated by the migration tool:

find . -name "123456789*"
./workspace/content/raw/123456789.mraw
./workspace/content/raw/123456789.mprep
./workspace/content/wikitext/123456789.wiki

The contents of, say, 123456789.wiki are a perfect match with the wiki edit source of the corresponding MW page, so I'm confident that the importDump.php script is running without issue. However, looking at the corresponding 123456789.mraw file I see plenty of user-written content: H1s and various paragraphs of text. Yet when I look at the corresponding 123456789.wiki file, it simply says: <-- No context page id found -->

The .mraw files all seem to be correct, so the problem appears to be in the "convert" step, where the .wiki files are generated.

Please help!


revansx commented 5 months ago

Here's an example of what I see in the .mraw file:

<html><body>h1. XYZ Systems Division - Code 123 !123 Logo.jpg|thumbnail,height=200!

h1.

!MSD 123 - 2012.jpg|border=1!

h1.

h3. *Mission Statement:*

h3. Our Division is a multi-discipline organization that collaborates on all XYZ missions, from concept through development, test and flight.&nbsp; We provide xyz systems-centric hardware, services and focused technologies. We will partner with you to deliver an exceptional product when you need it at a competitive cost.

h3. *Vision Statement:*

h3. Our teamwork, talent and technology transform visionary ideas into awe-inspiring discoveries.

\\  !Slide1.jpg|border=1!

This is the home of the Foo Division - Code 123 space.

To help you on your way, we've inserted some of our favorite macros on this home page. As you start creating pages, blogging and commenting you'll see the macros below fill up with all the activity in your space.
{section}
{column:width=60%}
{recently-updated}
{column}
{column:width=5%}
{column}
{column:width=35%}

h6. Navigate space
{pagetreesearch}
{pagetree}
{column}
{section} &nbsp; </body></html>

and here is the entire content of the converted .wiki file:

<-- No context page id found -->

I would totally understand if the macros didn't convert, but I'm surprised that the h1s, h3s, and the vision and mission text do not. This makes me think that something is not working right in my migration setup.

osnard commented 5 months ago

Thanks for reporting. This is indeed very odd.

<-- No context page id found -->

comes from https://github.com/hallowelt/migrate-confluence/blob/a8f07a30b746d3944fee3ba1fef92c4bc70377ef/src/Converter/ConfluenceConverter.php#L157-L160

which indicates that either there is no entry in body-contents-to-pages-map.php, or for some reason $bodyContentId = $this->getBodyContentIdFromFilename(); already returned something odd.
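
For orientation, here is a minimal sketch of that lookup, reconstructed from the names mentioned above. This is not the actual ConfluenceConverter code; the map path and the filename-derived ID are simplified assumptions.

<?php
// Minimal sketch, not the actual tool code.
// The map file returns array( bodyContentId => pageId ).
$map = require 'workspace/body-contents-to-pages-map.php'; // path assumed

// Normally derived from the workspace filename, e.g. "123456789.mraw" => 123456789.
$bodyContentId = 123456789;

// A lookup miss yields -1, which is what triggers the placeholder output.
$pageId = $map[$bodyContentId] ?? -1;
if ( $pageId === -1 ) {
    echo '<-- No context page id found -->';
}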

osnard commented 5 months ago

Can you maybe log the value of $bodyContentId during the "convert" step and look into body-contents-to-pages-map.php?

revansx commented 5 months ago

Hi Robert. Thanks for responding.

I'm still a PHP noob. Is there an easy way to do the logging that is implied, or should I attempt to print the value of $bodyContentId to the screen with a print statement between lines 157 and 158?

revansx commented 5 months ago

The body-contents-to-pages-map.php contents look right (to me):

<?php

return array (
  235176309 => 234947396,
  235176311 => 234947398,
  235176314 => 234947405,
  235176316 => 234947408,
  235176318 => 234947410,
  235176320 => 234947412,
  235176322 => 234947414,
  235176324 => 234947416,
  302351140 => 302319443,
  330630030 => 330565856,
  330630032 => 330565858,
  330630031 => 330565857,
  330630016 => 330565834,
  330630015 => 330565833,
  330630003 => 330565816,
  330630002 => 330565815,
  330630005 => 330565818,
  330630004 => 330565817,
  330629999 => 330565812,
  330629998 => 330565811,
  330630001 => 330565814,
  330630000 => 330565813,
  330630010 => 330565824,
  etc...
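
To sanity-check a single ID against that map, a throwaway script along these lines would do. (The map path is an assumption; adjust it to your workspace. This helper is not part of migrate-confluence.)

<?php
// check-map.php -- throwaway helper, not part of the tool.
// Usage: php check-map.php 235176309
$map = require 'workspace/body-contents-to-pages-map.php'; // path assumed
$id  = (int)( $argv[1] ?? 0 );
echo isset( $map[$id] )
    ? "$id maps to page {$map[$id]}\n"
    : "no entry for $id in the map\n";
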
osnard commented 5 months ago

[...] should I attempt to print the value of $bodyContentId to the screen with some print statement between lines 157 and 158?

Yes, if possible, you could modify the code like this:

$pageId = $this->getPageIdFromBodyContentId( $bodyContentId );
if ( $pageId === -1 ) {
    return '<-- No context page id found -->' . var_export( $bodyContentId, true );
}

This will write the value of $bodyContentId directly into the converted file.

revansx commented 5 months ago

Do I need to delete anything in the workspace folder before re-running the Converter script with that mod?

revansx commented 4 months ago

Whelp.. with 4 hours of help from Hallowelt (thank you, Hallowelt!), it seems clear that the problem was that the pandoc install path was not in my current session PATH. When the tool called pandoc, it failed silently, simply not returning any converted content. The solution was to add pandoc to my local PATH using:

export PATH="/:$PATH"

And a quick verification step of:

pandoc --version

To prove that it works in the current session.

If Hallowelt were interested in making the tool more "fool" proof, I might risk suggesting that the tool complain a little if pandoc doesn't respond. But as for me, I'll never make this mistake again, ha ha.
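
A guard along those lines could be as small as this. (A hypothetical sketch of the suggestion above, not code from the tool.)

<?php
// Hypothetical pre-flight check, as suggested above -- not part of migrate-confluence.
exec( 'pandoc --version', $output, $exitCode );
if ( $exitCode !== 0 ) {
    // Fail loudly instead of silently producing empty .wiki files.
    fwrite( STDERR, "pandoc not found in PATH; conversion output would be empty\n" );
    exit( 1 );
}
echo "pandoc OK: {$output[0]}\n";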

Oh, well. All's well that ends well. Cheers!