Modifying website to prepare for webarchiving

ghost commented 10 years ago

from Jennie:

Problem: In 2001, a friend and I started a website called the Sugar Quill. It started off as basic HTML. Circa 2002, a friendly 15-year-old computer whiz offered to build us a content management system based on MySQL and PHP. The website also included message boards that use the free InvisionBoard software. The site has not been "active" since 2008, but I am still paying for the server hosting.

I would like the site archived at the Internet Archive, using the Wayback Machine, but the Internet Archive will not dig deep enough into the site because of the way the links are generated and the way the site navigation works. See an example here: Sugar Quill dead link example.

I would like to figure out a programmatic way to a) extract the story archive data from MySQL, b) convert it to straight HTML, and c) modify the links so that they can be crawled by the Internet Archive and I can STOP PAYING FOR THE WEBSITE!

ghost commented 10 years ago

Jennie, it looks like the dead link example is not a page that was archived and contains bad links, but a page that was not archived because the link from the referring page was bad. Is that correct? What is the IA link to the referring page?

Some possibly easier ways to accomplish this, other than a database extract and manual conversion to HTML:

- harvest the site locally with a program like wget (which can overcome the link problem?) and then put the static HTML up on the site instead of the PHP
- modify the PHP to generate different types of links in the first place

jennielevineknies commented 10 years ago

Ben, I think the option of harvesting the site locally is what I ultimately want to do. But then I also have to change how the linking happens. The URL structure on the stories is as follows:

http://www.sugarquill.net/read.php?storyid=661&chapno=1
http://www.sugarquill.net/read.php?storyid=661&chapno=2
etc.

So, if storyid 661 has a second chapter, the structure is pretty basic for the URLs.
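Because the pattern is that regular, the two parameters can be pulled out of (or put back into) any chapter URL mechanically. Here is a minimal Python sketch of that idea. The function names `parse_chapter_url` and `chapter_url` are made up for illustration and are not part of the site's code.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "http://www.sugarquill.net/read.php"


def parse_chapter_url(url):
    """Return (storyid, chapno) as integers from a read.php URL."""
    params = parse_qs(urlparse(url).query)
    return int(params["storyid"][0]), int(params["chapno"][0])


def chapter_url(storyid, chapno):
    """Build the read.php URL for a given story and chapter."""
    return BASE + "?" + urlencode({"storyid": storyid, "chapno": chapno})


print(parse_chapter_url("http://www.sugarquill.net/read.php?storyid=661&chapno=2"))
print(chapter_url(661, 3))
```

This only handles the URL shape shown above; it says nothing about whether a given story or chapter actually exists.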

However, the code driving that navigation is what gives the Internet Archive problems. The code, when translated to HTML, looks something like this:

Author: Arabella (Professors' Bookshelf) Story: Hermione, Queen of Witches, Book One Chapter: Chapter One

The Internet Archive does not understand what read.php wants you to do, or that it wants you to connect to a database, etc., to get to Chapter Two. I am not sure, but I suspect that it is a combination of the PHP and the fact that we use a pull-down for navigation. I've noticed that the Wayback Machine has a lot of trouble with the pull-down navigation at ArchivesUM also.
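I don't know exactly how the Internet Archive's crawler works, but the suspicion about the pull-down can be illustrated with a simplified model: a crawler that only follows `<a href>` links and never submits forms or runs `onclick` handlers. The Python sketch below uses the standard library's `html.parser`. The `LinkCollector` class, the `extract_links` helper, and both HTML snippets are illustrative assumptions, not the crawler's real logic.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, the way a very simple crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Navigation roughly as read.php emits it for a two-chapter story.
FORM_NAV = """
<form action="read.php" method="get">
<input type="hidden" name="storyid" value="661">
<select name="chapno" size="1">
<option value="1" SELECTED>Chapter One</option>
<option value="2">Chapter Two</option>
</select>
<input type="submit" value="Go">
<input type="button" value=">>" onclick="goNext(2)">
</form>
"""

# The same navigation written as ordinary links.
LINK_NAV = """
<a href="read.php?storyid=661&amp;chapno=1">Chapter One</a>
<a href="read.php?storyid=661&amp;chapno=2">Chapter Two</a>
"""

print(extract_links(FORM_NAV))  # no links found in the form version
print(extract_links(LINK_NAV))  # both chapter URLs found
```

Under this model the form version gives the crawler nothing to follow, while the link version exposes both chapter URLs, which would be consistent with what the Wayback Machine seems to be doing.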

I'll paste the php from read.php below. Obviously, I think if it were easier to modify the PHP, I would do that... but I'd like to understand more how it's working.

<?php
$is_safe = false;

$query = "SELECT storyid, chapno FROM sq_stories";
$result = $DB->sq_query($query);

if (mysql_num_rows($DB->d_result) > 0) {
    while ($row = mysql_fetch_object($DB->d_result)) {
        if ($row->storyid == $storyid) {
            if ($row->chapno == $chapno) {
                $is_safe = true;
            }
        }
    }
}

if (!$is_safe) {
    echo("<tr>\n");
    echo("<td class=\"info2_pane\">Invalid storyid or chapno</td>\n");
    echo("</tr>\n");
    die();
}

$query = "SELECT sq_stories.title, sq_stories.chaptitle, sq_stories.url, sq_stories.authorid, sq_authors.name, sq_authors.prof_stat FROM sq_stories, sq_authors WHERE sq_stories.storyid = $storyid AND sq_stories.chapno = $chapno AND sq_stories.authorid = sq_authors.authorid";
$result = $DB->sq_query($query);

$row = mysql_fetch_object($DB->d_result);

$storyroot = "stories";
$storyurl = $row->url;

echo("<tr>\n");

if ($row->prof_stat == "1") {
    echo("<td class=\"info2_pane\"><div align=\"right\"><b>Author</b>: <a href=\"index.php?action=profile&id=$row->authorid\">{$row->name} (Professors' Bookshelf)</a>&nbsp;&nbsp;<b>Story</b>: $row->title&nbsp;&nbsp;<b>Chapter</b>: $row->chaptitle");
} else {
    echo("<td class=\"info2_pane\"><div align=\"right\"><b>Author</b>: <a href=\"index.php?action=profile&id=$row->authorid\">$row->name</a>&nbsp;&nbsp;<b>Story</b>: $row->title&nbsp;&nbsp;<b>Chapter</b>: $row->chaptitle");
}

$query = "SELECT chapno, chaptitle FROM sq_stories WHERE storyid = $storyid";
$result = $DB->sq_query($query);

$no_chaps = mysql_num_rows($DB->d_result);

echo("<form action=\"read.php\" method=\"get\">\n");
echo("<input type=\"hidden\" name=\"storyid\" value=\"$storyid\">\n");
echo("<select name=\"chapno\" size=\"1\">\n");

while ($row = mysql_fetch_object($DB->d_result)) {
    if ($row->chapno == $chapno) {
        echo("<option value=\"$row->chapno\" SELECTED>$row->chaptitle</option>\n");
    } else {
        echo("<option value=\"$row->chapno\">$row->chaptitle</option>\n");
    }
}

echo("</select>\n");
echo("<input type=\"submit\" value=\"Go\">\n");

if (($no_chaps > 1) && ($chapno < $no_chaps)) {
    $nextChap = $chapno + 1;

    echo("<input type=\"button\" value=\">>\" onclick=\"goNext({$nextChap})\">\n");
}

echo("</form>\n");
echo("</div>\n");
echo("</td>\n");
echo("</tr>\n");

?>
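As far as I can tell from reading it, read.php does three things: it checks that the storyid/chapno pair exists, it looks up the title and author for the header line, and it lists the story's chapters to build the pull-down plus a ">>" button when a later chapter exists. Below is a rough Python restatement of the validation and navigation parts, as an aid to understanding rather than a replacement. It uses an in-memory SQLite table with made-up rows as a stand-in for the real MySQL `sq_stories` table, the `chapter_nav` name is hypothetical, and unlike read.php it passes the values as query parameters and adds an `ORDER BY`.

```python
import sqlite3

# In-memory stand-in for the sq_stories table, with made-up rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sq_stories (storyid INTEGER, chapno INTEGER, chaptitle TEXT)")
db.executemany(
    "INSERT INTO sq_stories VALUES (?, ?, ?)",
    [(661, 1, "Chapter One"), (661, 2, "Chapter Two")],
)


def chapter_nav(db, storyid, chapno):
    """Mirror read.php: validate the pair, list chapters, find the next one."""
    chapters = db.execute(
        "SELECT chapno, chaptitle FROM sq_stories WHERE storyid = ? ORDER BY chapno",
        (storyid,),
    ).fetchall()
    if chapno not in [number for number, _ in chapters]:
        return None  # read.php prints "Invalid storyid or chapno" and stops
    # read.php shows the ">>" button only when chapno is below the chapter count.
    next_chapno = chapno + 1 if chapno < len(chapters) else None
    return {"chapters": chapters, "current": chapno, "next": next_chapno}


print(chapter_nav(db, 661, 1))
print(chapter_nav(db, 661, 9))
```

The header lookup (title, chapter title, author name, and the "Professors' Bookshelf" flag) is left out here to keep the sketch short.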

jennielevineknies commented 10 years ago

Had a few minutes to do some more investigation today. All of the text of the actual stories is located in a directory on the server (stories), and the story chapters themselves are saved as HTML files named after the story ID and chapter number from the MySQL database. So, the URL for the text of the second chapter of the story referenced above would be: http://sugarquill.net/stories/661_2.html. That works. What you lose is site branding, navigation, etc. So I suppose what I was thinking was to figure out a way to pull that static HTML into the page frame, and rewrite the navigation code so that instead of depending on the PHP, it would simply be an HTML link to the next chapter. So, in other words, when you get to the first chapter, instead of querying the database, the pull-down code would look something like:



Author: Arabella (Professors' Bookshelf)  Story: Hermione, Queen of Witches, Book One  Chapter: Chapter One

However, even as I write this, I realize the complexity of this, and that I'd basically have to turn the entire site static ...
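For what it's worth, the navigation piece on its own might be fairly small if the pages were generated once by a script instead of by PHP on every request. Here is a minimal Python sketch under several assumptions: the `read_661_1.html` naming scheme, the function names, and the bare-bones page frame are all hypothetical, and the chapter body is assumed to be an HTML fragment rather than a complete document.

```python
from html import escape


def static_name(storyid, chapno):
    """Hypothetical file name for a framed static chapter page."""
    return f"read_{storyid}_{chapno}.html"


def render_nav(storyid, chapters, current):
    """Render chapter navigation as plain links instead of a form."""
    parts = []
    for chapno, title in chapters:
        label = escape(title)
        if chapno == current:
            # The current chapter is shown as text, not as a link to itself.
            parts.append(f"<b>{label}</b>")
        else:
            parts.append(f'<a href="{static_name(storyid, chapno)}">{label}</a>')
    return " | ".join(parts)


def render_page(storyid, chapters, current, body_html):
    """Wrap an existing chapter body (e.g. from stories/661_1.html) in a minimal frame."""
    nav = render_nav(storyid, chapters, current)
    return f"<html>\n<body>\n<div>{nav}</div>\n{body_html}\n</body>\n</html>\n"


chapters = [(1, "Chapter One"), (2, "Chapter Two")]
print(render_nav(661, chapters, 1))
```

The real work would be in reproducing the site branding in the frame and in generating one such page per row of the stories table, which this sketch does not attempt.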

And I'm also trying to get the HTML to show up as code and not formatted in this editor and it's not working!!

spurioso commented 10 years ago

@jennielevineknies You can have HTML show up as code in the editor by enclosing it in "fences" (three backticks or graves), like this:

<html>
<body>
</body>
</html>

And then it will come out like this:

<pre><code>
<tr>
<td class="info2_pane"><div align="right"><b>Author</b>: <a href="index.php?action=profile&id=1">Arabella (Professors' Bookshelf)</a>&nbsp;&nbsp;<b>Story</b>: Hermione, Queen of Witches, Book One&nbsp;&nbsp;<b>Chapter</b>: Chapter One<form action="read.php" method="get">
<input type="hidden" name="storyid" value="661">
<select name="chapno" size="1">
<option value="1" SELECTED><a href="/stories/661_1.html">Chapter One</a></option>
<option value="2"><a href="/stories/661_2.html">Chapter Two</a></option>
</select>
<input type="submit" value="Go">
<input type="button" value=">>" onclick="goNext(2)">
</form>
</div>
</td>
</tr>
</code></pre>

spurioso commented 10 years ago

Ben and Jennie will continue to work together on this one.

spurioso commented 10 years ago

Also, @jennielevineknies, you can spiff up the code you add to comments by adding the language being used like this:

<html>
<head>
</head>
<body>
</body>
</html>

which would result in:

<pre><code>
<tr>
<td class="info2_pane"><div align="right"><b>Author</b>: <a href="index.php?action=profile&id=1">Arabella (Professors' Bookshelf)</a>&nbsp;&nbsp;<b>Story</b>: Hermione, Queen of Witches, Book One&nbsp;&nbsp;<b>Chapter</b>: Chapter One<form action="read.php" method="get">
<input type="hidden" name="storyid" value="661">
<select name="chapno" size="1">
<option value="1" SELECTED><a href="/stories/661_1.html">Chapter One</a></option>
<option value="2"><a href="/stories/661_2.html">Chapter Two</a></option>
</select>
<input type="submit" value="Go">
<input type="button" value=">>" onclick="goNext(2)">
</form>
</div>
</td>
</tr>
</code></pre>
