Improve import capabilities

drlippman commented 10 years ago

A set of proposed enhancements to the import capabilities.

1) Rewrites some reserved-word IDs, like "toc", in imported content, to prevent conflict with navigation items.

2) Adds an extracted page title to the content list for EPUB imports, replacing the manifest ID. This makes it easier to tell which content is which.

3) Adds a "part" content type selector to the import for IMSCC, and selects any modules as "parts" by default. When importing, any chapters following a part will get added to that part. This should allow IMSCC imports to retain the module structure as a part structure. Also removes the preformatted var dump on module title content, since that isn't very useful.

4) Adds citation and licensing options to the content selection page of the import, applying those selections to all imported content. This will greatly simplify applying appropriate citation and licensing when importing whole books.

There's still a big TO-DO around WXR (WordPress export) imports, since that's currently our only way to move content from one site to another. At the moment, part structure, citations, and licensing are all lost when importing a WXR file.

drlippman commented 10 years ago

I just updated the pull request to add:

5) Displays the WordPress import in table-of-contents order. Adds a "part" content type selector to the import, so parts can be re-imported as parts and all the subsequent chapters get added to that part, allowing a full export/import that retains structure. Also updates the Candela citation and license fields from the WXR file if present.

billfitzgerald commented 10 years ago

Hello, David. First off, thank you for this work. We are in the process of updating the test site for review of the updated courses, and getting ready for the fall rollout. We'll review the pull requests once this is in place, and go from there. A couple of quick notes (based on a quick read through the descriptions and the code):

1) There is some minor overlap between what you have done here and what we have already done, especially related to handling parts. There are also some WP/PB oddities with Parts that we have been careful to work around with imports and exports.
2) I'm not entirely sure what problem you are trying to solve, or why the approaches you took make the most sense, both short term and long term. Documenting the use case in more detail, and then documenting why a specific approach made the most sense, would help clarify the problems that are getting attention here.
3) From a process standpoint, rather than a single monolithic pull request, this should be multiple smaller, focused pull requests. Smaller pull requests, paired with solid documentation as described above, allow us to review each request individually, as opposed to in one large, relatively undifferentiated block.

drlippman commented 10 years ago

As for 1), is the work you've already done available somewhere? I had assumed this repo had the current version of everything. I certainly don't want to duplicate work you've already done.

As for 3), since many of the changes I made were interconnected, splitting them all into smaller pull requests would be challenging. I did try to structure the pull request as a number of individual commits, each adding one specific piece of the puzzle.

As for problems I was trying to address:

1) There have been a number of cases where content we've imported has had an HTML element with id="toc". Because this is also the ID of one of the elements of the pop-out table of contents, having the element in the content broke the TOC from displaying correctly. One solution would be to change the ID of the TOC element to something more unique, but as a simpler temporary fix, this change looks for some already-used IDs when importing content, and prefixes them so they won't conflict.

2) When importing EPUBs, the "Title" column previously displayed only the page ID from the manifest file, which was usually obscure and not useful. For example, in the OpenStax books, it would often look like "m1235152823". This fix extracts the title from the HTML file, and displays that instead. This is a simple fix, and should be fine in the long term.

3) Previously, when importing IMSCC files, which have an explicit Module structure, all structure was lost on import. This results in a large amount of work: removing the "chapters" that are nothing more than module titles, creating new Parts to house the modules, and rearranging the chapters into those modules. This fix attempts to address that by allowing the Modules from the IMSCC to be imported as parts, and for the chapters that follow a Part to be assigned to that part. This is about the best that can probably be done in terms of squishing IMSCC-structured content into the Pressbooks Part/Chapter format. The approach I used to identify Modules is a bit hacky (it would be better to walk the XML DOM tree), but I didn't want to rewrite all the xpath-based code that had already been written.

4) When importing an entire book (like an OpenStax EPUB, a Lardbucket text that we've processed into an IMSCC file, etc.), having to then go add citations and license to every single page of the book individually is really tedious. This fix addresses that by allowing the user to provide citations and/or license when importing, and have those selections apply to every page imported. This seems to me like the most sensible way to get a license onto a bunch of imported content at the same time.

5) Previously, when importing a WordPress XML file that was an export of a Pressbooks book, none of the parts were imported, and all the citations and license were lost. Since the export/import is (I'm guessing) the only way we'll have to move content from one instance to another, it would be nice if the import could be as close to the original as possible. This fix adds the ability to import parts, and imports the citation and license. The approach used mimics that for IMSCC: on the content selection page, display all the content in the order it appears in the table of contents. Each item can be designated as a chapter or part (or front-matter or back-matter), and its exported type will be selected by default. There may be a better way to do this, perhaps by removing the "type" selection entirely when importing a Pressbooks export, but this is at least a short-term fix.
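
For readers who want a concrete picture of items 1), 2), and 3)/5) above, here is a minimal sketch in Python. It is not the code from this pull request (the plugin itself is PHP), and every name in it is hypothetical: the `import-` prefix, the reserved-ID set (only "toc" is named in this thread), the choice to read the page title from the `<title>` element, and the handling of chapters that appear before any part are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

RESERVED_IDS = {"toc"}   # assumption: only "toc" is named in the thread
ID_PREFIX = "import-"    # hypothetical prefix

# Match id="..." and href="#..." but not attributes such as data-id="...".
ID_ATTR = re.compile(r"""(?<![\w-])id=(["'])(.*?)\1""")
FRAGMENT_LINK = re.compile(r"""(?<![\w-])href=(["'])#(.*?)\1""")


def prefix_reserved_ids(html):
    """Item 1: prefix reserved IDs, and same-page links that target them."""
    def rename(value):
        return ID_PREFIX + value if value in RESERVED_IDS else value

    html = ID_ATTR.sub(lambda m: f"id={m[1]}{rename(m[2])}{m[1]}", html)
    html = FRAGMENT_LINK.sub(lambda m: f"href={m[1]}#{rename(m[2])}{m[1]}", html)
    return html


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html, manifest_id):
    """Item 2: use the page's own title, falling back to the manifest ID."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return title or manifest_id


def group_into_parts(items):
    """Items 3 and 5: attach each chapter to the most recent preceding part.

    `items` is a list of (type, title) pairs in table-of-contents order.
    Front-matter and back-matter are ignored in this sketch.
    """
    parts = []
    current = None
    for kind, title in items:
        if kind == "part":
            current = {"title": title, "chapters": []}
            parts.append(current)
        elif kind == "chapter":
            if current is None:
                # Assumption: chapters before any part go into an untitled part.
                current = {"title": None, "chapters": []}
                parts.append(current)
            current["chapters"].append(title)
    return parts
```

The grouping step is a single pass over the ordered list, which is why displaying the content in table-of-contents order matters: a chapter's part is simply whichever part was seen most recently.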

drlippman commented 10 years ago

Closing this, as I'm going to try to submit some of these in smaller chunks, as requested.

lumenlearning / candela

Improve import capabilities #10