karnov / htmltoword

Ruby html to word gem
MIT License
179 stars, 71 forks

Add support for external images #44

Closed fran-worley closed 6 years ago

fran-worley commented 8 years ago

Adds basic support for images

Limitations:

Other changes: In order for relationship referencing to work I have appended links with Href and images with Image. Couldn’t find a better way to reference the numbers correctly.

Beginning of fix for #27

fran-worley commented 8 years ago

Until I can find a way to save source files in '/word/media/[image-filename]' I can't add proper support for images.

External images have the obvious drawback that they require an internet connection to load etc.

I will keep trying but if anyone has any bright ideas...

fran-worley commented 8 years ago

I have now got this working with internal images.

For this to work correctly, your image source must be the full URL, not just the relative path, and you must provide the image size in pixels or ems.

Size

Size can be provided either via the style attribute:

<p><img src="http://placehold.it/250x100.png" style="width: 250px; height: 100px"></p>

or via data attributes

<p><img src="http://placehold.it/250x100.png" data-width="250px" data-height="100px"></p>

Should you provide both, data attributes take precedence.
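The precedence rule above can be sketched in plain Ruby. This is only an illustration, not the gem's actual code: `image_size` is a hypothetical helper, and it takes the image's attributes as a simple hash rather than a parsed node.

```ruby
# Hypothetical helper, not part of htmltoword: resolve an image's size from
# its attributes, letting data-width/data-height override the style attribute.
def image_size(attrs)
  style = attrs.fetch("style", "")

  # Collect "width: 250px" / "height:100px" declarations from the style string.
  # Anchoring on the start or a ";" avoids matching properties like max-width.
  from_style = style.scan(/(?:\A|;)\s*(width|height)\s*:\s*([^;]+)/i)
                    .to_h { |name, value| [name.downcase, value.strip] }

  {
    width:  attrs["data-width"]  || from_style["width"],
    height: attrs["data-height"] || from_style["height"]
  }
end

size = image_size(
  "style"       => "width: 10px; height: 20px",
  "data-width"  => "250px",
  "data-height" => "100px"
)
puts size[:width]   # the data attribute value, not the one from style
```

When neither source supplies a dimension the helper returns nil for it, so the caller can decide whether to raise or skip the image.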

Filename

The original filename of your image can either be inferred from the source:

<p><img src="http://placehold.it/250x100.png" data-width="250px" data-height="100px"></p>

would give a filename of: 250x100png

or you can provide a value via data attributes:

<p><img src="http://placehold.it/250x100.png" data-filename="what-a-lovely-image.png" data-width="250px" data-height="100px"></p>

would give a filename of: what-a-lovely-image.png

This is useful when the source url doesn't include the file extension or contains special characters.
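As a rough sketch of the fallback described above, assuming an explicit data-filename always wins and the last path segment of the source URL is used otherwise. `image_filename` is a hypothetical name, and the gem itself does this in XSLT, so its exact output may differ.

```ruby
require "uri"

# Hypothetical helper, not part of htmltoword: pick the name an image is stored under.
# An explicit data-filename wins; otherwise use the last path segment of src.
def image_filename(src, data_filename = nil)
  return data_filename unless data_filename.nil? || data_filename.empty?

  File.basename(URI.parse(src).path)
end

puts image_filename("http://placehold.it/250x100.png", "what-a-lovely-image.png")
```

Parsing the URL first keeps any query string out of the inferred name, which is one of the cases where supplying data-filename is the safer choice.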

Accessibility

For accessibility, it is recommended that you provide titles for images via the alt attribute:

<p><img src="http://placehold.it/250x100.png" alt="Fancy image description" style="height:100px; width:250px"></p>

To do:

anitsirc commented 8 years ago

Great stuff @fran-worley! I'll review it during the week. Will you work on the todo, or was it more of an informative todo?

fran-worley commented 8 years ago

I'm planning to address them with further PRs. However, I would like to discuss points 2 & 4 as they potentially impact the core of the gem. Also, I couldn't see any testing for document.rb, which seems fairly critical given that the processing happens here.

anitsirc commented 8 years ago

@fran-worley I think there should be default dimensions when they're not defined on the image, or we should ignore the image and not add it when they're missing. We shouldn't punish users who have images by generating empty or corrupted files just because they haven't defined a width and height.

fran-worley commented 8 years ago

Thanks for the feedback. I have a couple of questions...

Default image size

I agree that raising an error when no size is given is not ideal. In my opinion (happy to be wrong here...) a default image size isn't going to work, as rendering all images at one size will make your document look awful. Either we don't show images at all without a size, or we include a library like FastImage to calculate the size from the source if the user doesn't specify one.

Supporting images in links

This is more complicated as you don't appear to be able to nest relationships. I've pulled the XML that Word generates when including images in links and it doesn't use relationships or hyperlink tags at all:

  <w:p w14:paraId="0533D16D" w14:textId="77777777" w:rsidR="00E55D2E" w:rsidRPr="000B537C" w:rsidRDefault="000B537C">

    <w:pPr>
      <w:rPr>
        <w:rStyle w:val="Hyperlink"/>
      </w:rPr>
    </w:pPr>

    <w:r>
      <w:rPr>
        <w:noProof/>
        <w:lang w:eastAsia="en-US"/>
      </w:rPr>
      <w:drawing><xsl:comment>Some lovely image xml</xsl:comment></w:drawing>
    </w:r>

    <w:r>
      <w:fldChar w:fldCharType="begin"/>
    </w:r>

    <w:r>
      <w:instrText xml:space="preserve"> HYPERLINK "http://www.example.com/" </w:instrText>
    </w:r>

    <w:r>
      <w:fldChar w:fldCharType="separate"/>
    </w:r>

    <w:r w:rsidRPr="000B537C">
      <w:rPr>
        <w:rStyle w:val="Hyperlink"/>
      </w:rPr>
      <w:t>Link Text</w:t>
    </w:r>
  </w:p>

I'm not sure why there is a difference in the markup used by Word, and I can't find any documentation of this markup in the openoffice docs. If we want to support images in links then I'll need to rewrite the XML for links.

Content types

This was on my list as I didn't want to have to replace the entire file, but the files are corrupted if they don't contain the correct mimetypes.

We could open the file and inject the relevant default tags at the end of the Types tag as it doesn't appear to matter what order they are in.

Not tested yet but thinking something like this...

# Replace the current content_type code in #generate with this...
# Assumes `entry` is a rubyzip Zip::Entry. Without images the file is copied unchanged,
# so the entry is never left empty.
if entry.name == Document.content_types_xml_file
  out.write(@image_files.empty? ? entry.get_input_stream.read : inject_image_content_types(entry))
end

# Add a method to document.rb to inject the required content types into the file...
def inject_image_content_types(entry)
  doc = Nokogiri::XML(entry.get_input_stream.read)
  # Get a list of all extensions currently in the content types file.
  # The file declares a default namespace, hence the xmlns: prefix.
  existing_exts = doc.xpath("//xmlns:Default").map { |node| node["Extension"] }.compact

  # Get a list of the extensions we need for our images
  required_exts = @image_files.map { |i| i[:ext] }.uniq

  # Work out which required extensions are missing from the content types file
  missing_exts = required_exts - existing_exts

  # Inject the missing extensions into the document
  missing_exts.each do |ext|
    doc.at_css("Types").add_child("<Default Extension='#{ext}' ContentType='image/#{ext}'/>")
  end
  # Return a string so it can be written straight to the output stream
  doc.to_xml
end

@anitsirc Thoughts??

fran-worley commented 8 years ago

@anitsirc Any chance you can have a look at where I've got to. I'd love to get this merged soon...

jiek85 commented 8 years ago

Unrecognized unit of measure: .?

fran-worley commented 8 years ago

Currently you must provide a width and height in pixels or ems. You can do so either via the style or data attributes (data takes precedence should both be found).

If you don't provide a size or the size is in another unit (e.g percentage) you'll get an error.
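To illustrate why only pixels and ems are accepted, here is a minimal sketch of the kind of conversion involved. Word drawings are sized in EMUs (914,400 per inch). The 96 px per inch and 16 px per em factors below are my assumptions, not values taken from the gem, and `to_emu` is a hypothetical name.

```ruby
EMU_PER_PIXEL = 9525 # 914,400 EMU per inch / 96 px per inch (assumed resolution)
PIXELS_PER_EM = 16   # assumption: 1em is treated as 16px

# Hypothetical helper, not part of htmltoword: convert a CSS length to EMUs.
# Raises for a missing size or any unit other than px or em (e.g. a percentage).
def to_emu(size)
  match = /\A\s*(\d+(?:\.\d+)?)\s*(px|em)\s*\z/i.match(size.to_s)
  raise ArgumentError, "Unrecognized unit of measure: #{size.inspect}" unless match

  value  = match[1].to_f
  pixels = match[2].downcase == "em" ? value * PIXELS_PER_EM : value
  (pixels * EMU_PER_PIXEL).round
end

puts to_emu("250px")

begin
  to_emu("50%")
rescue ArgumentError => e
  puts e.message
end
```

A percentage has no absolute size without knowing the container, which is why it cannot be converted this way.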

tilsammans commented 8 years ago

@anitsirc any chance this can be merged? Would be lovely to have!

mojarra commented 8 years ago

@anitsirc any problems with merging this? It would be awesome

tilsammans commented 8 years ago

@karnov @nickfrandsen ping ... anyone please

stats commented 8 years ago

Hi, jumping into this conversation a little late.

On Windows 7 64-bit with Ruby 2.3.1p112, I made a simple project:

Gemfile

source 'https://rubygems.org'
gem 'htmltoword', git: 'https://github.com/fran-worley/htmltoword', branch: 'images-external'

testhtmltoword.rb

require 'htmltoword'

my_html = '<html><head></head><body><p>Hello</p></body></html>'
document = Htmltoword::Document.create(my_html)
file = Htmltoword::Document.create_and_save(my_html, 'test.docx')

Running bundle exec ruby testhtmltoword.rb results in a test.docx file that cannot be opened in Word 2013.

The error is:

We're sorry. We can't open test.docx because we found a problem with its contents.

Details: The file is corrupt and cannot be opened.

Using the default htmltoword gem (0.5.1) the same test file produces a Word document that can be opened.

I used WinMerge to look for differences, and the only difference I found was that the [Content_Types].xml file contains no data.

fran-worley commented 8 years ago

@stats can you attach your word document? I've just tried this and I can't open the document created from your html on either the base branch or the images branch.

stats commented 8 years ago

test-0.6.0.docx is your branch. test-0.5.1.docx is from the karnov/htmltoword master

For some reason your branch is not including the contents for [Content_Types].xml.

After some additional testing I think there may be a problem if no image file is included in the document. In that case you will not get any content in the [Content_Types].xml file.

test-0.6.0.docx test-0.5.1.docx

fran-worley commented 8 years ago

Good spot @stats that should now be fixed via https://github.com/karnov/htmltoword/pull/44/commits/a12c1e45d6c0c04ea084e5af2ff0b08ba3e33ec4

stats commented 8 years ago

Awesome, tested and works. Thank you very much.

fran-worley commented 8 years ago

Do note the caveats in this branch.

1) Images inside links do not render.
2) All images must have their width and height specified in pixels or ems (not %).
3) Because images have to be downloaded and saved into the document, it's not the quickest, and you'll probably want to generate your docs in a background process.

dochoaj commented 7 years ago

@karnov are you planning to merge this?

filipkis commented 7 years ago

There is a small issue with the current image naming functions in the XSLT: if the source file was original.name.png and no data-filename was provided, the file will be stored as word/media/image1.png, but in the relations file it will be referred to as:

<Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
  Target="media/image1.name.png" 
  Id="rId8"/>

In other words the transformation takes name.png as extension and not just png.
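The mismatch can be reproduced in a few lines of Ruby. This only illustrates the two ways of extracting an extension; it is not the XSLT from the branch, and both method names are hypothetical.

```ruby
# Everything after the FIRST dot, which reproduces the reported symptom.
def ext_after_first_dot(name)
  name.split(".", 2).last
end

# Everything after the LAST dot, which is the actual file extension.
def ext_after_last_dot(name)
  File.extname(name).delete_prefix(".")
end

name = "original.name.png"
puts ext_after_first_dot(name) # name.png
puts ext_after_last_dot(name)  # png
```

Whichever rule is chosen, the stored file name and the relationship Target need to use the same one.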

lukelex commented 7 years ago

Hi all,

Development on this project has been slow from our side, mostly because it already fits all of our use cases. We're willing to reignite the work here and onboard anyone who has already contributed to this project.

@fran-worley Sorry for the super late follow up on this.

fran-worley commented 7 years ago

@filipkis Good spot, I'll take a look at this and add to this PR.

@lukelex Good to hear that you're looking at taking this on further. If I can do anything to help get this branch merged, let me know.

lukelex commented 7 years ago

@fran-worley I'm not an expert in XSL so for now just let me know when you feel confident about these changes. I'll then test it with our own stuff and gladly merge it.

eviofragoso commented 6 years ago

Hello, I'm trying to get images to work in the doc.

My tag looks like this:

<img alt="" src="/ckeditor_assets/pictures/1/content_equi.jpg" style="height:334px; width:375px">

It seems to be in the proper syntax for the gem to work, but it's ignored.

kreintjes commented 6 years ago

@fran-worley @anitsirc Please see https://github.com/karnov/htmltoword/issues/71#issuecomment-398437050. The images won't show up in the Word files I create. Both the data-external images as well as the internal images simply won't show up. I can generate the Word document (don't get any errors), but the images aren't there.

Your sample code: <p><img src="http://placehold.it/250x100.png" alt="Fancy image description" style="height:100px; width:250px"></p> doesn't work either.

Could you help me with this?

jcat4 commented 4 years ago

@fran-worley @anitsirc Please see #71 (comment). The images won't show up in the Word files I create. Both the data-external images as well as the internal images simply won't show up. I can generate the Word document (don't get any errors), but the images aren't there.

Your sample code: <p><img src="http://placehold.it/250x100.png" alt="Fancy image description" style="height:100px; width:250px"></p> doesn't work either.

Could you help me with this?

Same here. I've tested back to version 0.7.0, and it's the same issue across each version. I'm not sure what would have changed to break this functionality, or when...