ruby-docx / docx

a ruby library/gem for interacting with .docx files
MIT License

Text Replacement not working as Expected #147

Open manovasanth1227 opened 9 months ago

manovasanth1227 commented 9 months ago

Describe the bug

Hi team, I want to replace text that matches a specific pattern, such as {{sample}}. When I use the existing substitute method in the TextRun class, the text is not replaced.

While debugging, I found that when the DOCX file is read, the text (specifically the w:t elements) is split across multiple nodes. For example, take the word {{vorname}}:

paragraph.each_text_run => yields each text_run object in the paragraph

For ease of explanation, I will represent the text of those text_run objects as an array: ['{{', 'v', 'orname', '}}']. Because the substitute method checks the TextRun objects one at a time, it never sees the complete placeholder, so the word is missed even though it is present in the paragraph.
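
The effect can be reproduced with plain strings, without the docx gem. This is only a minimal sketch: the `fragments` array stands in for the text of each TextRun, and `substitute_per_run` is a hypothetical helper that mimics a per-run substitution. Neither is part of the gem.

```ruby
# Plain-string stand-in for the text of each TextRun in one paragraph.
# (Assumption: these fragments mirror the split described above.)
fragments = ['{{', 'v', 'orname', '}}']

# Hypothetical helper: apply the replacement to each fragment on its own,
# the way a run-by-run substitution would.
def substitute_per_run(fragments, placeholder, replacement)
  fragments.map { |fragment| fragment.gsub(placeholder, replacement) }
end

per_run = substitute_per_run(fragments, '{{vorname}}', 'Max')
joined  = fragments.join.gsub('{{vorname}}', 'Max')

puts per_run.inspect # no single fragment contains the placeholder, so nothing changes
puts joined          # the joined paragraph text does contain it
```

No fragment contains the full placeholder, so the per-run result is identical to the input, while the joined text is replaced as expected.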

To Reproduce

Edit the content of a .docx file several times. The text eventually ends up split across multiple nodes, as described above.

Example

require 'docx'
path = "/Users/mb/Downloads/sample.docx" 
doc = Docx::Document.new(path)
doc.paragraphs.each_with_index do |paragraph, index|
  puts "paragraph: #{index} text: #{paragraph.text}"
  paragraph.each_text_run do |text_run|
    puts "text: #{text_run.text}" 
    # This fails to replace the placeholder when a run holds only part of the
    # placeholder text rather than the complete word.
    text_run.substitute('{{vorname}}', 'Not Working Fine')
  end
end

## output for the above code :
paragraph: 0 text: 24 Nov, 2023
text: 24 
text: Nov,
text:  2023
paragraph: 1 text: 
paragraph: 2 text: {{vorname}}
text: {{
text: v
text: orname
text: }}
paragraph: 3 text: 
paragraph: 4 text: {{nachname}}
text: {{
text: n
text: achname
text: }}
paragraph: 5 text: {{Stellentitel}}
text: {{
text: Stellentitel
text: }}
paragraph: 6 text: 
paragraph: 7 text: Subject: Test
text: Subject: 
text: Test
paragraph: 8 text: 
paragraph: 9 text: {{Name}}
text: {{
text: Name}}
paragraph: 10 text: {{asaso}}
text: {{
text: asas
text: o}}
paragraph: 11 text: {{Manovasanth}}
text: {{Manovasanth}}

Sample docx file

sample.docx

Expected behavior

Correctly replace the placeholder text ({{vorname}}, {{Manovasanth}}) with the given replacement text.

Environment

manovasanth1227 commented 9 months ago

I have also tried replacing the text at the paragraph level, but that changes the styling of the existing text. So replacing the placeholder text at the paragraph level is not the right approach.

doc.paragraphs.each do |paragraph|
   paragraph.text = paragraph.text.gsub('{{vorname}}', 'Not Working Fine')
end

ArisNance commented 9 months ago

Hi @manovasanth1227 is there any update or workaround you found?

manovasanth1227 commented 9 months ago

Hey @ArisNance. I have overridden the initialize method in the Paragraph class. It blanks out the text_run nodes that hold only fragments of a placeholder and writes the complete placeholder text into the first of those nodes. Please note that it relies on a regex pattern that matches any text enclosed in double curly braces, starting with "{{" and ending with "}}".

module Docx
  module Elements
    module Containers
      class Paragraph
        PLACEHOLDER_REGEX = /\{\{(.*?)\}\}/

=begin
  @param [w:body/w:p tag - Nokogiri Object] :node
  @param [Hash] :document_properties
  This method overrides the existing initialize in docx gem Paragraph class.
  We have called the validate_placeholder_content method which is responsible for
  correcting the corrupted text nodes in paragraphs.
=end
        def initialize(node, document_properties = {})
          @node = node
          @properties_tag = 'pPr'
          @document_properties = document_properties
          @font_size = @document_properties[:font_size]
          validate_placeholder_content
        end

=begin
  This method detects and repairs the corrupted nodes, if any exist.
=end
        def validate_placeholder_content
          # Starting offset of each text run within the paragraph text.
          offset = 0
          content_size = text_runs.map do |text_node|
            start = offset
            offset += text_node.text.to_s.length
            start
          end
          # Handle occurrences from right to left, so that merging one placeholder
          # leaves the run contents that the remaining offsets refer to unchanged.
          occurrences = detect_placeholder_positions.flat_map do |placeholder, positions|
            positions.map { |p_start_index| [p_start_index, placeholder] }
          end
          occurrences.sort.reverse_each do |p_start_index, placeholder|
            p_end_index = p_start_index + placeholder.length - 1
            # rindex picks the last run starting at or before the position, which skips empty runs.
            tn_start_index = content_size.rindex { |size| size <= p_start_index }
            tn_end_index = content_size.rindex { |size| size <= p_end_index }
            next if tn_start_index == tn_end_index
            # Pass offsets relative to the start of the first and last affected runs.
            replace_incorrect_placeholder_content(placeholder, tn_start_index, tn_end_index, p_start_index - content_size[tn_start_index], p_end_index - content_size[tn_end_index])
          end
        end
=begin
  This method detects the starting indexes of each placeholder and returns them in an array per placeholder.
  Ex: Assumptions : text = 'This is Placeholder Text with {{Placeholder}} {{Text}} {{Placeholder}}'
      It will detect the placeholder's starting index from the given text.
      Here, starting index of '{{Placeholder}}' => [30, 55], '{{Text}}' => [46]
  @return [Hash]
  Ex: {'{{Placeholder}}' => [30, 55], '{{Text}}' => [46]}
=end
        def detect_placeholder_positions
          text.scan(PLACEHOLDER_REGEX).flatten.uniq.each_with_object({}) do |placeholder, placeholder_hash|
            next if placeholder.include?('{') || placeholder.include?('}')
            placeholder_text = "{{#{placeholder}}}"
            current_index = text.index(placeholder_text)
            arr_of_index = [current_index]
            while !current_index.nil?
              current_index = text.index(placeholder_text, current_index + 1)
              arr_of_index << current_index unless current_index.nil? 
            end
            placeholder_hash[placeholder_text] = arr_of_index
          end
        end
=begin
  @param [String] :placeholder
  @param [Integer] :start_index, end_index, p_start_index, p_end_index
  This Method replaces below :
    1. Corrupted text nodes content with empty string
    2. Proper Placeholder content within the same text node
  Ex: Assume the text nodes contain text_runs = ['This is ', 'Placeh', 'older Text', 'with ', '{{', 'Place', 'holder}}' , '{{Text}}', '{{Placeholder}}']
    Here the first '{{Placeholder}}' is not contained in a single text node. We need to merge the content of indexes text_runs[4], text_runs[5], text_runs[6].
    So we replace the content as below:
      1. text_runs[4] = '{{Placeholder}}'
      2. text_runs[5] = ''
      3. text_runs[6] = ''
=end
        def replace_incorrect_placeholder_content(placeholder, start_index, end_index, p_start_index, p_end_index)
          (start_index..end_index).each do |index|
            if index == start_index
              current_text = text_runs[index].text.to_s
              current_text[p_start_index..-1] = placeholder
              text_runs[index].text = current_text
            elsif index == end_index
              current_text = text_runs[index].text.to_s
              current_text[0..p_end_index] = ''
              text_runs[index].text = current_text
            else
              text_runs[index].text = ''
            end
          end
        end
      end
    end
  end
end
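
For anyone who wants to check the merging idea without loading the gem, here is a rough stand-alone sketch that works on an array of strings instead of TextRun objects. `merge_split_placeholders` and its `pattern` argument are hypothetical names used only for illustration, and the sketch assumes one string per text run.

```ruby
# Hypothetical stand-alone version of the merge idea on plain strings.
# Each element of `fragments` stands in for the text of one text run.
def merge_split_placeholders(fragments, pattern = /\{\{[^{}]*\}\}/)
  merged = fragments.map(&:dup)

  # Starting offset of every fragment in the joined text.
  offsets = []
  fragments.inject(0) do |offset, fragment|
    offsets << offset
    offset + fragment.length
  end

  # Collect [start, text] for each placeholder found in the joined text.
  matches = []
  fragments.join.scan(pattern) do |text|
    matches << [Regexp.last_match.begin(0), text]
  end

  # Work right to left so an edit never disturbs text that a later step reads.
  matches.reverse_each do |start, text|
    stop  = start + text.length - 1
    first = offsets.rindex { |offset| offset <= start }
    last  = offsets.rindex { |offset| offset <= stop }
    next if first == last # already inside a single fragment

    # Keep what precedes the placeholder, then append the whole placeholder.
    merged[first] = merged[first][0...(start - offsets[first])] + text
    # Blank the fragments that only held the middle of the placeholder.
    ((first + 1)...last).each { |i| merged[i] = '' }
    # Drop the tail of the placeholder from the last fragment.
    merged[last] = merged[last][(stop - offsets[last] + 1)..-1]
  end
  merged
end

puts merge_split_placeholders(['{{', 'v', 'orname', '}}']).inspect
puts merge_split_placeholders(['Dear {{', 'Name}}, hi']).inspect
```

The joined text is the same before and after; only the boundaries between fragments move, so a later per-run substitution can see each placeholder in one piece.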

I am not sure this will work in all cases. If you have any other thoughts on this solution, please share. @satoryu, can you please help here?

ArisNance commented 9 months ago

@manovasanth1227 thanks for the quick response and solution. It's a nice workaround, and I can confirm it works for my needs. I really appreciate it, friend!

manovasanth1227 commented 8 months ago

@satoryu any update on this?

guiferrpereira commented 6 months ago

@manovasanth1227 I think you can try this instead of monkey patching it:

doc.paragraphs.each do |p|
  p.each_text_run do |tr|
    tr.substitute(tr.text, tr.text.to_s.gsub('{{vorname}}', 'Working Fine')) if tr.text =~ /\{\{vorname\}\}/i
  end
end
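
One way to tell whether a run-by-run approach like this is enough for a given document is to list the placeholders that appear in a paragraph's text but in no single run. The sketch below does that on plain strings. `split_placeholders` is a hypothetical helper, not part of the docx gem, and it does not report a placeholder that is intact in one run but split elsewhere in the same paragraph.

```ruby
# Hypothetical helper (not part of the docx gem): given the text of each run
# in a paragraph, list the placeholders that no single run contains in full.
def split_placeholders(run_texts, pattern = /\{\{[^{}]*\}\}/)
  run_texts.join.scan(pattern).uniq.reject do |placeholder|
    run_texts.any? { |text| text.include?(placeholder) }
  end
end

puts split_placeholders(['{{', 'Name}}']).inspect   # the placeholder spans two runs
puts split_placeholders(['{{Manovasanth}}']).inspect # the placeholder sits in one run
```

With the gem, the run texts could presumably be collected with something like paragraph.text_runs.map(&:text), assuming text_runs is available as used in the patch earlier in this thread.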