I'm testing the contents of a PDF generated by PDFKit. When I run Text.analyze_file([MY_PATH]).strings on the file, I get an array that holds each character of the PDF content at its own index. Spaces are stored as '' (empty string). I've been able to move forward by replacing all empty strings with a space character. However, I'm now up against content that contains newline characters. The newlines are not stored in the array, so the separation between words is lost wherever a line break occurs. Ever see this sort of behaviour? I realize that a number of factors could be screwing things up, including my own ignorance, and I'd love to find the root of the problem, but I have no time. Right now, I'd be happy with a hack to get my tests working.
Cheers!
EDIT: So I came up with a hack that'll get me through. I remove all the whitespace characters from the array (they weren't actually empty strings, as I had believed), then join the remaining characters with exactly one space and downcase the whole thing.
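def char_array_to_normalized_string(arr)
  # Drop whitespace entries, join the rest with single spaces, and lowercase.
  # reject (unlike delete_if) leaves the caller's array untouched.
  arr.reject { |s| s =~ /\s/ }.join(' ').downcase
end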
After I put my test strings through the same process, by calling char_array_to_normalized_string("Test String".scan(/./)), I'm able to match them against the output of PDF Inspector. It's not pretty, but it gets me where I need to go.
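A minimal sketch of how that comparison plays out, assuming the array looks like the one described above. The sample data and the variable names pdf_chars and expected are made up for illustration, and the helper is repeated in a non-mutating form so the snippet runs on its own:

```ruby
# Same normalization as above: drop whitespace entries, join with single
# spaces, lowercase. Uses reject so the input array is not modified.
def char_array_to_normalized_string(arr)
  arr.reject { |s| s =~ /\s/ }.join(' ').downcase
end

# Hypothetical stand-in for the per-character array; in a real test it
# would come from Text.analyze_file(path).strings. Note that there is no
# entry for the line break between "String" and "Next".
pdf_chars = ['T', 'e', 's', 't', ' ', 'S', 't', 'r', 'i', 'n', 'g',
             'N', 'e', 'x', 't']

# The text we expect, including the newline that the array lost.
expected = "Test String\nNext"

actual = char_array_to_normalized_string(pdf_chars)
wanted = char_array_to_normalized_string(expected.chars)

puts actual == wanted # prints true
```

The trade-off is that all word boundaries are thrown away on both sides, so strings that differ only in spacing (for example "ab c" and "a bc") will also compare as equal.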
Cheers!