microformats / microformats-ruby

Ruby gem that parses HTML containing microformats/microformats2 and returns Ruby objects, a Ruby hash or a JSON hash
https://rubygems.org/gems/microformats
Creative Commons Zero v1.0 Universal
100 stars 29 forks

<br> tags are not interpreted as whitespace when converting HTML to plaintext #83

Open aaronpk opened 6 years ago

aaronpk commented 6 years ago

Similar to https://github.com/microformats/mf2py/issues/51 and https://github.com/indieweb/php-mf2/issues/69, the Ruby parser is stripping <br> tags rather than converting them to newlines when converting HTML to plaintext.

This is very apparent on Tantek's autoformatted posts. Compare the name and content.value for one of his posts:

dissolve commented 6 years ago

This is pretty tricky because this is how Nokogiri does it, so fixing it basically means rewriting the HTML-to-text conversion :(

maybe another library does this better

jgarber623 commented 6 years ago

@aaronpk @dissolve I have a possible solution to this issue, but I don't know enough about the codebase to know where to make the changes.

Here's a bit of code that might be useful and/or spark some ideas. It uses the aforementioned page on Tantek's website and assumes we're only interested in .e-content. That's a narrowed use case for demonstration purposes, of course.

Save the following to a file (e.g. ~/to_plaintext.rb) and run ruby ~/to_plaintext.rb in a Terminal:

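require 'net/http'
require 'nokogiri'

# Parse the page as HTML (not XML) so real-world markup is handled leniently.
uri = URI('http://tantek.com/2018/061/t2/improving-test-suite-home-pages')
doc = Nokogiri::HTML(Net::HTTP.get(uri))

# Replace each <br> inside .e-content with a text node containing a newline.
doc.css('.e-content br').each do |node|
  node.replace(Nokogiri::XML::Text.new("\n", doc))
end

puts doc.css('.e-content').text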

The output should look like:

Appreciate the explanation and link to the source file; makes sense.

However there is still a fundamental usability problem of the discoverability of how to file issues and suggested improvements for CSS module test suites.

I would like to suggest improving the generated test suite home pages themselves (e.g. http://test.csswg.org/suites/css-cascade-3_dev/nightly-unstable/) to link directly to https://github.com/w3c/web-platform-tests/ and suggest searching it for any source files one might want to file issues (or contribute patches) for, as you demonstrated in your comment (which I will do shortly for the cascade-import-002.htm source file specifically, thanks for the pointer. Update, done: https://github.com/w3c/web-platform-tests/issues/9910).

Note: the existing text of "More information on the contribution process and test guidelines is available on the wiki page." is not really useful, as the "wiki page" that is linked to (http://wiki.csswg.org/test) has A TON of links (a maze of passages that all appear alike if you will), none of which contain the precisely useful advice that you gave in your comment! Nor is it readily obvious how to fix http://wiki.csswg.org/test as it seems to serve many purposes, and the two likely links "How to Contribute" and "Reviewing Tests" both say on their pages: "This page has been deprecated and is no longer being maintained." with a top-level link to http://web-platform-tests.org/ which is also not useful, and that's already three clicks deep (if you guessed right which links to click) with still no answer as to how to contribute to this specific module's test suite.

Where should I file an issue and/or patch for the template or generation of the home pages of CSS module test suites like the specific page http://test.csswg.org/suites/css-cascade-3_dev/nightly-unstable/?

Thanks!

☝️ Note the line breaks, which under the hood are \n characters. Success!

That's a pretty gnarly bit of code to drop in a bunch of places, so it might be worth adding a to_plaintext method (private or not) in one of the gem's classes (FormatParser, maybe?) so it can be used more frequently (akin to to_hash, to_json, etc.).

What do you think? Seem like a workable solution?