inukshuk / bibtex-ruby

A BibTeX library, parser, and converter for Ruby.
http://inukshuk.github.com/bibtex-ruby
GNU General Public License v3.0
156 stars 31 forks source link

Multiline strings should be turned into single lines #39

Closed JLimperg closed 12 years ago

JLimperg commented 12 years ago

At the moment, bibtex-ruby returns multiline strings as-is, i.e. as a string with newline characters. This is probably not what most people would expect, and I can imagine no use cases where this behaviour would be desired. Therefore bibtex-ruby should in my opinion join multiple lines together and create a single line.

Example:

require 'bibtex'

BIB = <<END
@article{stuff,
  title = "This very long title must be wrapped and continues
           on the next line. bibtex-ruby should probably
           join everything together and create a single line though."
}
END

bib = BibTeX.parse(BIB)
title = bib['stuff']['title']

puts title
puts
puts title.gsub(/\n+\s*/m, ' ')

Output:

This very long title must be wrapped and continues
           on the next line. bibtex-ruby should probably
           join everything together and create a single line though.

This very long title must be wrapped and continues on the next line. bibtex-ruby should probably join everything together and create a single line though.

By the way, I forgot to say thanks for this awesome library in the previous issue so let's do it now. My own little project would have been completely impossible without it.

inukshuk commented 12 years ago

Hmmm, let's mull over this a little bit. If we do this by default you could not have explicit newlines in abstracts for example. Are there any other fields that may require newlines? Perhaps we could add this as an option or as a filter sort of like the latex decoder. What do you think?

Thanks for using the library and reporting issues – that improves the quality for all of us!

JLimperg commented 12 years ago

Ah, true. Didn't think of abstracts and such. I suppose there are three layers of complexity at which this problem can be addressed:

  1. Leaving things as is, which means that users must simply decide for themselves how they want to deal with the problem. In that case, a hint in the documentation would be cool, and a way to change Value strings directly without using the Filter class could come in handy. (Hope I didn't overlook one.)
  2. Implementing a parser option that globally turns on/off newline treatment. That wouldn't help a whole lot, as the need to make custom rules for certain fields would still remain, but at least it would allow to define the default behaviour.
  3. Introducing a (modifiable) blacklist of fields for which newlines should not be processed, including abstract, and removing newlines and whitespace from any other fields. Obviously this is the most complete solution, but also the most complicated and code-cluttering one.

In any case I would suggest an option to remove all whitespace behind a newline character, and I would make this the default behaviour. While I see that preserving newlines can be required, almost nobody would want to preserve the indentation from the BibTeX file.

inukshuk commented 12 years ago

After giving this some more thought, I've come around to the conclusion that we should filter line breaks by default. You're absolutely right that this behaviour is what you would typically expect. I suggest we make this a lexer option; additionally we could write a simple filter for this that allows you to exclude given fields. That way, we have a useful default behaviour with little performance penalty; and if you have special requirements (i.e., fields containing structured data relying on line breaks), you can turn it off and use the filter instead. What do you say?

Would you be interested in implementing this? I could tackle the lexer option today, but it would be awesome if you could add tests and the filter.

Also, I just saw rcite and I'm curious to know what are your plans going forward? I'm in the process of completely rewriting citeproc-ruby; haven't published it yet, but basically, I extracted a CSL library from it, and was actually considering to design a DSL to easily define 'simple' citation styles. Internally, they would be turned into full CSL styles, but without the overhead. Perhaps this is something that would interest you?

inukshuk commented 12 years ago

I added the lexer option. Could you add your example above as a cucumber feature and test if it works as expected? (e.g., in features/issues/multiline_strings.feature)

JLimperg commented 12 years ago

Feature written but somehow doesn't work: the newline and following whitespace don't get collapsed. I have no idea why, actually...

As far as RCite and your citeproc library are concerned, it seems like we're doing quite similar things. My intention is to write something like BibLaTeX but without the clumsy style syntax that drives me nuts whenever I try to make some little adjustment to one of the standard styles. Instead, the plan is to use pure Ruby for styles, making them human-readable and giving authors all the powerful tools at the same time. As you can see in the Teststyle, the syntax is already rather straightforward, but I'm still looking to improve it a bit further -- some parts don't look very clear yet. After that, I have about a million features that could be implemented on top of the basic thing...

Unfortunately I don't like CSL at all, mainly because of the fact that it uses XML for styles, making them really hard to understand. Therefore I'm probably not in a position to help a lot with your DSL idea, although I definitely could imagine it being quite helpful (as well as loads of coding work ;)).

inukshuk commented 12 years ago

Can you open a pull request with the feature so I can try and see what's going on there?

As regards RCite/CSL I fully understand where you're coming from; I sort of started out similarly but completely underestimated the complexity of citation styles (truth be told, I still can't quite fathom why they are that complicated). Having said that, you should be cautious going forward to keep RCite simple and functional as you add features because it gets complicated fast. For instance, what about localized and gendered ordinals? To have a style produce something seemingly simple like edition numbers you have to solve many problems, not to speak of the translations themselves.

CSL gives you a huge repository of styles and locales to start with; it let's you do all the complex stuff, but the overhead for simple requirements is too high. That's why I've been contemplating writing a library that makes the simple stuff easy (and makes the XML transparent) but is still compatible with CSL. Anyway, all that is besides the point I guess. But keep it mind. Meanwhile, let me know when there's anything we can do with bibtex-ruby to support RCite.

inukshuk commented 12 years ago

Strange. Your feature works fine over here. What error are you seeing?

JLimperg commented 12 years ago

bundle exec cucumber doesn't work at all, throwing a LoadError:

no such file to load -- bibtex/name_parser (LoadError)
/home/jannis/tmp/bibtex-ruby/lib/bibtex.rb:62:in `require'
/home/jannis/tmp/bibtex-ruby/lib/bibtex.rb:62:in `<top (required)>'
/home/jannis/tmp/bibtex-ruby/features/support/env.rb:1:in `require'
/home/jannis/tmp/bibtex-ruby/features/support/env.rb:1:in `<top (required)>'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/rb_support/rb_language.rb:129:in `load'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/rb_support/rb_language.rb:129:in `load_code_file'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/runtime/support_code.rb:171:in `load_file'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/runtime/support_code.rb:83:in `block in load_files!'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/runtime/support_code.rb:82:in `each'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/runtime/support_code.rb:82:in `load_files!'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/runtime.rb:174:in `load_step_definitions'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/runtime.rb:40:in `run!'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/cli/main.rb:43:in `execute!'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/lib/cucumber/cli/main.rb:20:in `execute'
/var/lib/gems/1.9.1/gems/cucumber-1.1.4/bin/cucumber:14:in `<top (required)>'
/var/lib/gems/1.9.1/bin/cucumber:19:in `load'
/var/lib/gems/1.9.1/bin/cucumber:19:in `<main>'

You might want to try out require_all; this gem magically solved all my require problems. ;)

Without the bundle exec part, cucumber --name 'Multiline' runs fine but reports:

Feature: Multiline values on single lines
  As a hacker who works with bibliographies
  I would find it practical to have multiline values turned to single lines by default
  Because this is what most BibTeX files will assume.

  Scenario: Multiline value                                                                                                                           # features/issues/multiline_strings.feature:6
    When I parse the following file:                                                                                                                  # features/step_definitions/bibtex_steps.rb:5
      """
      @article{stuff,
        title = "This very long title must be wrapped and continues
                 on the next line."
      }
      """
    Then the entry with key "stuff" should have a field "title" with the value "This very long title must be wrapped and continues on the next line." # features/step_definitions/bibtex_steps.rb:93
      Expected "This very long title must be wrapped and continues on the next line.", not "This very long title must be wrapped and continues\n           on the next line.". (MiniTest::Assertion)
      ./features/step_definitions/bibtex_steps.rb:94:in `/^the entry with key "([^"]*)" should have a field "([^"]*)" with the value "([^"]*)"$/'
      features/issues/multiline_strings.feature:14:in `Then the entry with key "stuff" should have a field "title" with the value "This very long title must be wrapped and continues on the next line."'

Failing Scenarios:
cucumber features/issues/multiline_strings.feature:6 # Scenario: Multiline value

1 scenario (1 failed)
2 steps (1 failed, 1 passed)
0m0.005s

I have run everything on a fresh clone of your repo.

inukshuk commented 12 years ago

Right, you need to build the parsers first. Try to run:

$ bundle exec rake racc

And you should be good to go.

inukshuk commented 12 years ago

OK, I just added a separate filter, so I think we now have what we wanted. By default, linebreaks should be stripped. If you need more control you can turn it off:

 b = BibTeX.parse source, :strip => false
 b.convert :linebreaks

This allows you to filter selectively, too, for example, something like (untested):

entry.convert :linebreaks { |field, value| field != :abstract }
JLimperg commented 12 years ago

All right, works. Sorry for the confusion. So, I guess this issue can be closed then. If I find some time I may extend the lexer in order to implement the whitelist/blacklist approach mentioned above and just send you a pull request.

Off Topic: I'm well aware of the high complexity of those citation styles, therefore I'm currently trying to find a general approach that is very flexible, easy to understand and open enough to let me add new features without much change to the existing code. Needless to say this is not the simplest task in the world...

inukshuk commented 12 years ago

Great!

The lexer isn't really aware of context (field names etc.); besides, isn't our solution not flexible enough as is?

JLimperg commented 12 years ago

Yes, it's pretty flexible already, but the list approach could maybe make it a bit more convenient. That said, the difference certainly wouldn't be too big, so I'll just see. If there is a clean and easy way to implement it, that would be cool, and if not, it's not necessary either.