lukeme / gobible

Automatically exported from code.google.com/p/gobible

Handling multiple whitespace in USFM source text #139

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
GoBibleCreator should automatically reduce multiple whitespace found in USFM 
source text.

Michael Johnson observes:

"Multiple white space is not a problem, according to the USFM specification, a 
line end is the same as a space, and multiple white spaces in the text are the 
same as a space. Some white space is part of the markup, and not part of the 
text, as is white space following any start tag, which is really part of the 
tag and not part of the text. For example, "word \add word\add* word" or 
"word\add  word\add* word" both have exactly one significant white space 
between each of the three words, because the space after \add is part of that 
tag, and not part of the text."

By contrast, Go Bible Creator 2.4.0 leaves multiple whitespace as is. This is 
most noticeable where the extra spaces are left behind after certain marker 
pairs have been removed.
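
As a rough illustration of the rule quoted above, here is a minimal Python sketch. It is not Go Bible Creator code; the function name `strip_add_markers` and the regular expressions are assumptions made for this example. It treats the one whitespace character after `\add` as part of the tag, then collapses any remaining run of whitespace (including line ends) to a single space.

```python
import re

# Hypothetical helper for illustration only; not taken from Go Bible Creator.
# The opening tag consumes exactly one following whitespace character,
# because that whitespace belongs to the tag rather than to the text.
ADD_OPEN = re.compile(r"\\add\s")
# The closing tag carries no whitespace of its own.
ADD_CLOSE = re.compile(r"\\add\*")


def strip_add_markers(text: str) -> str:
    """Remove \\add ... \\add* markers and collapse the remaining whitespace."""
    text = ADD_OPEN.sub("", text)
    text = ADD_CLOSE.sub("", text)
    # A line end or a run of whitespace counts as a single space.
    return re.sub(r"\s+", " ", text)
```

Under these assumptions, both of the example strings in the quotation reduce to the same three words separated by single spaces.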

Original issue reported on code.google.com by DFH...@gmail.com on 11 Oct 2010 at 4:44

GoogleCodeExporter commented 8 years ago
Does this mean that USFM cannot be used to mark out added text in agglutinative 
languages? (There is a possibility that part of a "word" is added, but not the 
whole word.)

Original comment by daniel.s...@gmail.com on 23 Nov 2010 at 1:40

GoogleCodeExporter commented 8 years ago
Daniel,

I had to look up http://en.wikipedia.org/wiki/Agglutinative_language

Even now I'm none the wiser why you asked the question!

David

Original comment by DFH...@gmail.com on 23 Nov 2010 at 4:17

GoogleCodeExporter commented 8 years ago
I guess I got the terminology wrong. It should be conjugation. But whatever it 
is...

I'm just wondering, in languages that often join words together (e.g. German) 
or languages where suffixes are significant (e.g. in Latin where suffixes to a 
verb indicate the subject, see 
http://en.wikipedia.org/wiki/Latin_conjugation#Personal_endings), whether the 
text between \add ... \add* might need to be _part_ of a word rather than a 
word on its own.

Using Latin as an example (with the help of Wikipedia!)
portem = first person ("I") "to carry"
portet = third person ("he") "to carry"

if the subject was not indicated in the original manuscripts, how would the 
translation indicate that the -em or -et were added in?

Original comment by daniel.s...@gmail.com on 23 Nov 2010 at 5:29

GoogleCodeExporter commented 8 years ago
That's an interesting point!

I'm not a world expert on USFM, but it would be fascinating to explore this 
with other CrossWire volunteers.

Of course, many Bible translations do not attempt to designate words [or parts 
of words] that are added by the translators in order to render the Hebrew or 
Greek more intelligibly in the target language.

Original comment by DFH...@gmail.com on 23 Nov 2010 at 8:58

GoogleCodeExporter commented 8 years ago
The main point of this issue is to improve Go Bible Creator so that, after 
parsing and processing all the USFM tags:

1. Any extra leading spaces (after the verse tag) are automatically removed,
2. Each instance of multiple whitespace is replaced by a single space,
3. Any trailing spaces at the end of a verse are removed.

I don't see that this has any implications for special cases (cf. the above 
discussion).
For languages that don't use ordinary ASCII spaces (U+0020) to separate words, 
it should have no impact.
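
The three steps listed above can be sketched as follows. This is only an illustration in Python, not the Go Bible Creator implementation; the function name `normalize_verse_text` is hypothetical, and it assumes "whitespace" means spaces, tabs and line ends.

```python
import re


def normalize_verse_text(text: str) -> str:
    """Apply the three clean-up steps to verse text whose USFM tags are already removed.

    Hypothetical helper for illustration only.
    """
    # Replace each run of spaces, tabs and line ends with one ordinary space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    # Remove the leading and trailing space, if any.
    return collapsed.strip(" ")
```

Because only ASCII whitespace is matched, other separators (for example the ideographic space U+3000) pass through unchanged in this sketch.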

Original comment by DFH...@gmail.com on 28 Nov 2011 at 3:55

GoogleCodeExporter commented 8 years ago
This issue is covered by the following setting in USFMSettings.txt, which is 
included in Go Bible Creator version 2.4.5:

SignificantWhitespace: false

...

// Determines whether multiple spaces in scripture text are collapsed
// into a single space (a la HTML)
// false -- Multiple spaces are collapsed
// true -- Every space is honoured, except those after tag openings
// SignificantWhitespace: false
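
To make the effect of the setting concrete, here is a small Python sketch. It is not how Go Bible Creator reads USFMSettings.txt; the functions `parse_settings` and `apply_whitespace_setting` are hypothetical, the `Key: value` parsing with `//` comment lines is inferred only from the excerpt above, and treating a missing setting as `false` is an assumption.

```python
import re


def parse_settings(lines):
    """Read 'Key: value' lines, skipping blank lines and '//' comments (hypothetical parser)."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            settings[key.strip()] = value.strip()
    return settings


def apply_whitespace_setting(text, settings):
    """Collapse runs of spaces unless SignificantWhitespace is true."""
    # Assumption: an absent setting behaves like 'false'.
    significant = settings.get("SignificantWhitespace", "false").lower() == "true"
    if significant:
        return text  # every space is honoured
    return re.sub(r" {2,}", " ", text)
```

With `SignificantWhitespace: false`, a string such as `"a  b   c"` would come back with single spaces; with `true`, it would be returned unchanged.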

Original comment by DFH...@gmail.com on 31 Dec 2012 at 2:20

GoogleCodeExporter commented 8 years ago
This issue may be closed once version 2.4.5 has been released.

Original comment by DFH...@gmail.com on 31 Dec 2012 at 2:21

GoogleCodeExporter commented 8 years ago

Original comment by DFH...@gmail.com on 3 Jan 2013 at 10:35