USFM `\fig` properties written out as plain text

schierlm / BibleMultiConverter

Converter written in Java to convert between different Bible program formats


USFM `\fig` properties written out as plain text #68

Open shadow-light opened 1 year ago

shadow-light commented 1 year ago

I see that some USFM tags like \fig aren't supported, which is understandable. At the moment, though, their contents appear to be written out as plain text:

\v 6 Na omiu hinage Loma Yaubada bada iyogaegomiu'o ta omiu Yeisu Besinana yana boda ainai ammiyamiya. \fig Rom 1.7|src="hk00237b.tif" size="col" ref="Loma 1:6" \fig*

goes to:

<verse number="6" style="v" sid="ROM 1:6"/>Na omiu hinage Loma Yaubada bada iyogaegomiu'o ta omiu Yeisu Besinana yana boda ainai ammiyamiya. Rom 1.7|src="hk00237b.tif" size="col" ref="Loma 1:6"<verse eid="ROM 1:6"/>

Is there an option to just exclude them entirely?

(example is from Romans 1:6)

schierlm commented 1 year ago

Because tag structure in USFM is not as uniform as you might like, handling the "unsupported" tags would require special-casing. Currently, everything from the backslash to the first non-letter character is skipped. For \fig, you would probably want to skip to the next \fig*, yet for \qs-s you would probably want to skip either to the following \* or to the \* after the next \qs-e.

That being said, I am not interested in implementing these (or any other) advanced USFM tags, not even to the extent to improve skipping them. If anyone wants to tackle this, patches are of course welcome.
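To illustrate why the attributes end up in the verse text, here is a rough Python model of the skipping rule described above. It is only a sketch, not the converter's actual code: `naive_skip` is a hypothetical name, and treating a trailing `*` as part of the marker is an assumption based on the output quoted in this issue.

```python
import re

# Rough model: drop the backslash and the letters that follow it,
# leaving every argument of the marker in place.
# The optional "\*" is an assumption drawn from the output quoted in the issue.
NAIVE_MARKER = re.compile(r"\\[A-Za-z]+\*?")


def naive_skip(usfm: str) -> str:
    """Remove marker names only; marker arguments stay as plain text."""
    return NAIVE_MARKER.sub("", usfm)


line = r'ammiyamiya. \fig Rom 1.7|src="hk00237b.tif" size="col" ref="Loma 1:6" \fig*'
print(naive_skip(line))
```

The marker names disappear, but the caption and the attribute list survive, which matches what the issue reports.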

Rolf-Smit commented 1 year ago

I may be able to work on this. I have some custom software for my Bible app that can handle these open and close tags, including nesting. However, I'm not sure how much work it would be to get that into this.

I also have a USFM tool that I call a cleaner/normalizer, that is able to filter out these tags from USFM files.

Maybe I could open-source that so you (@shadow-light) would be able to filter them out. Would that be sufficient?

But I need some time to look into both things.

shadow-light commented 1 year ago

No problem. Yes, I'm also happy to help work on this if needed. If it is fairly simple, I could do a PR for this tool. Or, if it's complex, a tool to normalize USFM before passing it to BibleMultiConverter could also be good.

shadow-light commented 1 year ago

To clarify the lack of support for ca, cp, va, vp, fig and fm: is it just fig that gets written out, or do the others get included as plain text as well?

It should be fairly trivial just to remove fig instances with a simple regex, so I'll just do that for my use case. I suppose it would be nice to support these tags for conversion to USX at least for completeness one day.
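A minimal sketch of that regex workaround, run over the USFM text before it is handed to the converter. The name `strip_figures` is hypothetical, and the pattern assumes every `\fig` has a matching `\fig*` and that figures are not nested.

```python
import re

# Match "\fig" plus whitespace, then the shortest run of text up to "\fig*".
# DOTALL lets a figure span a line break; the leading "\s*" also removes
# the whitespace in front of the figure so no stray gap is left behind.
FIG_SPAN = re.compile(r"\s*\\fig\s.*?\\fig\*", re.DOTALL)


def strip_figures(usfm: str) -> str:
    """Remove every \\fig ... \\fig* span, including caption and attributes."""
    return FIG_SPAN.sub("", usfm)


verse = r'\v 6 ammiyamiya. \fig Rom 1.7|src="hk00237b.tif" size="col" ref="Loma 1:6" \fig*'
print(strip_figures(verse))
```

An unclosed `\fig` is left untouched by this pattern, so it would still leak into the verse text and trigger the converter's warning.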

schierlm commented 1 year ago

The tag names themselves, i.e., \fig or \va, get skipped. Their arguments get written out as plain text, and a warning is issued.

So, in the example from the documentation:

\v 1 \va 3\va* Save me by your power, O God;

The number 3 will become part of the verse text of verse 1.
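The same pre-filtering idea can be extended to other paired markers such as `\va ... \va*`. The sketch below is one possible approach, not something the converter provides: `strip_paired` and its default marker list are hypothetical, and it only covers markers that have an explicit closing `*` form.

```python
import re


def strip_paired(usfm: str, markers=("va", "vp", "ca")) -> str:
    """Remove "\\name ... \\name*" spans for the given marker names."""
    names = "|".join(re.escape(m) for m in markers)
    # The backreference \1 forces the closing marker to match the opening one.
    # The trailing "\s?" removes one space after the span to avoid a double gap.
    pattern = re.compile(r"\\(" + names + r")\s.*?\\\1\*\s?", re.DOTALL)
    return pattern.sub("", usfm)


print(strip_paired(r"\v 1 \va 3\va* Save me by your power, O God;"))
```

With the alternate verse number removed up front, the `3` no longer ends up in the text of verse 1. Markers without a closing form would need different handling.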