jgm / pandoc

Universal markup converter
https://pandoc.org
Other
34.39k stars 3.37k forks source link

All docx lines annotated with {dir=“ltr”} #5723

Closed fmatheus closed 4 years ago

fmatheus commented 5 years ago

Even the empty lines of a google docs download docx are annotated when converting from docx to markdown as default behavior. I think it should not be the default, or at least should be easy to disable, since it brakes previous behavior. This was not a bidi file, all lines are ltr:

[test line]{dir=“ltr”}

[]{dir="ltr"}

pandoc 2.7.3 Compiled with pandoc-types 1.17.6, texmath 0.11.2.2, skylighting 0.8.2

Seems related to #5545.

jgm commented 5 years ago

Just tried with a docx generated by Google Docs. Sure enough, it is riddled with <w:rtl w:val="0"/> elements. I'm not sure why Google does this, but I suppose we'll have to work around it.

jgm commented 5 years ago

@jkr any ideas about how to handle this?

Basically we should only insert the "dir=ltr" span when the pPr or enclosing sectPr includes <w:bidi/>. The default is ltr.

jkr commented 5 years ago

Hmm... I couldn't reproduce with a google docs download. But, yeah, only doing it with bidi seems right. The issue is that we can't do just ignore it if it's zero or "ltr" because it might be an ltr section in a rtl doc.

@jgm, would it be possible to post the google doc you downloaded?

jgm commented 5 years ago

I just opened a doc, typed a few lines, and downloaded as docx. Here's an example. Untitled.docx

jgm commented 5 years ago

Converting this to markdown yields

[Hi]{dir="ltr"}

[]{dir="ltr"}

[This is a brand new document.]{dir="ltr"}

[]{dir="ltr"}
jkr commented 5 years ago

Sorry -- I had forgotten that I had written the part in the writer, but not this part in the reader. Which explains why racking my brain trying to remember my thinking was coming up empty.

I imagine it's safe to just ignore the RTLs unless they're inside of a bidi prop. Even then, for the sake of correctness, we should only know whether they reverse direction (this clearly doesn't, since it's a zero). I'll take a look at the spec and see if there's anything illuminating in there, but for now, it seems like the safe thing is to have a envBidi toggle (or perhaps triple: Just LTR, Just RTL, Nothing) and only pay attention to these if the toggle is on.

jgm commented 5 years ago

I agree, I think that's the thing to do.

<w:bidi> can appear either in sectPr or in pPr. Currently it doesn't look like we have anything that reads sectPr?

jkr commented 5 years ago

No, we don't read sectPr. Weirdly, sectPr comes as a child of the last paragraph in the section (except for the last section, in which case it's a child of body). Unfortunately, we don't have a native concept of section that we can translate this into -- first/last complications aside.

I think we could add one at the docx level (add a SectBreak type to the BodyPart algebraic type) but we'd have to move it to the beginning of the section area when we got to it.

This also means that we'd have to do a second pass to figure out what to do with all the rtl stuff. But we have to do that anyway to convert the Docx type to Pandoc, so we could take care of it there.

Okay, I'll give it a try, and see how it works.

jgm commented 5 years ago

Unfortunately, we don't have a native concept of section that we can translate this into -- first/last complications aside.

Div?

jkr commented 5 years ago

I meant in the docx type ("native" was a misstatement). Going docx type -> pandoc type is fairly easy, since we can just set an environment when we hit the section break. Divs would add noisiness for not much gain.

Sorry -- I was mainly trying to think it out from the docx-parsing side.

jgm commented 5 years ago

Any progress on this?

jkr commented 5 years ago

Yeah -- but the correct implementations of sections are the sticking point. It requires either a considerable change in parsing or in conversion. I've tried a couple of things. But if the release is waiting on it, I'll go with a good-enough approach and make it more elegant later.

jgm commented 5 years ago

Ah, I see. Well, I don't have a specific timeline for a release, so there's still time to work on it.

jkr commented 5 years ago

I've finished the implementation of sections and could push it soon. But, there's a punchline. Going back to the spec, where it discusses bidi at the section level:

This section direction is now right-to-left, which means that all section level properties are displayed > right-to-left (e.g., page numbers are displayed on the right of text; columns are populated from right-to-left). However, the layout of text is determined by properties applied at the text level. end example] This element’s content model is defined by the common boolean property definition in §17.17.4.

The section-level stuff doesn't actually effect text-layout.

So the question: should I keep the section layout at the Docx level, though it doesn't really effect pandoc generation? Right now, in my branch, instead of

data Body = Body [BodyPart]

(yes it should be a newtype) we have

data Body = Body [Section]
data Section = Section SectionStyle [BodyPart]

So it's fairly intrusive, but it finally works.

Question is, if it doesn't effect bidi what would we use it for? The only thing I could see would be giving the option to output your docx with sections as divs. What do you think? As I say, it's already implemented, so it doesn't really cost us anything (other than some negligible processing time and some complexity).

jkr commented 5 years ago

Oh -- the other thing I forgot to mention. Sections are actually fairly useful in the docx Writer since they allow you to start footnote numbering over at new chapters. I have a filter doing that with raw ooxml. But it might be a good idea to implement them in the writer for that use case. So having them in the reader would offer some symmetry.

jgm commented 5 years ago

Oh, that's interesting. I guess I jumped to conclusions about the effect of putting bidi in a sectPr.

Well, I'll leave it up to you whether to leave this in. If you think it might be useful at some point in the future, it probably doesn't hurt to have it there. On the other hand, it is some extra complexity and that always has a cost!

agusmba commented 5 years ago

Sections are actually fairly useful in the docx Writer ...

Are you referring in the end to section jumps in the final docx also? If so, that'd give me pause. Word document styling gets noticeably complicated once you start using sections. Unfortunately 95% of Word users don't know how to properly deal with them (I completely made that up). It's not that it's really complicated just a big hassle that can wreak havoc in your document if you're not careful.

jkr commented 5 years ago

Yeah, I am talking about the writer there. it might be better left for filters. I don't know. The two main use cases I come across, and need to use the filter to inject the ooxml for, are both for typesetting full book manuscripts in docx (which humanities publishers are still pretty insistent on -- I imagine some dissertation committees are pretty insistent on docx as well):

  1. Lower-case roman page-numbering for front-matter, and
  2. Restarting footnote numbering with each chapter

Both are essential, and both require sections.

But I can also see the point that sections will create unforeseen consequences if people aren't careful with their reference.docx. But that goes for a lot of things.

Anyway, this is off-topic for this change, which I'll do without sections. I'll probably put up something for discussion later. In the meantime, I'll try to publish my filter.

jkr commented 5 years ago

BTW, I've just been waiting for the style changes to stabilize on this.

jgm commented 5 years ago

Style changes have been merged now.

jgm commented 5 years ago

@jkr is something still blocking this? I'd be fine with a fix that just ignores the section stuff, for now.

jkr commented 5 years ago

Just a particularly insane semester -- and GHC's ever-slower compile times makes it harder to fit in between things. I'll commit to it by the end of the weekend.

jgm commented 4 years ago

@jkr I'm getting close to clearing the issues on the 2.8 milestone. It would really be nice to get this one in there: what is the state of things?

jkr commented 4 years ago

Compiling as we speak -- trying to get it done tonight. Should have an update in an hour or so. Sorry again for the state of my attention...

jgm commented 4 years ago

Excellent, thanks!

conor909 commented 4 years ago

I'm using a reference.docx file but the RTL formatting doesn't work at all, will the updates fix this ?

jkr commented 4 years ago

If you're using a reference.docx, you're probably taking about the docx writer. This was a bug in the docx reader.

If you do file a bug in the proper place, you will need to be more clear about what "doesn't work at all" means. What do you expect to happen? What actually happens? What version are you using? Including a sample input file and the reference.docx file you're using will also help.

conor909 commented 4 years ago

well after saving RTL formatting on text in my reference file, it isn't applying RTL formatting to to the text in my outputted file - but sorry for pissing in your cornflakes @jkr I'll ask somewhere else