jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.02k stars 619 forks

adding extract_text_dir_sensitive #1040

Closed afriedman412 closed 4 months ago

afriedman412 commented 8 months ago

Fixes #848 (partially)

Adds extract_text_dir_sensitive function to .utils.text which lets the user specify which direction the lines and characters should be read.

Because the syntax is new, I didn't want to just alter extract_text_simple but I can integrate the two if that would be preferable!

jsvine commented 8 months ago

Thanks, @afriedman412. Could you provide a bit more context? Specifically, how does this differ from the horizontal_ltr= and vertical_ttb= parameters of .extract_words(...) and related methods? Are there specific scenarios that those methods can't handle, but the proposed additions can? (Etc.)

afriedman412 commented 8 months ago

OK so the idea here was to make it easier/more intuitive to manually control the direction in which text is read on both axes. If you have text that is rotated in some multiple of 90 degrees, you can just say "the words go right to left and the lines go top to bottom" and it will parse it correctly.

As I understand it, the horizontal_ltr and vertical_ttb params are specific to the direction of words but still presume the direction of the lines. More to the point, they both fundamentally control the same thing, so it's a little confusing to have them separate.

The bigger picture for me is that text direction on both axes is controlled the same way, by choosing which char parameters we use to group lines and words, and those can be cleanly parsed from asking two basic questions about the text orientation when calling the function.
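To make the "two basic questions" idea concrete, here is a minimal pure-Python sketch. It is not the code from this PR, nor pdfplumber's internal implementation: the function name `read_chars_by_direction`, the `SORT_KEYS` table, and the `tolerance` parameter are all hypothetical. It only assumes that each char is a dict with `text`, `x0`, and `top` keys, as pdfplumber's char objects have.

```python
# Hypothetical mapping: direction -> (char attribute to sort on, descending?)
SORT_KEYS = {
    "ltr": ("x0", False),
    "rtl": ("x0", True),
    "ttb": ("top", False),
    "btt": ("top", True),
}


def read_chars_by_direction(chars, char_dir="ltr", line_dir="ttb", tolerance=3):
    """Group chars into lines along one axis, then order them along the other."""
    line_attr, line_desc = SORT_KEYS[line_dir]
    char_attr, char_desc = SORT_KEYS[char_dir]
    if line_attr == char_attr:
        raise ValueError("char_dir and line_dir must lie on different axes")

    # Question 1: which way do the lines go? Sort on that axis, then cluster
    # chars whose position is within `tolerance` of the line's first char.
    ordered = sorted(chars, key=lambda c: c[line_attr], reverse=line_desc)
    lines = []
    for char in ordered:
        if lines and abs(char[line_attr] - lines[-1][0][line_attr]) <= tolerance:
            lines[-1].append(char)
        else:
            lines.append([char])

    # Question 2: which way do the chars go within each line?
    return "\n".join(
        "".join(
            c["text"]
            for c in sorted(line, key=lambda c: c[char_attr], reverse=char_desc)
        )
        for line in lines
    )
```

Under these assumptions, a page rotated by 180 degrees would be read with `char_dir="rtl", line_dir="btt"`, and a column of sideways text with `char_dir="ttb", line_dir="ltr"`; the same two lookups cover every multiple of 90 degrees.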

Does that make more sense?

jsvine commented 8 months ago

Ah, got it! I do like the idea of being able to specify char/line direction with more granularity. In fact, I think this would be nice as a core part of the main extraction methods. Doing that will require a bit of code-surgery, so I'm going to take that on myself, but will credit you clearly.

Largely as a note to self, it sounds like there are a few different types of scenarios in which the reading direction of text on a page is not left-to-right, top-to-bottom:

afriedman412 commented 8 months ago

> Ah, got it! I do like the idea of being able to specify char/line direction with more granularity. In fact, I think this would be nice as a core part of the main extraction methods. Doing that will require a bit of code-surgery, so I'm going to take that on myself, but will credit you clearly.

I can do it when I have time (if you haven't already). But I made the standalone function as a way to soft launch the syntax, with an eye towards full implementation whenever.

>   • Pages that have been rotated. This feels like a nice-to-have but has been a lower priority; it seems like your suggestions here would sort of "automatically" handle that, which is nice.

Yeah I mean this issue is technically about issues parsing rotated text. And granted, a lot of this could be easily sorted upstream or downstream of text extraction, but it makes sense to put all text direction control in one place given how easy that is.

jsvine commented 6 months ago

Good news: I've made some progress on incorporating this more deeply into pdfplumber. There are still a few implementation details for me to iron out, but the general approach seems promising.

One wrinkle I realized: There are basically two variations of RTL text: (a) text that runs right-to-left for page rotation reasons (such as those in your examples) and (b) text in scripts/languages that naturally run right-to-left. In (a) the assumption, also reflected in the tests in this PR, seems to be that the user would want that text "fixed" in the output — i.e., for the output to read LTR. But for (b) I think it's safe to say that users would want the text to remain RTL.

I don't think there's an automated way to tell the difference between those two scenarios with high fidelity, so I'm planning to add two additional parameters — char_dir_output and line_dir_output, which would default to the values of char_dir and line_dir but allow the user to override that.
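As a rough illustration of the "defaults to the reading direction, but can be overridden" behavior, here is a small sketch. It is only one possible reading of the proposal, not the code that ended up in pdfplumber: the function name `render_lines` is hypothetical, and the choice to treat an opposite output direction as a simple reversal (and to reject an output direction on the other axis) is an assumption made for this example.

```python
# Assumption for this sketch: an output direction may only be the reading
# direction itself or its opposite on the same axis.
OPPOSITE = {"ltr": "rtl", "rtl": "ltr", "ttb": "btt", "btt": "ttb"}


def render_lines(
    lines,
    char_dir="ltr",
    line_dir="ttb",
    char_dir_output=None,
    line_dir_output=None,
):
    """`lines` is a list of strings, already collected per char_dir/line_dir."""
    # Fall back to the reading directions unless the caller overrides them.
    char_out = char_dir if char_dir_output is None else char_dir_output
    line_out = line_dir if line_dir_output is None else line_dir_output

    for read, out in ((char_dir, char_out), (line_dir, line_out)):
        if out not in (read, OPPOSITE[read]):
            raise ValueError(f"{out!r} is not on the same axis as {read!r}")

    if char_out != char_dir:
        # Flip the character order within each line.
        lines = [line[::-1] for line in lines]
    if line_out != line_dir:
        # Flip the order of the lines themselves.
        lines = lines[::-1]
    return "\n".join(lines)
```

With no overrides the text passes through unchanged, so the caller decides per document whether the collected order should be kept or flipped.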

One other note: dir as an abbreviation has some potential ambiguity, as it's often used to refer to a "directory". I'm OK with using it here as short for "direction", but figured it might be worth brainstorming other potential naming conventions. Open to suggestions.

afriedman412 commented 6 months ago

Thanks for the input! Fully agree about using dir as a keyword.

I haven't gone back and looked at the code, but I think the idea was to infer the "reading" direction from the line direction. Anyways, new params are fine with me, although I would suggest something like natural_char_direction instead of char_dir_output. Feels a little more intuitive.

jsvine commented 6 months ago

Thanks for the response! Re. natural_char_direction: Interesting, I find that less intuitive. In my mind, at least, it raises the question: "natural" to what? Keeping on riffing: What about render_char_direction / char_direction_render?

afriedman412 commented 5 months ago

> Thanks for the response! Re. natural_char_direction: Interesting, I find that less intuitive. In my mind, at least, it raises the question: "natural" to what? Keeping on riffing: What about render_char_direction / char_direction_render?

yo in the interest of expediency I'm fine with whatever you think is best!

jsvine commented 4 months ago

Now added in https://github.com/jsvine/pdfplumber/commit/850fd45 and available in v0.11.0. Many thanks again!